sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
201 stars 45 forks

Retrieve (also) the CAP measure for each instance rather than just the overall score #531

Open yleniarotalinti opened 9 months ago

yleniarotalinti commented 9 months ago

Problem Description

I would like to know the privacy risk for each instance rather than just the overall CAP score.

Expected behavior

Implement a function that retrieves the privacy risk for each instance rather than just the overall CAP score, or expose that information as a class attribute that the user can store and use.

npatki commented 9 months ago

Hi @yleniarotalinti, thanks for filing this feature request. We can keep it open and use it to track any progress we make.

In your feature request, does "instance" refer to a row?

If so, you can achieve this by passing only one row of real data into a CAP metric such as CategoricalCAP or, as in the snippet below, CategoricalZeroCAP. You can then inspect and save a score for each row separately. (The overall score you receive with the full dataset is the average of the per-row scores.)

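from sdmetrics.single_table import CategoricalZeroCAP

ROW_NUMBER = 0
# Double brackets keep the selection as a one-row DataFrame instead of a Series
real_row = real_data.iloc[[ROW_NUMBER]]

# Score this single real row against the full synthetic dataset
row_score = CategoricalZeroCAP.compute(
    real_data=real_row,
    synthetic_data=synthetic_data,
    key_fields=<your list of key fields>,
    sensitive_fields=<your list of sensitive fields>
)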

# TODO: loop through all possible row numbers
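
The loop left as a TODO could be sketched roughly as follows. This is a minimal sketch, not part of SDMetrics: `per_row_scores` is a hypothetical helper name, and it assumes `real_data` is a pandas DataFrame (anything with `len()` and `.iloc`) and that `compute` is a metric's `compute` method, such as `CategoricalZeroCAP.compute`, accepting the keyword arguments used in the snippet above.

```python
def per_row_scores(compute, real_data, synthetic_data, key_fields, sensitive_fields):
    """Hypothetical helper: score each real row separately.

    `compute` is assumed to be a metric's compute method
    (for example CategoricalZeroCAP.compute).
    """
    scores = []
    for row_number in range(len(real_data)):
        # Double brackets keep the row as a one-row DataFrame, not a Series
        real_row = real_data.iloc[[row_number]]
        scores.append(
            compute(
                real_data=real_row,
                synthetic_data=synthetic_data,
                key_fields=key_fields,
                sensitive_fields=sensitive_fields,
            )
        )
    # One score per row, in the same order as real_data
    return scores


# Example call (assumes sdmetrics is installed and the data is loaded):
# from sdmetrics.single_table import CategoricalZeroCAP
# scores = per_row_scores(
#     CategoricalZeroCAP.compute, real_data, synthetic_data,
#     key_fields=my_key_fields, sensitive_fields=my_sensitive_fields,
# )
```

Because the metric is called once per row, this will probably be slow on large datasets. If that matters, scoring only a sample of rows may be enough.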

Let me know if this is an acceptable workaround or if there is some other measure you had in mind.

yleniarotalinti commented 9 months ago

Hi Neha, this is actually what I was looking for.

Thanks!