sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Questions about CAP scores #532

Closed: limhasic closed this issue 7 months ago

limhasic commented 9 months ago

[1] CAP score at https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/categoricalcap#score
[2] Taub, J. and Eliot, M. (2019). The synthetic data challenge. UNECE: Conference of European Statisticians.

Does the CAP score in [1] mean something different from TCAP?

If the meaning is the same: the CAP score in [1] is described as "We repeat the attack for all rows (r) in the real data and calculate an overall probability of guessing the sensitive column correctly. The metric returns 1 - probability so that a higher score means 'higher privacy.'" Should I interpret this 1 - probability value as TCAP?

Also, are there any criteria for which columns to select for the CAP score?

npatki commented 7 months ago

Hi @limhasic,

From the references that are listed for the CategoricalCAP metric: my understanding is that the metric was created using the original definition of CAP, from reference [1], which looks for exact matches of the key fields. Meanwhile, in [2] the definition is generalized so that if no exact matches are found, approximate matches are accepted.

Whatever matches are found, the metric iterates through each row and reports the overall average across all rows. I have not read the full paper for TCAP -- but if it is doing the same thing, it would appear to be similar to CAP.
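For intuition, here is a minimal sketch of that exact-match, row-by-row computation in plain pandas. This is only an illustration of the idea, not the SDMetrics implementation; the function name, arguments, and the handling of rows with no match are assumptions.

```python
import pandas as pd

def cap_score(real, synthetic, key_fields, sensitive_field):
    """Sketch of exact-match CAP: for each real row, find synthetic rows
    whose key fields match exactly, then measure how often the sensitive
    value could be guessed from those matches. Returns 1 - probability,
    so a higher score means higher privacy."""
    probabilities = []
    for _, row in real.iterrows():
        # Synthetic rows whose key fields exactly match this real row.
        mask = (synthetic[key_fields] == row[key_fields]).all(axis=1)
        matches = synthetic.loc[mask, sensitive_field]
        if matches.empty:
            # One possible convention: no match means the attacker learns
            # nothing, contributing probability 0. SDMetrics may treat
            # unmatched rows differently.
            probabilities.append(0.0)
        else:
            # Fraction of matching rows carrying the true sensitive value.
            probabilities.append((matches == row[sensitive_field]).mean())
    return 1.0 - sum(probabilities) / len(probabilities)
```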

Also, are there any criteria for which columns to select for the CAP score?

Selecting columns as key or sensitive fields depends on your threat model. Any kind of CAP metric requires you to assume that an attacker has access to certain types of columns. Perhaps those values are available in public datasets, or perhaps there was a data leak in the past, etc. It seems project dependent to me.
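As a concrete illustration: suppose your threat model assumes an attacker already knows some demographic columns and wants to guess a medical one. The column and table names below are made up, but the call follows the CategoricalCAP API from the SDMetrics docs:

```python
from sdmetrics.single_table import CategoricalCAP

# key_fields: columns the attacker is assumed to already know.
# sensitive_fields: the column(s) the attacker is trying to guess.
score = CategoricalCAP.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    key_fields=['age_bracket', 'zip_code'],
    sensitive_fields=['diagnosis'],
)
print(score)  # 1.0 = full privacy, 0.0 = attacker always guesses correctly
```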

Hope that helps.

limhasic commented 7 months ago

Thank you for your kind reply