sdcTools / UserSupport

The place to be for User Support on SDC tools and to download the latest releases
https://sdctools.github.io/UserSupport/

sdcTable: Issue with approximate disclosure in suppressed cells #160

Open KarinaKelleher opened 3 years ago

KarinaKelleher commented 3 years ago

SDC tool used: sdcTable
Version used: 0.31
Operating system used: Windows 10

My colleague Tim Linehan and I came across this when testing different suppression methods on business data.

Below we replicate the scenario using synthetic data.

Issue Details

Some 2-digit NACE groups were primary suppressed due to the (2,80) dominance rule. However, across these suppressed NACE groups (B05, B06 and B07), the largest enterprise (EN) still accounts for more than 80% (indeed more than 90%) of their combined total, so some form of additional secondary suppression is required to avoid approximate disclosure. B08 and B09 are the two remaining NACE groups that may be selected for suppression. Ideally, B08 should be selected: if B09 is suppressed instead, the dominant enterprise still contributes more than 85% of the combined turnover of the suppressed groups.
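The check described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how sdcTable or the SAS program works. The turnover of the dominant enterprise (99500) and the combined total of B05, B06 and B07 (110350) are the figures from the sample data; the cell totals for B08 and B09 are hypothetical values chosen for illustration, and the sketch assumes the dominant enterprise contributes nothing to those two cells.

```python
def largest_share(largest, combined_total):
    """Share of the combined suppressed total held by the largest contributor."""
    return largest / combined_total


def protective_candidates(largest, suppressed_total, candidates, max_share):
    """Names of candidate cells whose suppression brings the share to max_share or below."""
    return [
        name
        for name, cell_total in candidates.items()
        if largest_share(largest, suppressed_total + cell_total) <= max_share
    ]


# Figures from the issue: turnover of EN00000035 and the B05 + B06 + B07 total.
LARGEST = 99_500
SUPPRESSED_TOTAL = 110_350

# HYPOTHETICAL cell totals, not taken from the sample data.
CANDIDATES = {"B08": 40_000, "B09": 5_000}

share_before = largest_share(LARGEST, SUPPRESSED_TOTAL)
safe = protective_candidates(LARGEST, SUPPRESSED_TOTAL, CANDIDATES, max_share=0.85)

print(f"share before additional suppression: {share_before:.3f}")
print(f"candidates that bring the share to 85% or below: {safe}")
```

With these assumed totals, only adding B08 dilutes the dominant enterprise's share enough, which mirrors the behaviour described in the issue.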

This issue is currently tested for and picked up by a bespoke SAS program in use in the relevant Business area.

While we would not expect the SIMPLEHEURISTIC method to pick this up, we would expect the HYPERCUBE and perhaps the HITAS methods to do so.

Sample data

In the accompanying sample synthetic data (samp_synth_data.csv; note: saved as an .xlsx file so that it could be uploaded to this page) we have the following:

• Three NACE groups (B05, B06 and B07) at 2-digit NACE level which require primary suppression due to (2,80) dominance rule.

• Turnover for EN00000035 > 90% of combined turnover of the suppressed groups B05, B06 and B07 (i.e. EUR99500 > 90% of EUR110350)
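For readers unfamiliar with the (n,k) dominance rule mentioned in the first bullet, here is a minimal Python sketch. It assumes the usual convention that a cell is unsafe when its n largest contributions together exceed k% of the cell total. The function name and the contribution values are hypothetical and are not taken from the sample data or from sdcTable.

```python
def violates_nk_dominance(contributions, n=2, k=80.0):
    """Return True if the n largest contributions exceed k percent of the cell total."""
    total = sum(contributions)
    if total <= 0:
        # An empty or zero-valued cell has no dominant contributor.
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


# HYPOTHETICAL contributions for two cells.
dominated_cell = [900, 60, 40]    # top two hold 960 of 1000
balanced_cell = [400, 300, 300]   # top two hold 700 of 1000

print(violates_nk_dominance(dominated_cell))
print(violates_nk_dominance(balanced_cell))
```

The same function can be applied to the pooled contributions of several suppressed cells, which is the situation that gives rise to approximate disclosure in this issue.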

image

Results

The following is a summary of the results we expected and the results we obtained when using the different sdcTable secondary suppression methods:

image

Conclusion

Given the above scenario, approximate disclosure would still be an issue after primary and secondary suppression with sdcTable, whichever of the above methods is used.

samp_synth_data.xlsx

ppdewolf commented 3 years ago

This is what I get using Modular (= HiTaS) in tau-argus 4.2.0b5 (red numbers are primary suppressions, blue numbers are secondary suppressions). I am not sure which HiTaS implementation is used in sdcTable. Could you share the full R script you are using to apply HITAS to your synthetic data example?

image

bernhard-da commented 3 years ago

Hi @KarinaKelleher, thanks for sharing this. First off: as stated in the notes of the protectTable() man page, I would not suggest using HITAS, HYPERCUBE (OPT) in production, as they contain old, mostly untested code. In this case it is better to create the inputs for Argus via createArgusInput() and solve with Argus.

As for "SIMPLEHEURISTIC": as you already indicated, the algorithm is based on cell frequencies/weights and only protects against exact disclosure. The algorithm is stupid in the sense that it does not "know" anything about dominance rules such as the one in this case. I would, however, be open to suggestions on how to parametrize this feature. It could be done with an additional parameter like protectTable(..., ensureDominance=c(type = "nk", n = 2, k = 80)) which, if specified, would find additional required suppressions in a way similar to what the singleton-detection procedure already does.

But I have to think more about it. In any case, as @ppdewolf mentioned, it would be great if you could share a fully reproducible example.

KarinaKelleher commented 3 years ago

Hi, thank you both for the replies. Yes, as noted above, the relevant cells are suppressed when using TauArgus. However, they are not suppressed in sdcTable when the following was used (nor when HYPERCUBE or SIMPLEHEURISTIC was used):

image image image image

And for completeness I'm also attaching the final file with SDC applied:

SAMP_sdcapplied_HITAS.xlsx