sdcTools / UserSupport

The place to be for User Support on SDC tools and to download the latest releases
https://sdctools.github.io/UserSupport/

sdcMicroGUI settings not carried over correctly to R script #81

Closed bachfab closed 6 years ago

bachfab commented 6 years ago

Please specify
SDC tool used: sdcMicroGUI
Version used: 5.0.5
Operating system used: Windows


Inside the GUI, when I do

Anonymize -> Apply k-anonymity to subsets of key variables? -> Yes -> Apply k-anonimity to all subsets of 4 key variables? -> Yes -> threshold: 10

the following line is exported to the R script under "Reproducibility":

sdcObj <- kAnon(sdcObj, importance=c(7,12,1,3,13,2,11,5,9,10,6,4,8), combs=c(8), k=c(10))

However, if I understood the method and kAnon arguments correctly, it should read combs=c(4).

(I loaded 100 test records with 13 key variables.)
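For scale, here is a small base-R sketch of what the two values would mean in this setup, assuming combs=c(m) refers to all subsets of m key variables, as described above. The names kv1 to kv13 are placeholders, not the real key variables.

```r
# Number of key variables in the test case described above
n_key <- 13

# How many subsets kAnon would have to cover for each value of combs,
# under the assumption that combs = m means "all subsets of m key variables"
n_subsets_4 <- choose(n_key, 4)  # setting chosen in the GUI
n_subsets_8 <- choose(n_key, 8)  # value written to the exported script

# Enumerate the 4-variable subsets explicitly (placeholder names)
key_vars  <- paste0("kv", seq_len(n_key))
subsets_4 <- combn(key_vars, 4, simplify = FALSE)

cat("subsets of size 4:", n_subsets_4, "\n")
cat("subsets of size 8:", n_subsets_8, "\n")
```

So the exported combs=c(8) would describe a different set of variable combinations than the one selected in the GUI.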

Btw., there is a typo in the GUI tab "Anonymize", namely in the particular setting used above: when selecting "Apply k-anonymity to subsets of key variables?" -> Yes, in the subsequent expanded lines it should read k-anonymity instead of k-anonimity.

bachfab commented 6 years ago

Also, the importance=... argument does not correctly reflect the priorities assigned in the GUI under Anonymity -> k-anonymity. This assumes that the order of priority values in the importance vector should match the order of key variables in the keyVars vector.

bernhard-da commented 6 years ago

hi @bachfab

thx for reporting. i fixed the typo but can't reproduce the remaining issue. for me it states as expected:

sdcObj <- kAnon(sdcObj, importance=c(1,4,5,3,6,2), combs=c(4), k=c(10))

you mentioned you were using some dummy test-data. could you export the problem instance just before trying to establish k-anonymity in Reproducibility -> Export/Save the current sdcProblem and link to this file somewhere?

as to your second comment: if you're not changing the importance directly, we prefer to suppress values starting with the key variable that has the most characteristics (highest importance) and working down to the one with the fewest, ignoring the "order" of their occurrence in the data set.
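a rough base-R sketch of that default ordering, only to illustrate the idea. this is not the actual sdcMicro code, and the toy data, column names and tie-breaking rule are assumptions:

```r
# Toy data with three categorical key variables (hypothetical names)
dat <- data.frame(
  a = c("x", "y", "x", "y", "x", "y"),  # 2 distinct values
  b = c("p", "q", "r", "s", "t", "p"),  # 5 distinct values
  c = c("u", "v", "w", "u", "v", "w"),  # 3 distinct values
  stringsAsFactors = TRUE
)

# Count the number of characteristics (distinct values) per key variable
n_cat <- vapply(dat, function(col) length(unique(col)), integer(1))

# Rank ascending: the variable with the most characteristics gets the
# highest importance value and is the one preferred for suppression.
# Breaking ties by position is an assumption made for this sketch.
importance <- rank(n_cat, ties.method = "first")

print(importance)
```

here b has the most characteristics, so it receives the highest value regardless of where it sits in the data set.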

bachfab commented 6 years ago

Hi Bernhard,

Yes I said "dummy" – but actually it was real microdata, just taken some 100 records instead of a full dataset. I'm checking with Aleksandra now how to produce an example case… Meanwhile 2 more things:

  1. Thank you for sharing the snapshot – I confirm it works well now!

  2. I realized that writeSafeFile used with format="sas" produces .sas7bdat files that give an error when trying to open them with SAS EPG (and when trying to use them as input to another SAS program). I also confirm the problem's already there with the write_sas function from the haven lib apparently used by sdcMicro (and Google indicates it's already known for haven), hence no idea if you want to make this one of your own issues…

All the best, Fabian

From: bernhard-da [mailto:notifications@github.com] Sent: Monday, December 18, 2017 9:16 PM To: sdcTools/UserSupport Cc: BACH Fabian (ESTAT); Mention Subject: Re: [sdcTools/UserSupport] sdcMicroGUI settings not carried over correctly to R script (#81)


bernhard-da commented 6 years ago

@bachfab ok, you could try to reproduce the problem with some kind of dataset you create (eg using random numbers) and which you can easily share
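a minimal sketch of how such a shareable dataset could be generated, assuming 100 records and 13 categorical key variables as in the report. the seed, the variable names and the number of levels are arbitrary choices, and the sdcMicro calls at the end only run if the package is installed:

```r
set.seed(81)   # arbitrary seed so the data can be regenerated

n_rec <- 100   # number of records, as in the report
n_key <- 13    # number of categorical key variables, as in the report

# Give each key variable between 2 and 6 levels (arbitrary choice)
n_levels <- sample(2:6, n_key, replace = TRUE)

# Draw random categories for each variable and store them as factors
cols <- lapply(n_levels, function(k) {
  factor(sample(seq_len(k), n_rec, replace = TRUE), levels = seq_len(k))
})
names(cols) <- paste0("kv", seq_len(n_key))  # placeholder variable names
dat <- as.data.frame(cols)

str(dat)

# Optional: set up the problem directly in R for comparison with the
# script exported by the GUI
if (requireNamespace("sdcMicro", quietly = TRUE)) {
  sdcObj <- sdcMicro::createSdcObj(dat, keyVars = names(dat))
  sdcObj <- sdcMicro::kAnon(sdcObj, combs = c(4), k = c(10))
}
```

the resulting data frame contains no real microdata, so it can be attached to an issue without confidentiality concerns.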

as for the second issue: no, this needs to be fixed in haven::write_sas

bachfab commented 6 years ago

As announced a minute ago:

DummyTestProblemForBernhard

Best, Fabian


bernhard-da commented 6 years ago

@bachfab thx for providing the inputs. i've identified and fixed the issue in the next-branch. however, the problem only occurred if >= 10 categorical key variables had been specified.