sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/

Calculation of SUDA scores #227

Closed RaffaelBild closed 7 years ago

RaffaelBild commented 7 years ago

Page 30 of the book "Statistical Disclosure Control for Microdata: A Practice Guide", which can be accessed at http://www.ihsn.org/projects/sdc-practice, presents an example illustrating the calculation of SUDA scores (Table 4.5).

I tried to reproduce the example using the input

Residence;Gender;Education_level;Labor_status
Urban;Female;Secondary_incomplete;Employed
Urban;Female;Secondary_incomplete;Employed
Urban;Female;Primary_incomplete;Non-LF
Urban;Male;Secondary_complete;Employed
Rural;Female;Secondary_complete;Unemployed
Urban;Male;Secondary_complete;Employed
Urban;Female;Primary_complete;Non-LF
Urban;Male;Post-secondary;Unemployed
Urban;Female;Secondary_incomplete;Non-LF
Urban;Female;Secondary_incomplete;Non-LF

and the following R code:

require(sdcMicro)
tab <- readMicrodata(path="C:/temp/input.csv", type="csv", header=TRUE, sep=";")
su <- suda2(tab)
su$score

I have obtained the following output:

0.00 0.00 1.75 0.00 3.25 0.00 1.75 2.75 0.00 0.00

However, the example in the book reports the SUDA scores

0.00 0.00 6.00 0.00 12.00 0.00 6.00 10.00 0.00 0.00

As can be seen, all scores which are greater than zero differ. In particular, sdcMicro reports a SUDA score of 3.25 for record number 5 (Rural, Female, Secondary_complete, Unemployed) rather than the value of 12 reported in the book.

The book explains the computation of the value 12 as follows: Record number 5 contains four MSUs, one of size one and three of size two [1]. For the MSU of size one, a score of (#attributes - 1) * (#attributes - 2) * (#attributes - 3) = 3 * 2 * 1 = 6 is computed. For each MSU of size two, a score of (#attributes - 2) * (#attributes - 3) = 2 * 1 = 2 is calculated. The SUDA score of record number 5 is then the sum of the scores of the MSUs it contains: 6 + 2 + 2 + 2 = 12. This way of computing SUDA scores is also described in the paper "SUDA: A Program for Detecting Special Uniques" by Elliot et al.

I'd really appreciate it if someone could explain how the computation in sdcMicro differs from this, so that e.g. a SUDA score of 3.25 is obtained for record number 5.

[1] The MSUs contained in record 5 are {Rural}, {Secondary_complete, Unemployed}, {Female, Unemployed} and {Female, Secondary_complete}

thijsbenschop commented 7 years ago

In the Elliot et al. 2005 paper "SUDA: A Program for Detecting Special Uniques" the SUDA score for an MSU of size i is defined as

[image: screenshot 2017-07-19 at 12.13.33]

This yields the following contributions to the SUDA score for MSUs of sizes 1 to 3: [image: screenshot 2017-07-19 at 12.15.45]
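The two screenshots are not preserved in this copy of the thread. As a reconstruction (not a copy of the lost images), the worked example in the question is consistent with the following form of the Elliot et al. score for an MSU of size i. Here ATT is the number of key variables and M is the maximum MSU size considered; M = 3 is an assumption inferred from the three factors the question uses for a size-one MSU.

```latex
\[
  \mathrm{score}(i) \;=\; \prod_{j=i}^{M} (\mathrm{ATT} - j)
\]
% With ATT = 4 and M = 3 (assumed, see above):
\[
  \mathrm{score}(1) = 3 \cdot 2 \cdot 1 = 6, \qquad
  \mathrm{score}(2) = 2 \cdot 1 = 2, \qquad
  \mathrm{score}(3) = 1
\]
```

The values for sizes 1 and 2 are the ones quoted in the question (6 and 2); the value for size 3 follows from the same product.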

In sdcMicro the contribution of an MSU to the SUDA score is defined by the C++ code in the file Suda2.h. The score for an MSU of size i is computed in lines 1091-1116.

This boils down to: [image: screenshot 2017-07-19 at 12.42.10]

where n is the number of variables on which SUDA is computed (ATT in the computation above). A product evaluates to 1 if its index range (over j or k) is empty. This definition differs from the one in Elliot et al.

If we repeat the computation of the SUDA scores for the example on page 30 of "Statistical Disclosure Control for Microdata: A Practice Guide", the SUDA score contributions per MSU of size i are as follows (in this case n = 4, as we use four variables to compute the SUDA scores):

[image: screenshot 2017-07-19 at 12.47.11]

This yields the scores as computed in the guide for records 3, 5, 7 and 8: [image: screenshot 2017-07-19 at 12.14.04]

This means that sdcMicro finds the same MSUs as with the Elliot algorithm, but computes the score contribution of an MSU of size i differently.
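To make the Elliot-style computation concrete, here is a minimal base-R sketch that finds the MSUs of each record by brute force and sums their contributions. It is not the SUDA2 algorithm used inside sdcMicro; the helper names (`is_unique`, `elliot_msu_score`, `suda_score`) are hypothetical, and the maximum MSU size of 3 is an assumption taken from the worked example in the question.

```r
# Example data from the question: ten records, four key variables.
tab <- data.frame(
  Residence = c("Urban", "Urban", "Urban", "Urban", "Rural",
                "Urban", "Urban", "Urban", "Urban", "Urban"),
  Gender = c("Female", "Female", "Female", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  Education_level = c("Secondary_incomplete", "Secondary_incomplete",
                      "Primary_incomplete", "Secondary_complete",
                      "Secondary_complete", "Secondary_complete",
                      "Primary_complete", "Post-secondary",
                      "Secondary_incomplete", "Secondary_incomplete"),
  Labor_status = c("Employed", "Employed", "Non-LF", "Employed", "Unemployed",
                   "Employed", "Non-LF", "Unemployed", "Non-LF", "Non-LF"),
  stringsAsFactors = FALSE
)

n_attr   <- ncol(tab)   # ATT = 4
max_size <- n_attr - 1  # assumption: maximum MSU size M = 3

# TRUE if the values of record r on columns `cols` occur in no other record.
is_unique <- function(r, cols) {
  key <- do.call(paste, c(tab[, cols, drop = FALSE], sep = "\r"))
  sum(key == key[r]) == 1
}

# Elliot-style contribution of one MSU of the given size.
elliot_msu_score <- function(size) prod(n_attr - size:max_size)

# Sum the contributions of all minimal sample uniques (MSUs) of record r.
suda_score <- function(r) {
  total <- 0
  for (size in 1:max_size) {
    for (cols in combn(n_attr, size, simplify = FALSE)) {
      if (!is_unique(r, cols)) next
      # A unique set is minimal if dropping any single variable destroys
      # uniqueness (supersets of a unique set are always unique).
      minimal <- size == 1 ||
        !any(vapply(seq_along(cols),
                    function(k) is_unique(r, cols[-k]), logical(1)))
      if (minimal) total <- total + elliot_msu_score(size)
    }
  }
  total
}

scores <- vapply(seq_len(nrow(tab)), suda_score, numeric(1))
print(scores)  # 0 0 6 0 12 0 6 10 0 0, the values reported in the book
```

With the same MSUs, the sdcMicro output quoted in the question is consistent with per-MSU contributions of 1.75 for size 1 and 0.5 for size 2: record 3 has a single size-one MSU (1.75), record 5 gives 1.75 + 3 * 0.5 = 3.25, and record 8 gives 1.75 + 2 * 0.5 = 2.75.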

bernhard-da commented 7 years ago

many thx @thijsbenschop for answering this!

we'll probably update the underlying computation in the future so that users may opt to get the same results as in the elliot-paper.

bernhard-da commented 7 years ago

suda2() gained an optional argument original_scores. If it is set to TRUE, the SUDA scores are computed as described in the original paper by Elliot et al.; if it is FALSE (currently the default), the scores are computed as previously done in sdcMicro.

any feedback on this is very much appreciated.
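A short sketch of how the new argument might be used on the example from the question. The data frame is built inline instead of being read from a CSV file, and the columns are stored as factors, which is an assumption about what suda2() expects for categorical keys. original_scores is passed explicitly in both calls so the result does not depend on the installed version's default.

```r
# Example data from the question, stored as factors (assumption, see above).
tab <- data.frame(
  Residence = c("Urban", "Urban", "Urban", "Urban", "Rural",
                "Urban", "Urban", "Urban", "Urban", "Urban"),
  Gender = c("Female", "Female", "Female", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  Education_level = c("Secondary_incomplete", "Secondary_incomplete",
                      "Primary_incomplete", "Secondary_complete",
                      "Secondary_complete", "Secondary_complete",
                      "Primary_complete", "Post-secondary",
                      "Secondary_incomplete", "Secondary_incomplete"),
  Labor_status = c("Employed", "Employed", "Non-LF", "Employed", "Unemployed",
                   "Employed", "Non-LF", "Unemployed", "Non-LF", "Non-LF"),
  stringsAsFactors = TRUE
)

# Only run the comparison if sdcMicro (with original_scores) is installed.
if (requireNamespace("sdcMicro", quietly = TRUE)) {
  su_elliot   <- sdcMicro::suda2(tab, original_scores = TRUE)   # Elliot et al.
  su_previous <- sdcMicro::suda2(tab, original_scores = FALSE)  # earlier sdcMicro scoring
  print(su_elliot$score)
  print(su_previous$score)
}
```

Comparing the two printed vectors with the values in the question is a quick way to check which scoring an installed version applies.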

RaffaelBild commented 7 years ago

Thanks a lot for the detailed explanation @thijsbenschop! This makes the computations performed by sdcMicro completely clear indeed.

And many thanks @bernhard-da for implementing the computation as described by Elliot et al. This is a valuable extension of sdcMicro in my opinion.

thijsbenschop commented 7 years ago

@bernhard-da, thank you for adding the Elliot method too. In my opinion, the Elliot method should be the default. I would rather remove the current default method, if it isn't described in the literature. In all the literature on SUDA and DIS-SUDA that I'm aware of, the Elliot method is used.

bernhard-da commented 7 years ago

@matthias-da I'd propose to set the default value of original_scores in suda2() in a way that the results equal those from the original elliot paper. any objections?

matthias-da commented 7 years ago

thanks for all this. I have no objections to it, either is fine with me (default to Elliot or not). Best, Matthias

bernhard-da commented 7 years ago

@matthias-da thx, default will be elliot in next-version.