Closed RaffaelBild closed 7 years ago
In the Elliot et al. 2005 paper "SUDA: A Program for Detecting Special Uniques" the SUDA score for an MSU of size i is defined as
This yields the following contributions to the SUDA score for MSUs of sizes 1 to 3:
In sdcMicro the contribution of an MSU to the SUDA score is defined by the C++ code in the file Suda2.h. The score for an MSU of size i is computed in lines 1091-1116.
This boils down to:
where n is the number of variables SUDA is computed on (ATT in the above computation), and each product evaluates to 1 if no j or k lies in the range specified by that product. This is different from the definition in Elliot et al.
If we repeat the computation of the SUDA scores for the example on page 30 of "Statistical Disclosure Control for Microdata: A Practice Guide", the SUDA score contributions per MSU of size i are as follows (in this case n = 4, as we use four variables to compute the SUDA scores):
This yields the scores as computed in the guide for records 3, 5, 7 and 8:
This means that sdcMicro finds the same MSUs as with the Elliot algorithm, but computes the score contribution of an MSU of size i differently.
many thx @thijsbenschop for answering this!
we'll probably update the underlying computation in the future so that users may opt to get the same results as in the elliot-paper.
`suda2()` gained an optional argument `original_scores`. If this is set to `TRUE`, the computation of the SUDA scores will be just as described in the original paper by Elliot; if `FALSE` (currently the default), the computation of the scores is as previously done in sdcMicro.
any feedback on this is very much appreciated.
Thanks a lot for the detailed explanation @thijsbenschop! This makes the computations performed by sdcMicro completely clear indeed.
And many thanks @bernhard-da for implementing the computation as described by Elliot et al. This is a valuable extension of sdcMicro in my opinion.
@bernhard-da, thank you for adding the Elliot method too. In my opinion, the Elliot method should be the default. I would rather remove the current default method, if it isn't described in the literature. In all the literature on SUDA and DIS-SUDA that I'm aware of, the Elliot method is used.
@matthias-da I'd propose to set the default value of `original_scores` in `suda2()` in a way that the results equal those from the original Elliot paper. any objections?
thanks for all this. I have no objections on it, both is fine with me (default to Elliot or not). Best, Matthias
@matthias-da thx, default will be elliot in next-version.
On page 30 of the book "Statistical Disclosure Control for Microdata: A Practice Guide", which can be accessed at http://www.ihsn.org/projects/sdc-practice, an example illustrating the calculation of SUDA scores is presented (Table 4.5).
I tried to reproduce the example using the input
and the following R code:
I have obtained the following output:
0.00 0.00 1.75 0.00 3.25 0.00 1.75 2.75 0.00 0.00
However, the example in the book reports the SUDA scores
0.00 0.00 6.00 0.00 12.00 0.00 6.00 10.00 0.00 0.00
As can be seen, all scores which are greater than zero differ. In particular, sdcMicro reports a SUDA score of 3.25 for record number 5 (Rural, Female, Secondary_complete, Unemployed) rather than the value of 12 reported in the book.
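To make the discrepancy concrete, here is a small Python sketch that compares the two score vectors quoted above. It only does arithmetic on the numbers from this thread and does not call sdcMicro; the variable names are made up for illustration.

```python
# Scores copied from the thread; records are numbered 1 to 10.
sdcmicro_scores = [0.00, 0.00, 1.75, 0.00, 3.25, 0.00, 1.75, 2.75, 0.00, 0.00]
book_scores = [0.00, 0.00, 6.00, 0.00, 12.00, 0.00, 6.00, 10.00, 0.00, 0.00]

# Records for which the two sources report different scores.
differing_records = [
    record
    for record, (s, b) in enumerate(zip(sdcmicro_scores, book_scores), start=1)
    if s != b
]

# Records with a nonzero score according to each source.
flagged_sdcmicro = [r for r, s in enumerate(sdcmicro_scores, start=1) if s > 0]
flagged_book = [r for r, b in enumerate(book_scores, start=1) if b > 0]

# Ratio book / sdcMicro for the differing records.
ratios = {r: book_scores[r - 1] / sdcmicro_scores[r - 1] for r in differing_records}

print("differing records:", differing_records)
print("same records flagged:", flagged_sdcmicro == flagged_book)
for record, ratio in ratios.items():
    print(record, round(ratio, 3))
```

Both sources flag the same records (3, 5, 7 and 8), but the ratio between the book score and the sdcMicro score is not constant across them (roughly 3.43, 3.69 and 3.64), so the difference cannot be a simple rescaling of the final scores.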
The book explains the computation of the value 12 by way of example as follows: Record number 5 contains four MSUs, one of size one and three of size two [1]. For the MSU of size one, a score of
(#attributes - 1) * (#attributes - 2) * (#attributes - 3) = 3 * 2 * 1 = 6
is computed. For each MSU of size two, a score of
(#attributes - 2) * (#attributes - 3) = 2 * 1 = 2
is calculated. The SUDA score of record number 5 is then the sum of the scores of the MSUs it contains:
6 + 2 + 2 + 2 = 12
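The arithmetic above can be sketched in Python as a quick check. This is only an illustration, not sdcMicro's or the book's code: the function name `elliot_msu_score` and the parameter `max_msu_size` are hypothetical, and the sketch assumes the product runs from the MSU size up to a maximum MSU size of 3 (one less than the four attributes), which is what the worked example implies.

```python
from math import prod


def elliot_msu_score(msu_size, n_attributes, max_msu_size):
    """Score of one MSU: product of (n_attributes - j) for j = msu_size..max_msu_size.

    Assumption: this product form is inferred from the worked example in the
    thread, not copied from the paper or from sdcMicro.
    """
    return prod(n_attributes - j for j in range(msu_size, max_msu_size + 1))


n_attributes = 4   # four variables are used in the example
max_msu_size = 3   # assumed: the example's products stop at (#attributes - 3)

# Record 5 contains one MSU of size one and three MSUs of size two.
msu_sizes_record_5 = [1, 2, 2, 2]

per_msu = [elliot_msu_score(s, n_attributes, max_msu_size) for s in msu_sizes_record_5]
record_5_score = sum(per_msu)

print(per_msu)          # per-MSU contributions
print(record_5_score)   # total for record 5
```

Under these assumptions the per-MSU contributions are 6, 2, 2 and 2, which sum to the book's value of 12 for record 5.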
This way of computing SUDA scores is also described in the paper "SUDA: A Program for Detecting Special Uniques" by Elliot et al. I'd really appreciate it if someone could explain how the computation in sdcMicro differs from this, so that, e.g., a SUDA score of 3.25 is obtained for record number 5.
[1] The MSUs contained in record 5 are {Rural}, {Secondary_complete, Unemployed}, {Female, Unemployed} and {Female, Secondary_complete}