statisticsnorway / ssb-gausssuppression

R-package to protect tables by suppression using the Gaussian elimination algorithm
https://statisticsnorway.github.io/ssb-gausssuppression/

Bug: PrimaryDefault and NContributorsRule not working when respective maxN parameters differ (in some cases) #120

Closed jnlb closed 1 week ago

jnlb commented 4 weeks ago

This issue is regarding the GaussSuppressionFromData function.

A typical cell suppression case may apply primary suppression via the frequency rule both to statistical units and to some contributor ID (contributorVar or charVar). Sometimes the threshold maxN differs between these rules. Examples from the documentation suggest supplying the maxN parameter as a named list, e.g. maxN = list(charVar1 = 3, charVar2 = 5).

The problem is that when PrimaryDefault is used in tandem with NContributorsRule, the normal frequency rule may not be registered. This is not a big issue, since a custom frequency check is extremely easy to program, but it is still a bug that should be looked into, especially since the suppression pattern behaves in an extremely odd way. Here is a minimal example:

library(SSBtools)
library(GaussSuppression)

dataset <- SSBtoolsData("magnitude1")

out1 <- GaussSuppressionFromData(
  data = dataset,
  numVar = "value",
  freqVarNew = "freq",
  dimVar = c("sector4", "geo"),
  charVar = "company",
  maxN = list(freq = 2, company = 0),
  primary = c(PrimaryDefault, NContributorsRule),
  protectZeros = FALSE,
  preAggregate = TRUE,
  singletonMethod = "none"
)

The above code outputs the following suppression pattern:

> out1
         sector4      geo freq value nRule nAll primary suppressed
1          Total    Total   20 462.3     4    4   FALSE      FALSE
2          Total  Iceland    4  37.1     3    3   FALSE      FALSE
3          Total Portugal    8 162.5     3    3   FALSE      FALSE
4          Total    Spain    8 262.7     4    4   FALSE      FALSE
5    Agriculture    Total    4 240.2     2    2    TRUE       TRUE
6    Agriculture  Iceland    0   0.0     0    0   FALSE      FALSE
7    Agriculture Portugal    2 100.4     2    2    TRUE       TRUE
8    Agriculture    Spain    2 139.8     2    2   FALSE      FALSE
9  Entertainment    Total    6 131.5     4    4   FALSE      FALSE
10 Entertainment  Iceland    1  16.8     1    1   FALSE       TRUE
11 Entertainment Portugal    2   9.4     2    2    TRUE       TRUE
12 Entertainment    Spain    3 105.3     3    3   FALSE      FALSE
13  Governmental    Total    4  32.8     3    3   FALSE       TRUE
14  Governmental  Iceland    0   0.0     0    0   FALSE      FALSE
15  Governmental Portugal    2  23.6     2    2    TRUE       TRUE
16  Governmental    Spain    2   9.2     2    2   FALSE      FALSE
17      Industry    Total    6  57.8     3    3   FALSE      FALSE
18      Industry  Iceland    3  20.3     3    3   FALSE       TRUE
19      Industry Portugal    2  29.1     2    2    TRUE       TRUE
20      Industry    Spain    1   8.4     1    1   FALSE      FALSE

This is clearly not correct: rows 10 and 20 have frequency 1, yet neither is flagged as primary.

Here is what happens when you set maxN = 1 for frequencies.

out3 <- GaussSuppressionFromData(
  data = dataset,
  numVar = "value",
  freqVarNew = "freq",
  dimVar = c("sector4", "geo"),
  charVar = "company",
  maxN = list(freq = 1, company = 0),
  primary = c(PrimaryDefault, NContributorsRule),
  protectZeros = FALSE,
  preAggregate = TRUE,
  singletonMethod = "none"
)
> out3
         sector4      geo freq value nRule nAll primary suppressed
1          Total    Total   20 462.3     4    4   FALSE      FALSE
2          Total  Iceland    4  37.1     3    3   FALSE      FALSE
3          Total Portugal    8 162.5     3    3   FALSE      FALSE
4          Total    Spain    8 262.7     4    4   FALSE      FALSE
5    Agriculture    Total    4 240.2     2    2   FALSE      FALSE
6    Agriculture  Iceland    0   0.0     0    0   FALSE      FALSE
7    Agriculture Portugal    2 100.4     2    2   FALSE      FALSE
8    Agriculture    Spain    2 139.8     2    2   FALSE      FALSE
9  Entertainment    Total    6 131.5     4    4   FALSE      FALSE
10 Entertainment  Iceland    1  16.8     1    1   FALSE      FALSE
11 Entertainment Portugal    2   9.4     2    2   FALSE      FALSE
12 Entertainment    Spain    3 105.3     3    3   FALSE      FALSE
13  Governmental    Total    4  32.8     3    3   FALSE      FALSE
14  Governmental  Iceland    0   0.0     0    0   FALSE      FALSE
15  Governmental Portugal    2  23.6     2    2   FALSE      FALSE
16  Governmental    Spain    2   9.2     2    2   FALSE      FALSE
17      Industry    Total    6  57.8     3    3   FALSE      FALSE
18      Industry  Iceland    3  20.3     3    3   FALSE      FALSE
19      Industry Portugal    2  29.1     2    2   FALSE      FALSE
20      Industry    Spain    1   8.4     1    1   FALSE      FALSE

In short, the expected behaviour is what the following function call outputs.

out6 <- GaussSuppressionFromData(
  data = dataset,
  numVar = "value",
  freqVarNew = "freq",
  dimVar = c("sector4", "geo"),
  charVar = "company",
  maxN = list(freq = 1, company = 0),
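  primary = c(
    # Custom frequency rule: reads its own threshold from maxN$freq
    function(freq, maxN, protectZeros, ...) {
      primary <- freq <= maxN$freq
      # Zero cells are only flagged when protectZeros is TRUE
      if (!protectZeros) {
        primary[freq == 0] <- FALSE
      }
      primary
    },
    NContributorsRule
  ),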
  protectZeros = FALSE,
  preAggregate = TRUE,
  singletonMethod = "none"
)
> out6
         sector4      geo freq value nRule nAll primary suppressed
1          Total    Total   20 462.3     4    4   FALSE      FALSE
2          Total  Iceland    4  37.1     3    3   FALSE      FALSE
3          Total Portugal    8 162.5     3    3   FALSE      FALSE
4          Total    Spain    8 262.7     4    4   FALSE      FALSE
5    Agriculture    Total    4 240.2     2    2   FALSE      FALSE
6    Agriculture  Iceland    0   0.0     0    0   FALSE      FALSE
7    Agriculture Portugal    2 100.4     2    2   FALSE      FALSE
8    Agriculture    Spain    2 139.8     2    2   FALSE      FALSE
9  Entertainment    Total    6 131.5     4    4   FALSE      FALSE
10 Entertainment  Iceland    1  16.8     1    1    TRUE       TRUE
11 Entertainment Portugal    2   9.4     2    2   FALSE       TRUE
12 Entertainment    Spain    3 105.3     3    3   FALSE      FALSE
13  Governmental    Total    4  32.8     3    3   FALSE      FALSE
14  Governmental  Iceland    0   0.0     0    0   FALSE      FALSE
15  Governmental Portugal    2  23.6     2    2   FALSE       TRUE
16  Governmental    Spain    2   9.2     2    2   FALSE       TRUE
17      Industry    Total    6  57.8     3    3   FALSE      FALSE
18      Industry  Iceland    3  20.3     3    3   FALSE       TRUE
19      Industry Portugal    2  29.1     2    2   FALSE      FALSE
20      Industry    Spain    1   8.4     1    1    TRUE       TRUE
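
The custom frequency check above can also be tried on its own, outside GaussSuppressionFromData, to confirm which cells it flags. The following is only a sketch: the name freq_check is made up for illustration, and the input vector is a hand-picked subset of the freq column shown above (rows 1, 2, 6, 7, 10, 12 and 20).

```r
# Hypothetical standalone copy of the custom frequency rule from the call above.
freq_check <- function(freq, maxN, protectZeros, ...) {
  # Flag cells whose frequency is at or below the frequency threshold
  primary <- freq <= maxN$freq
  # Zero cells are only flagged when protectZeros is TRUE
  if (!protectZeros) {
    primary[freq == 0] <- FALSE
  }
  primary
}

# Frequencies copied from rows 1, 2, 6, 7, 10, 12 and 20 of the table
freq <- c(20, 4, 0, 2, 1, 3, 1)

flags <- freq_check(freq, maxN = list(freq = 1, company = 0), protectZeros = FALSE)
print(flags)
```

Only the two cells with frequency 1 are flagged, which matches the primary column in the out6 table for rows 10 and 20.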

Tested on an environment with:

I have made no progress so far on figuring out what causes this issue.

olangsrud commented 4 weeks ago

The problem is that the parameter name maxN is the same in two functions with different meanings. So another workaround is to change one of them.

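# Wrapper exposing the frequency threshold under a different name (maxN_).
# The maxN argument is captured here so that it is not forwarded to PrimaryDefault.
PrimaryDefault_ <- function(maxN_ = 3, maxN, ...) {
  PrimaryDefault(maxN = maxN_, ...)
}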

GaussSuppressionFromData(
  data = dataset,
  numVar = "value",
  freqVarNew = "freq",
  dimVar = c("sector4", "geo"),
  charVar = "company",
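  maxN_ = 1,  # frequency threshold, picked up by PrimaryDefault_
  maxN = 0,   # contributor threshold, picked up by NContributorsRule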
  primary = c(PrimaryDefault_, NContributorsRule),
  protectZeros = FALSE,
  preAggregate = TRUE,
  singletonMethod = "none"
)

But since that's the case (same parameter name), I agree that there should be a change that avoids the workaround. Thank you.
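
The name clash and the renaming workaround can be illustrated in plain R without the package. This is only a sketch under stated assumptions: freq_rule, contributor_rule and apply_rules are hypothetical stand-ins, and the toy dispatcher simply forwards the same arguments to every rule, which may differ from how GaussSuppressionFromData passes parameters internally.

```r
# Hypothetical stand-ins for two primary rules that both use the name maxN.
freq_rule <- function(freq, maxN = 3, ...) freq <= maxN
contributor_rule <- function(nContributors, maxN = 3, ...) nContributors <= maxN

# Toy dispatcher (an assumption, not the package's code):
# every rule receives the same arguments and the flags are combined with OR.
apply_rules <- function(rules, ...) {
  flags <- lapply(rules, function(rule) rule(...))
  Reduce(`|`, flags)
}

freq <- c(1, 2, 3)
nContributors <- c(1, 2, 3)

# With a single shared maxN, both rules see the same threshold (0 here),
# so the frequency rule cannot be given its own threshold of 1.
shared <- apply_rules(list(freq_rule, contributor_rule),
                      freq = freq, nContributors = nContributors, maxN = 0)

# Renaming wrapper in the style of PrimaryDefault_: maxN_ carries the
# frequency threshold, while maxN is captured and not forwarded.
freq_rule_ <- function(maxN_ = 3, maxN, ...) freq_rule(maxN = maxN_, ...)

separate <- apply_rules(list(freq_rule_, contributor_rule),
                        freq = freq, nContributors = nContributors,
                        maxN_ = 1, maxN = 0)

print(shared)
print(separate)
```

In this sketch the shared call flags nothing, while the wrapped call flags only the cell with frequency 1, because each rule now reads its own threshold.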

olangsrud commented 3 weeks ago

The improvement is now available in the latest CRAN release.
I believe it should now work as intended.
See NEWS.

jnlb commented 1 week ago

I have now familiarised myself with the CRAN release and checked that the new feature works as I had hoped. Thank you for the very fast reaction!

Since no issue is left to address here, I will close this one with this comment.