@tombisho thanks for your question;
I guess you are right: if you plan to use the application as you described, there is no way around building new objects for each "query" (subset/hierarchy definition) from the same data source (to which a record key has been applied once in order to keep consistency).
But I am not sure the cell-key method is really the appropriate tool for your task; its real intended use (I guess) is to create a single perturbed (large) base dataset (with a set of given hierarchies) from which a lot of different, but essentially pre-fixed (by the given specs of the hierarchies), tables can be extracted.
For a more general approach, you could try the recordSwapping package and perturb the underlying microdata itself. If enough noise is applied to the microdata, you could consider each table (independent of the filtering/subsetting or hierarchies applied) to be protected.
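To make this concrete, here is a minimal base-R sketch of the idea (all variable names are invented for illustration, and this naive random swap is not the targeted, risk-based algorithm the recordSwapping package actually implements):

```r
set.seed(1)

# toy microdata (all variable names invented for this illustration)
micro <- data.frame(
  region = sample(c("A", "B", "C"), 200, replace = TRUE),
  sex    = sample(c("m", "f"), 200, replace = TRUE),
  income = round(rlnorm(200, meanlog = 10, sdlog = 0.5))
)

# naive random swap: exchange 'region' between 5% of record pairs
# (recordSwapping does this in a targeted, risk-based way instead)
n_swap <- ceiling(0.05 * nrow(micro) / 2)
idx <- matrix(sample(nrow(micro), 2 * n_swap), ncol = 2)
tmp <- micro$region[idx[, 1]]
micro$region[idx[, 1]] <- micro$region[idx[, 2]]
micro$region[idx[, 2]] <- tmp

# multiplicative noise on the continuous variable
micro$income <- micro$income * rnorm(nrow(micro), mean = 1, sd = 0.05)

# any table built from the perturbed microdata now inherits the protection,
# whatever subsets or hierarchies are defined later
aggregate(income ~ region + sex, data = micro, FUN = sum)
```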
Hope these comments are somewhat helpful...
Hi @bernhard-da
Yes, this is some very useful feedback. I can see that the cell-key method is more aimed at the one-shot data release, where some careful preparatory work needs to be done first. If I have time I guess I could explore to what extent the preparation steps could be automated, but this may just be unrealistic. The appeal is that the cellkey package offers a nice way of consistently applying perturbations.
I had wondered about just perturbing the microdata (plus record swapping). The challenge there is again having to define a suitable amount of perturbation, and applying too much perturbation reduces the utility of the data.
I did have one question that it might be possible to answer: is it acceptable to simply divide a sum table by a count table (over the same variables) to generate a mean table? Or do we need to create a new method for a cellkey object called meantab?
Thanks again
To add a little to this discussion:
The basic idea of CKM is indeed to add record keys once to the microdata and keep those fixed for future analyses. Only that way can you guarantee consistency across tables. Note that you do lose additivity within the table.
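As a simplified illustration of why fixing the record keys once gives this consistency (just the bare mechanism in base R with invented data; the cellkey package additionally looks up the actual perturbation value from a ptable based on the cell key):

```r
set.seed(42)

# record keys are drawn once, stored with the microdata and never changed
micro <- data.frame(
  sex  = sample(c("m", "f"), 100, replace = TRUE),
  area = sample(c("north", "south"), 100, replace = TRUE)
)
micro$rkey <- runif(nrow(micro))

# the cell key of any cell is a deterministic function of the record keys of
# the units falling into it, here the fractional part of their sum
cell_key <- function(rkeys) sum(rkeys) %% 1

# two different "queries" that happen to define the same cell ...
q1 <- subset(micro, sex == "f" & area == "north")
q2 <- subset(micro, area == "north" & sex != "m")

# ... yield the same cell key and would therefore receive the same perturbation
cell_key(q1$rkey) == cell_key(q2$rkey)  # TRUE
```

Because the record keys never change, any two queries that define the same cell produce the same cell key and hence the same noise; that is where the consistency over tables comes from.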
Starting with perturbing the microdata (e.g. with data swapping) is the more "traditional" way of handling general output. However, many statistical institutes only allow "trusted researchers" to access that kind of dataset. "Traditional" ways of perturbation are e.g. data swapping, PRAM, noise addition, global recoding (more general categories), local suppression, etc. The basic "problem" then is how to make sure that the produced output is "safe" while the utility is "high" enough (the general trade-off in statistical disclosure control). Moreover, often a combination of techniques is used, which makes it much more complicated to assess the risk control and the utility.
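As a toy example of one of these traditional techniques, a naive PRAM step could look like the sketch below (the keep-probability of 0.9 is a placeholder; a real application would use a carefully designed, often invariant, transition matrix):

```r
set.seed(7)

# toy categorical variable (invented data)
region <- sample(c("A", "B", "C"), 50, replace = TRUE)

# naive PRAM: keep the original category with probability p_keep,
# otherwise draw one of the remaining categories uniformly at random
pram_simple <- function(x, p_keep = 0.9) {
  lev <- unique(x)
  vapply(x, function(v) {
    if (runif(1) < p_keep) v else sample(setdiff(lev, v), 1)
  }, character(1), USE.NAMES = FALSE)
}

# cross-tabulate original vs. perturbed categories
table(original = region, prammed = pram_simple(region))
```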
Concerning your specific question on means: I know there are some comments/ideas written down on this. There can be some complications when the perturbation is applied independently to the variable in the numerator and the variable in the denominator. I'll try to track down some reference for you.
Hi @tombisho I don't know anything about DataSHIELD, but I know a little bit about the cell key method :-) Actually it's totally okay to divide a perturbed magnitude table by a perturbed count table, except for the relatively high variance you get by dividing by a perturbed count if the original count is small. In the 2020 PSD paper by Tobias Enderle, Sarah Giessing and me, we showed why you should nevertheless also perturb the means whenever you publish perturbed count data.

But there are ways to get more reliable results even when using the cell key method :-) For example, if you don't plan to publish the magnitudes that link to the means, then it's sufficient to just use the perturbation method for magnitude tables and divide the perturbed numerator by the original denominator. Additionally, we have since found out that in certain cases it's even okay to publish the original mean, IF(!) you round it sufficiently (what "sufficiently" means has to be checked in advance for the specific data) and refuse to publish means at all whenever the count is too small (again, this has to be checked in advance for the specific data).

But yeah, you really have to check in advance whether this is suitable for your specific data and your perturbation parameters. Maybe that helps to give you a sense of whether the cell key method is still of interest for you or not.
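A small base-R sketch of the three options described above; all numbers, the rounding base and the count threshold are placeholders and would have to be derived from the specific data and perturbation parameters:

```r
# perturbed and original cell values for one example cell (all numbers invented)
pert_count <- 12          # perturbed count
pert_sum   <- 15300       # perturbed sum (magnitude) for the same cell
orig_count <- 11          # original, unperturbed count
orig_mean  <- 1391.4      # original mean

# option 1: perturbed sum / perturbed count
# (okay in principle, but high variance when the original count is small)
mean_v1 <- pert_sum / pert_count

# option 2: perturb only the numerator via the magnitude-table method and
# divide by the ORIGINAL count (see the caveat above about what is published)
mean_v2 <- pert_sum / orig_count

# option 3: publish the original mean, but rounded and only above a count
# threshold; rounding base and threshold are placeholders that have to be
# checked in advance for the specific data and perturbation parameters
round_base <- 50
min_count  <- 10
mean_v3 <- if (orig_count >= min_count) round(orig_mean / round_base) * round_base else NA

c(v1 = mean_v1, v2 = mean_v2, v3 = mean_v3)
```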
Hi @ppdewolf and @r-tent,
Thank you both for your inputs. They are useful for our considerations about how we might use CKM in a DataSHIELD setting. As you can see, it is not a simple way forward and I think there is a lot more thinking to do. If we do make progress I will be sure to report back.
Tom
Thank you for the excellent cellkey (0.19.1, linux) package. It has been really useful for starting to understand some of the concepts.
I am wondering about how it should be used to calculate other summary statistics beyond counts and sums. I guess means would be a good start. Is it acceptable to simply divide a sum table by a count table? Or do we need to create a new method for a cellkey object called meantab?
To give you some context about why I am asking: we have been using and developing DataSHIELD as a way of analysing across datasets without having access to the microdata. The data so far is not as sensitive as census data, so the disclosure control methods are not quite as strong at the moment. For example, there are limits such as a minimum number of values contributing to a cell, a minimum subset size, etc. DataSHIELD also gives the users quite a lot of flexibility to manipulate the data, which in turn of course makes it more vulnerable to attack, in particular through differencing, because the user can dynamically define arbitrary subsets of the data (a toy illustration follows below). It would seem that the cell key method can help us here, and it would be great to somehow integrate it, as this would add to the overall protection offered by DataSHIELD.
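For illustration, the kind of differencing this enables (a toy example with invented variables): two subsets that each pass a minimum-cell-size check can still be differenced to reveal a single participant's value. Because the two subsets define different cells, a cell-key mechanism would perturb them independently and so mask the difference.

```r
set.seed(3)

# toy microdata held on a DataSHIELD server (variables invented), constructed
# so that exactly one participant is aged 80
d <- data.frame(
  age    = c(sample(20:79, 499, replace = TRUE), 80),
  smoker = rbinom(500, 1, 0.3)
)

# two freely chosen subsets that each easily pass a minimum-cell-size check
n_smokers_le80 <- sum(d$smoker[d$age <= 80])
n_smokers_lt80 <- sum(d$smoker[d$age <  80])

# their difference discloses the single 80-year-old's smoking status
n_smokers_le80 - n_smokers_lt80
```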
However, I am struggling to see how it can be implemented practically because: