NOT A FEATURE REQUEST, moreso a coaching session

chartiersteven commented 2 years ago

I think anonimatron is pretty neat, but RANDOMDIGITS didn't quite behave the way I needed it to. I needed it not only to make sure that the synonym generated for "123" differs from the original "123" digits, but also to make sure that if "123" generates the synonym "456", then no other key generates "456". Basically, I needed to know RANDOMDIGITS was capable of generating a pseudo-anonymized primary key. So I created RANDOMDIGITSUNIQUE. It's not rocket science: it just needed a "used array list" companion to the synonym hashmap. I also had to do a few small things, like updating the Oracle library so that it handles Oracle 19 PDBs, which use a Service Name rather than a SID.

All that's pretty good, but I'm a beginner at giving back on GitHub, and Java is more like my 3rd language than my first. Would anyone volunteer to walk me through setting up a branch, pushing to and pulling from it, and anything else I should do to contribute a more responsible change? I'm sure I could find this out with Google and some time, but I'm sure someone could help accelerate me. I've got an IntelliJ project, if it matters. I can be reached at steven@thechartiers.com. Thanks for considering this. I'm flexible / not in a rush.
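For readers following along, here is a minimal sketch of the idea described above: a synonym map paired with a "used" collection, so that no two keys receive the same digit string. This is not the actual RANDOMDIGITSUNIQUE code and it does not use Anonimatron's interfaces; the class name `UniqueDigitsSketch`, the method `anonymize`, and the retry limit are all hypothetical. A `HashSet` is assumed in place of an array list because membership checks are cheaper.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of Anonimatron.
class UniqueDigitsSketch {
    private static final int MAX_ATTEMPTS = 10_000; // assumed retry limit

    private final Map<String, String> synonyms = new HashMap<>(); // original -> synonym
    private final Set<String> used = new HashSet<>();             // synonyms already handed out
    private final SecureRandom random = new SecureRandom();

    /** Returns a digit string of the same length that no other key has received. */
    String anonymize(String from) {
        String known = synonyms.get(from);
        if (known != null) {
            return known; // same input always maps to the same synonym
        }
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            String candidate = randomDigits(from.length());
            // Reject the original value itself and anything already in use.
            if (!candidate.equals(from) && used.add(candidate)) {
                synonyms.put(from, candidate);
                return candidate;
            }
        }
        throw new IllegalStateException("No unused synonym found for a key of length " + from.length());
    }

    private String randomDigits(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(random.nextInt(10));
        }
        return sb.toString();
    }
}
```

The retry limit matters because the space of n-digit strings is finite: once most of them are used, random retries become slow, and once all are used they can never succeed.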

realrolfje commented 2 years ago

If we need to guarantee that the same value is not generated twice for different source fields, I think a better option would be to have Anonimatron itself check that, rather than leaving it to one single Anonymizer.

The problem with storing generated numbers in an array list is that it will not guarantee uniqueness between two runs. The Roman Name and Elven Name Anonymizers have the same problem in that respect (unique names in this run, but not guaranteed to be unique for successive runs).
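To illustrate the cross-run problem, here is a hedged sketch in which the "used" set is rebuilt from synonyms saved by an earlier run, so that a later run cannot hand out a value that is already taken. How Anonimatron actually persists and reloads synonyms is not shown here; the class `SeededUniqueDigits`, its constructor argument, and `export()` are hypothetical names for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: uniqueness that survives across runs by seeding from saved synonyms.
class SeededUniqueDigits {
    private final Map<String, String> synonyms;
    private final Set<String> used;
    private final Random random = new Random();

    /** previousSynonyms: original -> synonym pairs saved by an earlier run (may be empty). */
    SeededUniqueDigits(Map<String, String> previousSynonyms) {
        this.synonyms = new HashMap<>(previousSynonyms);
        this.used = new HashSet<>(previousSynonyms.values()); // values from earlier runs stay reserved
    }

    String anonymize(String from) {
        String known = synonyms.get(from);
        if (known != null) {
            return known; // consistent with the earlier run
        }
        for (int attempt = 0; attempt < 10_000; attempt++) { // assumed retry limit
            StringBuilder sb = new StringBuilder(from.length());
            for (int i = 0; i < from.length(); i++) {
                sb.append(random.nextInt(10));
            }
            String candidate = sb.toString();
            if (!candidate.equals(from) && used.add(candidate)) {
                synonyms.put(from, candidate);
                return candidate;
            }
        }
        throw new IllegalStateException("No unused synonym left for " + from);
    }

    /** Copy of all mappings, to be saved and passed to the next run. */
    Map<String, String> export() {
        return new HashMap<>(synonyms);
    }
}
```

A second run constructed with `new SeededUniqueDigits(firstRun.export())` returns the same synonyms for keys it has seen before and only unused values for new keys. An in-memory list that starts empty on every run cannot give that guarantee.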

Just out of interest (maybe there is an even better solution), is there a specific reason why you need guaranteed unique numbers?

chartiersteven commented 2 years ago

Hi realrolfje, you're absolutely correct. I did make a new anonymizer, but those changes were NOT sufficient for the feature. I also had to make changes so that Anonimatron could enforce uniqueness for my new anonymizer, i.e. the existing extensibility model is not sufficient for the kind of change I needed. So I made further changes so that the synonym list could be used appropriately, just as you say. It's important to be consistent between runs, just as you say.

And yes, there is a specific reason. Clinical data is (often) already de-identified, unless it is an open-label trial. So a subject number wouldn't "seem" like something you'd label as identifying (as it has no context outside of one trial or trial site). But pharma considers the subject number identifying because anyone who can get to a subject number can eventually get to a person. This is because the subject number has the equivalent of a synonym list for things like emergency code-break. So now, if I really want to anonymize identifying clinical data, I must anonymize keys such as subject-id, site-id, and visit-id. Strictly speaking, even this form of anonymization is insufficient, because anyone can eventually recognize a subject in a trial given sufficient quasi-identifying and sensitive information. But it's a good start. The id must be unique so that I maintain the referential integrity of the data until I address other issues such as inserting noise, etc. Does that make sense?
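A small sketch of why uniqueness matters for referential integrity: if the same one-to-one synonym map is applied to a key column and to every foreign key that references it, joins still match exactly the same rows. If two originals share a synonym, child rows start matching the wrong parents. The tables, ids, and helper names below are invented for illustration and are not real trial data or Anonimatron code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remap a key column and check that parent/child joins are preserved.
class ReferentialIntegritySketch {
    /** Returns copies of the rows with one column replaced via the shared synonym map. */
    static List<String[]> remap(List<String[]> rows, int column, Map<String, String> synonyms) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String synonym = synonyms.get(row[column]);
            if (synonym == null) {
                throw new IllegalArgumentException("No synonym for " + row[column]);
            }
            String[] copy = row.clone();
            copy[column] = synonym;
            out.add(copy);
        }
        return out;
    }

    /** Counts (parent, child) pairs where the child's foreign key equals the parent's key. */
    static long countJoined(List<String[]> parents, int pk, List<String[]> children, int fk) {
        long matches = 0;
        for (String[] child : children) {
            for (String[] parent : parents) {
                if (parent[pk].equals(child[fk])) {
                    matches++;
                }
            }
        }
        return matches;
    }
}
```

With two subjects and three visits, a one-to-one map leaves the join count at three. A map that sends both subject ids to the same synonym doubles it to six, because every visit now matches both subjects.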
