obiba / magmajs

JavaScript implementation of the Magma library from OBiBa

Investigating the possibility of replicating magmajs elsewhere #1

Open tombisho opened 3 years ago

tombisho commented 3 years ago

I would be interested to know how difficult it would be to provide the magmajs behaviour in a JavaScript engine session running within an R session.

Background:

The magmajs editor in Opal is fine for simple harmonisation. It is a good solution because it can be used without the user having full access to the data. However, complex harmonisation scripts can be hard to debug, and problems are hard to spot. We have been experimenting with generating synthetic data on the client side: a realistic representation of the real data that the user can have full access to. We then tried loading this into DSLite and using DataSHIELD functions to harmonise, which proved difficult because of the limited set of functions in DataSHIELD. The next idea is therefore to try the V8 R package, which provides a JavaScript engine in R. The synthetic data can then be put into the JavaScript session, JavaScript code can be written to do the harmonisation, and the user can have full access to the data and see how things are working. When they are happy, the user can cut and paste the finished JS code into the Opal editor to run on the real data.

Question:

magmajs introduces features that are not available in standard JS. For example in magmajs we would refer to a column with the syntax $("smoking"). In V8, we can pull a dataset into JS but would refer to the columns as my_data.smoking.

V8 does offer the chance to pull JS libraries in, so we wondered if it is possible to pull the magmajs functionality for use with V8. This would mean the code written on the client side with full access to the synthetic data could be dropped into the Opal editor unchanged.
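A minimal sketch of how the $("smoking") syntax could be emulated in plain JavaScript, assuming the data frame arrives in the V8 session as an array of row objects. The names makeDollar and my_data are hypothetical, and only value() is emulated; this is not the magmajs implementation.

```javascript
// Hypothetical shim: makeDollar is an illustrative name, not part of magmajs.
function makeDollar(row) {
  // Return a $-style accessor bound to one row (one value set).
  return function (name) {
    if (!Object.prototype.hasOwnProperty.call(row, name)) {
      throw new Error("Unknown variable: " + name);
    }
    const raw = row[name];
    return {
      value: function () { return raw; }
    };
  };
}

// Assumed shape of the data after it has been pushed into the JS session:
// one object per row, keyed by column name.
const my_data = [
  { smoking: 12, age: 40 },
  { smoking: 0, age: 55 }
];

// Bind $ to the first row, then use the Opal-style syntax.
const $ = makeDollar(my_data[0]);
const smokingFirstRow = $("smoking").value();
console.log(smokingFirstRow); // 12
```

Binding $ to a single row keeps the script itself identical to what would be pasted into the Opal editor; only the setup around it differs.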

ymarcon commented 3 years ago

That was the idea of magmajs: to have a pure JS solution that any JS engine could execute (not only V8). But this is quite some work, as a lot of features currently require the Opal context (mostly name resolution and data extraction). I am also thinking of having R-based views in Opal, but the problem is that R is too powerful, which does not fit the use case of not having access to individual-level data.

tombisho commented 3 years ago

Hi Yannick,

I think I have made some progress, using molgenis magma script. This was simply because I could see how to build this into a *.js file from this repo, and I couldn't see how to do this from an OBiBa repo :flushed: I think this is a less complete set of functions, though.

So in summary I simply did this:

  1. Installed the V8 package in R
  2. Ran these steps to build a js file:
    git clone https://github.com/molgenis/molgenis-js-magma.git
    cd molgenis-js-magma/
    yarn install
    yarn build
  3. Started my V8 session in R: ct = v8()
  4. Imported magma: ct$source("/home/vagrant/molgenis-js-magma/dist/MagmaScript.min.js")
  5. Loaded some data and started a console:
    ct$assign("diamonds", diamonds)
    ct$console()
  6. Wrote a small script and ran it:
    ~ my_script = "if($('carat').value() < 0.2){out = 1;} else {out = 0;}"
    if($('carat').value() < 0.2){out = 1;} else {out = 0;}
    ~ MagmaScript.evaluator(my_script, diamonds[1])
    0

    Now I am going to try using my synthetic data to see what it is like trying to write a script with the ability to inspect the data in the RStudio interface.
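To inspect a whole derived column rather than a single row, the same script string can be evaluated once per row. The sketch below is self-contained and uses a hypothetical stand-in, evaluateRow, in place of MagmaScript.evaluator; it assumes the script assigns its result to out, as in step 6, and the carat values are made up.

```javascript
// Hypothetical stand-in for an evaluator: runs a script string against one row.
// It is NOT the molgenis MagmaScript implementation.
function evaluateRow(script, row) {
  // Minimal $ accessor exposing only value().
  const dollar = function (name) {
    return { value: function () { return row[name]; } };
  };
  // Wrap the script so that `out` is a local variable and is returned.
  const fn = new Function("$", "var out; " + script + "; return out;");
  return fn(dollar);
}

// Made-up rows standing in for the synthetic data.
const rows = [{ carat: 0.15 }, { carat: 0.23 }, { carat: 0.31 }];
const my_script = "if($('carat').value() < 0.2){out = 1;} else {out = 0;}";

// Evaluate the same script against every row to build the derived column.
const derived = rows.map(function (row) {
  return evaluateRow(my_script, row);
});
console.log(derived); // [1, 0, 0]
```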

tombisho commented 3 years ago

About R views in Opal: I wonder if you could apply the same constraint that is applied to magmajs. In magmajs the code is always executed against a value set, which is a row of a table. Your R code would then just be run on each row of data to transform it and create a new column. On that column you could only do the normal summaries in Opal and, later, in DataSHIELD.

tombisho commented 3 years ago

So D could be your value set, and you could do

if (D$SMOKING > 10) {
    out = 3
} else if (D$SMOKING > 5) {
    out = 2
} else {
    out = 1
}

But doing something like D[4, 'SMOKING'] would not give you the number of cigarettes smoked by the 4th participant. It would have no meaning in the row context.
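The row-context constraint can be sketched in JavaScript as follows. The helper deriveColumn and the sample rows are hypothetical; the point is only that the transformation receives one value set D at a time and returns one derived value.

```javascript
// Hypothetical helper: apply a transformation to each row in isolation.
function deriveColumn(rows, transform) {
  return rows.map(function (row) {
    // Frozen shallow copy: the transform is handed a single value set.
    const D = Object.freeze(Object.assign({}, row));
    return transform(D);
  });
}

// The categorisation from the example above, written against one row D.
function smokingCategory(D) {
  if (D.SMOKING > 10) {
    return 3;
  } else if (D.SMOKING > 5) {
    return 2;
  }
  return 1;
}

// Made-up rows for illustration.
const participants = [{ SMOKING: 12 }, { SMOKING: 7 }, { SMOKING: 0 }];
const smokingCat = deriveColumn(participants, smokingCategory);
console.log(smokingCat); // [3, 2, 1]
```

Passing a single row as an argument is a convention rather than a security boundary: in this sketch nothing stops the transform from reaching other variables in scope, so a real implementation would presumably need a sandboxed evaluation context.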

tombisho commented 3 years ago

Linked to this, it would be nice to make some aggregate values available in magmajs, or in the R equivalent. I'm guessing these types of things are available because they are used to generate the summaries in Opal. At the moment it is not possible to subtract the overall mean of a variable (e.g. for normalisation) or to reference the standard deviation of a variable. Of course, you can currently do this at the analysis phase via DataSHIELD.

This might actually be more achievable in R via e.g. ds.mean() ? So you could do:

D$SMOKING - ds.mean(GLOBAL$SMOKING)
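A sketch of the same idea in JavaScript, assuming the aggregate is computed once up front and handed to the row-level code as a read-only GLOBAL object. The names GLOBAL, mean, and centreColumn are hypothetical and do not exist in magmajs.

```javascript
// Arithmetic mean of a numeric array.
function mean(values) {
  const total = values.reduce(function (acc, v) { return acc + v; }, 0);
  return total / values.length;
}

// Hypothetical: centre one column on its overall mean.
function centreColumn(rows, name) {
  // Aggregates are computed once, outside the row context.
  const GLOBAL = Object.freeze({
    mean: mean(rows.map(function (row) { return row[name]; }))
  });
  // The row-level expression only combines the row value with the aggregate.
  return rows.map(function (D) {
    return D[name] - GLOBAL.mean;
  });
}

// Made-up rows for illustration.
const smokers = [{ SMOKING: 2 }, { SMOKING: 4 }, { SMOKING: 6 }];
const centred = centreColumn(smokers, "SMOKING");
console.log(centred); // [-2, 0, 2]
```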

tombisho commented 3 years ago

Another thing - sorry - I have been looking at ways in which DataSHIELD can 'leak' data. So far I have mainly found this happens where you can repeat a value several times to defeat the nfilters, or by isolating a value with a vector of zeros with a single 1.

I then thought about how data could leak via the magmajs interface and found you could use the 0s/1 isolation method.

First, you could get the value of any variable for a participant with a particular ID by looking at the mean or max in the summary:

var my_val = $id().value()
if (my_val == 100007){
  out = 1
}
else {
  out = 0;
}
out * $('y3alcpatt').value()

or if you don't know their ID, but know some facts about them (e.g. country of birth, education level, approximate BMI) you can get their value, again looking at the mean/max summary:

var cob = $('y1cobcat').value()
var edu = $('y3q104').value()
var bmi = $('y3bmi').value()

if ((cob == 3) && (edu == 3) && (bmi > 19) && (bmi < 21)){
  out = 1
}
else {
  out = 0;
}
out * $('y3alcpatt').value()
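The effect of the 0s/1 mask can be reproduced on made-up data. In this sketch the rows and values are invented for illustration; it only shows why the max or sum of the derived variable equals the isolated participant's value.

```javascript
// Invented rows; the variable names follow the example above.
const synthetic = [
  { id: 100005, y3alcpatt: 2 },
  { id: 100007, y3alcpatt: 3 },
  { id: 100009, y3alcpatt: 1 }
];

// Derived variable: 1 for the target ID, 0 for everyone else, times the value.
const masked = synthetic.map(function (row) {
  const out = row.id === 100007 ? 1 : 0;
  return out * row.y3alcpatt;
});

// Summary statistics computed over the derived variable.
const maxSummary = Math.max.apply(null, masked);
const sumSummary = masked.reduce(function (acc, v) { return acc + v; }, 0);

console.log(masked);     // [0, 3, 0]
console.log(maxSummary); // 3
console.log(sumSummary); // 3
```

The max only reveals the value when it is positive, whereas the sum (or the mean multiplied by the number of rows) reveals it in either case.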

ymarcon commented 3 years ago

This is annoying... Do you have an idea of how to protect against that? Should Opal remove this permission (edit without individual values)? Or at least make it configurable with plenty of warnings?

tombisho commented 3 years ago

I've been thinking about this a lot - also for DataSHIELD. The short answer is that I can't think of a way to stop this.

tombisho commented 3 years ago

From an InterConnect perspective, we need the "edit without individual values" permission, otherwise we cannot harmonise the data. It would be a major shift to have to put into place all the data transfer agreements to do the harmonisation locally on the real data or to ask the studies to do the harmonisation themselves.

But equally we have sold the whole approach to our participating groups on the basis that we would not have access to individual values, which is currently not true, both through Opal and DataSHIELD. Having it configurable with the warnings is therefore a bit uncomfortable too, as now we are saying we can access individual values, it's just a bit harder to get them.

I need to think about this some more!!

ymarcon commented 3 years ago

The problem is when you know the other values of the vector, it's not just with 0s/1, so it is even more difficult to detect.

tombisho commented 3 years ago

> The problem is when you know the other values of the vector, it's not just with 0s/1, so it is even more difficult to detect.

Yes. I was thinking for any vector that you will display the summary of, count the number of times each value appears and enforce that this must be greater than a threshold. So if you had 100, 100, 100, 100, 100, 27, 100, 100, 100, 100, .... then you would not summarise this because 27 only appears once and all the other values are the same. But it doesn't stop you doing this:

var cob = $('y1cobcat').value()
var edu = $('y3q104').value()
var bmi = $('y3bmi').value()

if ((cob == 3) && (edu == 3) && (bmi > 19) && (bmi < 21)){
  out = 1000000
}
else {
  out = 0.0000001;
}
out * $('y3alcpatt').value()

tombisho commented 3 years ago

In that example the resulting vector apparently has lots of different values and will not be trapped, but you can still work out the value you want (it will just be 3.000001 rather than 3).
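One possible reading of the frequency rule, sketched in JavaScript: refuse to summarise when fewer than a threshold number of entries differ from the dominant value. The function isSafeToSummarise, the threshold of 3, and the data are all hypothetical. The sketch shows the obvious vector being trapped while the rescaled one passes and still leaks through the max.

```javascript
// Hypothetical check: reject vectors where only a few entries differ
// from the single most common value.
function isSafeToSummarise(values, threshold) {
  if (values.length === 0) {
    return true;
  }
  const counts = new Map();
  values.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });
  const maxCount = Math.max.apply(null, Array.from(counts.values()));
  const others = values.length - maxCount;
  // A constant vector is fine; otherwise require enough non-dominant entries.
  return others === 0 || others >= threshold;
}

// The obvious case: one isolated value among identical ones.
const obvious = [100, 100, 100, 100, 100, 27, 100, 100, 100, 100];

// The disguised case: made-up categorical values, rescaled instead of zeroed.
const alc = [1, 2, 3, 2, 1, 3, 2, 1, 2, 3];
const targetIndex = 5;
const disguised = alc.map(function (v, i) {
  const out = i === targetIndex ? 1000000 : 0.0000001;
  return out * v;
});

console.log(isSafeToSummarise(obvious, 3));   // false
console.log(isSafeToSummarise(disguised, 3)); // true
console.log(Math.max.apply(null, disguised)); // 3000000
```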

ymarcon commented 3 years ago

What if Opal were to "blur" the summary statistics?

tombisho commented 3 years ago

> What if Opal were to "blur" the summary statistics?

Maybe... I guess the challenge is to get the amount of blurring right, so that it is still useful but not disclosive.

tombisho commented 3 years ago

Actually, for a mean, if you add normally distributed noise with mean 0, then the individual values would be obscured but the overall result would stay (approximately) the same?

If I have remembered my stats correctly!?

tombisho commented 3 years ago

I think that does work if your sample is big enough....
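A quick way to check this intuition on made-up data: add zero-mean Gaussian noise to every value and compare the means. The noise standard deviation of 5 and the sample size of 10000 are arbitrary assumptions, and the noise is drawn with the Box-Muller transform.

```javascript
// Draw one sample from N(0, sd^2) using the Box-Muller transform.
function gaussian(sd) {
  const u1 = 1 - Math.random(); // in (0, 1], avoids log(0)
  const u2 = Math.random();
  return sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Arithmetic mean of a numeric array.
function mean(values) {
  const total = values.reduce(function (acc, v) { return acc + v; }, 0);
  return total / values.length;
}

// Made-up "true" values: 10000 entries cycling through 0..39.
const n = 10000;
const trueValues = [];
for (let i = 0; i < n; i++) {
  trueValues.push(i % 40);
}

// Blur every individual value with independent zero-mean noise.
const noisy = trueValues.map(function (v) { return v + gaussian(5); });

// How far the mean moved, and how far a typical individual value moved.
const meanShift = Math.abs(mean(noisy) - mean(trueValues));
const typicalShift = mean(noisy.map(function (v, i) {
  return Math.abs(v - trueValues[i]);
}));

console.log("shift in mean:", meanShift);
console.log("average shift per value:", typicalShift);
```

The shift in the mean shrinks roughly with the square root of the sample size, which is why it only works for large samples. If fresh noise is drawn on every request, repeated requests could in principle be averaged to reduce the blur.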

ymarcon commented 3 years ago

Isn't it a homomorphic algorithm that we are looking for?

tombisho commented 3 years ago

I would say it is more an implementation of differential privacy. My understanding of homomorphic encryption is that decrypting the result of an operation on the ciphertext gives the same output as applying that operation to the original plaintext.
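For comparison, a textbook-style sketch of a differentially private mean using the Laplace mechanism. It assumes the values are clamped to known bounds and that the number of rows is public; the names privateMean and laplaceNoise, the bounds, and epsilon = 1 are illustrative choices, not anything Opal or DataSHIELD currently provides.

```javascript
// Laplace(0, scale) noise as the difference of two exponential samples.
function laplaceNoise(scale) {
  const e1 = -Math.log(1 - Math.random()); // 1 - random() is in (0, 1]
  const e2 = -Math.log(1 - Math.random());
  return scale * (e1 - e2);
}

// Hypothetical: mean of bounded values with Laplace noise added to the result.
function privateMean(values, lower, upper, epsilon) {
  // Clamp so that one participant can move the mean by at most (upper - lower) / n.
  const clamped = values.map(function (v) {
    return Math.min(upper, Math.max(lower, v));
  });
  const total = clamped.reduce(function (acc, v) { return acc + v; }, 0);
  const trueMean = total / clamped.length;
  const sensitivity = (upper - lower) / clamped.length;
  return trueMean + laplaceNoise(sensitivity / epsilon);
}

// Made-up data: 1000 values cycling through 0..39 (true mean 19.5).
const values = [];
for (let i = 0; i < 1000; i++) {
  values.push(i % 40);
}

const blurredMean = privateMean(values, 0, 40, 1);
console.log(blurredMean);
```

Here the noise is added to the released statistic rather than to each individual value, and its scale depends on how much one participant can change that statistic.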

ymarcon commented 3 years ago

Looks good: https://github.com/google/differential-privacy

ymarcon commented 3 years ago

Do you know this one? https://www.openmined.org/

tombisho commented 3 years ago

> Looks good: https://github.com/google/differential-privacy

Yes, quite comprehensive! It looks like this might be quite a large task.

> Do you know this one? https://www.openmined.org/

No I have not seen that one. I think it is a popular topic but quite focused on machine learning only:

https://owkin.com/federated-learning/

https://www.researchgate.net/publication/344914450_Revolutionizing_Medical_Data_Sharing_Using_Advanced_Privacy_Enhancing_Technologies_Technical_Legal_and_Ethical_Synthesis

tombisho commented 3 years ago

Just coming back to this... the proposed solution in DataSHIELD for these types of attack is to make the data owners aware that they are possible, but not necessarily easy, and to couple that with the fact that all queries are logged, so problems could be detected that way (currently post hoc, but in future perhaps proactively). In Opal, is the JavaScript execution logged in the same way?