nazariyv / ocean-data-leakage

This is an entry to: https://gitcoin.co/issue/oceanprotocol/ocean-bounties/27/4461

Correlation measure is pretty imperfect #1

Open FishmanL opened 4 years ago

FishmanL commented 4 years ago

Have you considered using something like Microsoft's Whitenoise to do DP analysis? Right now your correlation entropy is pretty beatable by using random known noise values.
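To illustrate the concern, here is a minimal sketch (not code from this repository) of how noise that the algorithm author can regenerate might defeat a correlation-based filter. The `pearson` helper, the seed `42`, and the toy `secret` data are all hypothetical and chosen only for the example.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical private input column.
secret = [i % 10 for i in range(1000)]

# The algorithm masks the data with noise from a seed it knows (assumed: 42).
rng = random.Random(42)
noise = [rng.randint(-10**6, 10**6) for _ in secret]
leaked = [s + z for s, z in zip(secret, noise)]

# The output looks almost uncorrelated with the input...
correlation = pearson(secret, leaked)

# ...yet anyone who knows the seed can subtract the noise and recover the data.
rng_attacker = random.Random(42)
recovered = [y - rng_attacker.randint(-10**6, 10**6) for y in leaked]
```

Under these assumptions `correlation` stays close to zero while `recovered` equals `secret` exactly, so a threshold on correlation alone would not flag the leak.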

FishmanL commented 4 years ago

Also, it assumes that there are input and output files; it should be operating on the input and output results of functions.

nazariyv commented 4 years ago

For the benefit of the community, I shall include my comment from the Ocean Protocol Discord chat:

I can only see one. You probably meant the two different points you raised in that issue. To address your first comment on that issue, which is:

Have you considered using something like Microsoft's Whitenoise to do DP analysis? Right now your correlation entropy is pretty beatable by using random known noise values.

I am aware that the correlation isn't the best filter condition, but it is certainly better than (i) not having it there at all, or (ii) implementing the "feature correlation" suggested in the description of the bounty. It is also a good starting point to build on. There are a number of different measures to gauge the distance of one document from another, like Levenshtein distance, Euclidean distance, cosine similarity, etc. But they all require the input and output documents to have a certain form. Entropy is a generalised metric that we can utilise here. One also has to be mindful of the time it takes to implement all of this, and since I wanted to hack another one of Ocean's bounties, that is the sacrifice I have made. Feel free to improve on my PR. That is why I have created a YouTube video and documented everything extensively. I will be happy if you extend my solution.
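As a rough illustration of why entropy needs no particular document form, here is a minimal sketch of byte-level Shannon entropy and a position-aligned mutual information estimate. This is not necessarily the measure implemented in the PR; the function names `shannon_entropy` and `mutual_information` are hypothetical, and aligning the two streams byte by byte is an assumption made for simplicity.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mutual_information(a: bytes, b: bytes) -> float:
    """Mutual information in bits between position-aligned bytes of a and b.

    Both inputs are truncated to the shorter length (an assumption of this sketch).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    joint = Counter(zip(a, b))
    h_joint = -sum((c / n) * math.log2(c / n) for c in joint.values())
    # I(A; B) = H(A) + H(B) - H(A, B)
    return shannon_entropy(a) + shannon_entropy(b) - h_joint
```

Both functions accept arbitrary bytes, so they apply to any input or output file, whereas an edit distance or cosine similarity would first need the documents tokenised or vectorised in a compatible way.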

Your second comment:

also, it assumes that there are input and output files, this should be operating on the input and output results of functions

I don't understand your comment. Ocean's Kubernetes cluster will mount the input volume for you, and you can analyse the correlation between that and the data scientist's algo output.
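For illustration, a minimal sketch of what a file-based check over the mounted volumes could look like. The directory layout is passed in by the caller, since the actual mount paths are not stated here; `read_volume` and `verbatim_chunk_fraction` are hypothetical helpers, and the verbatim-chunk check is a simple stand-in, not the correlation measure from the PR.

```python
from pathlib import Path

def read_volume(directory: Path) -> bytes:
    """Concatenate the bytes of every regular file under `directory`, in sorted path order."""
    files = sorted(p for p in directory.rglob("*") if p.is_file())
    return b"".join(p.read_bytes() for p in files)

def verbatim_chunk_fraction(inputs: bytes, outputs: bytes, chunk_size: int = 16) -> float:
    """Fraction of fixed-size input chunks that appear verbatim anywhere in the output."""
    chunks = [
        inputs[i:i + chunk_size]
        for i in range(0, len(inputs) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        return 0.0
    return sum(chunk in outputs for chunk in chunks) / len(chunks)
```

A caller would pass the mounted input directory and the algorithm's output directory to `read_volume`, then compare the two byte strings; a fraction near 1.0 would suggest the output copies the input wholesale.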