schollz / howmanypeoplearearound

Count the number of people around you :family_man_man_boy: by monitoring wifi signals :satellite:
MIT License

Flag to only collect count of people and be GDPR compliant? #50

Open mwargan opened 5 years ago

mwargan commented 5 years ago

Related: https://github.com/schollz/howmanypeoplearearound/issues/31 and https://github.com/schollz/howmanypeoplearearound/issues/4

Would it be possible to add a flag so that we do NOT store MAC addresses and only see an aggregate count of devices at a given timestamp? The program is great, but as it stands it cannot be used on public networks in Europe because of GDPR compliance requirements :/

mwargan commented 5 years ago

OK, so not a flag, but I did create a fork and commented out the MAC address for GDPR compliance (line 250): https://github.com/mwargan/howmanypeoplearearound/blob/master/howmanypeoplearearound/__main__.py

mwargan commented 5 years ago

Hey @AlexNaga! I think that hashing the MAC would make it not anonymous but pseudonymous, which means that it could be reverse-engineered with more information (like the hashing algorithm).
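To illustrate why a hashed MAC is only pseudonymous, here is a minimal sketch. It assumes the hash is plain unsalted SHA-256 and that the attacker knows or guesses the vendor prefix (OUI); the address and the function names are made up for the example and are not part of howmanypeoplearearound.

```python
import hashlib


def hash_mac(mac: str) -> str:
    """Hash a MAC address with unsalted SHA-256 (assumed scheme, for illustration)."""
    return hashlib.sha256(mac.lower().encode()).hexdigest()


def recover_mac(target_hash: str, oui: str, limit: int = 1 << 24):
    """Brute-force the 24-bit device part of a MAC whose vendor prefix is known."""
    for n in range(limit):
        candidate = f"{oui}:{(n >> 16) & 0xFF:02x}:{(n >> 8) & 0xFF:02x}:{n & 0xFF:02x}"
        if hash_mac(candidate) == target_hash:
            return candidate
    return None


# Hypothetical stored value: only the hash is kept, not the address itself.
stored = hash_mac("ac:de:48:00:12:34")

# Knowing the algorithm and the vendor prefix is enough to get the address back.
recovered = recover_mac(stored, "ac:de:48")
print(recovered)  # ac:de:48:00:12:34
```

With a known prefix there are only about 16.7 million candidates, so the search is cheap even in the worst case.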

Your idea to add the date is a good one as well, but the same data can be obtained by just setting a longer scan time, like -s 3600, which would be a safer option as it won't store any MAC address.

The real problem is working out how many unique devices there were over a period that spans several sampling periods (e.g. how many people came in over a day when we only track how many people there were in a given hour). Apparently the problem is so common it has a Wikipedia page :D https://en.wikipedia.org/wiki/Count-distinct_problem
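A small sketch of the issue, with made-up shortened device IDs: hourly counts cannot simply be added up, because a device seen in several hours is counted once per hour.

```python
# Hypothetical devices seen in each hourly scan (shortened IDs for readability).
hourly_macs = {
    "09:00": {"aa:aa", "bb:bb", "cc:cc"},
    "10:00": {"bb:bb", "cc:cc", "dd:dd"},
    "11:00": {"cc:cc", "ee:ee"},
}

# What we would get if only the per-hour counts were stored and then summed.
sum_of_hourly_counts = sum(len(seen) for seen in hourly_macs.values())

# The true number of distinct devices needs the identities (or a sketch of them).
true_unique_devices = len(set().union(*hourly_macs.values()))

print(sum_of_hourly_counts)  # 8
print(true_unique_devices)   # 5
```

Getting the exact answer this way means keeping the identifiers around, which is exactly what we are trying to avoid.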

I'm still unsure what to do, but for now I have just commented out the MAC in my fork so that ultimately it's not stored.

buremba commented 4 years ago

@mwargan you can use the HyperLogLog algorithm and push the MAC addresses into its hash function for the minimum interval that you want to calculate. You can create a HyperLogLog instance for each minute and merge them to create rollups by hour, day or even month. Here is an example implementation in Python: https://github.com/svpcom/hyperloglog
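A rough sketch of the per-minute-then-merge idea, written as a small self-contained HyperLogLog so it runs without extra packages. This is not the API of the library linked above; the class, the SHA-1 hash, the precision `p=12` and the device IDs are assumptions for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Two hypothetical one-minute scans that share 100 devices (500 distinct in total).
minute_1, minute_2 = HyperLogLog(), HyperLogLog()
for i in range(300):
    minute_1.add(f"device-{i}")
for i in range(200, 500):
    minute_2.add(f"device-{i}")

rollup = minute_1.merge(minute_2)
print(round(rollup.count()))  # an estimate close to 500, not 600
```

Each instance keeps only the register values rather than the addresses themselves, and merging is an element-wise maximum, so minute sketches can be rolled up into hours or days in any order.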