leela-zero / leela-zero-server

Server side code of the Leela Zero project
GNU Affero General Public License v3.0

GDPR #152

Open gcp opened 6 years ago

gcp commented 6 years ago

As you probably have noticed (LOL), the GDPR is now in effect in Europe. This affects us, because the server sits in Germany and we have European users.

Based on my reading of the law, the hobby, non-commercial nature of the project exempts us from compliance. However, I believe privacy protections are generally a good thing, and we should set an example and follow best practices wherever possible and reasonable.

In the past I've rejected various enhancement proposals that would have meant storing PII on the server, so we have smooth sailing there. The only PII that is ever stored is the IPs in the server logs (like any web server!). This is allowed without opt-in for abuse tracking and defensive purposes, provided they're not stored longer than necessary. I changed the server configs a while ago to rotate and delete logs much faster (14 days). I'll probably also enable encrypting them soon. The data from the games is already cleaned out of the server on a similar time-frame (required anyway because of storage concerns!).

The IPs are also used to make the "show the last game I generated" feature work. I suspect that is OK as well - it will expire as above.

The only thing I'm not clear about is whether we need some kind of notice about the IPs in the server logs somewhere, and what it should say.

marcocalignano commented 6 years ago

We should be OK if we store IPs for security purposes only, even without asking permission. But if we use the IPs (as I believe we do) to decide which clients are fast, so that we can deliver matches only to the faster clients, then you need consent.

gcp commented 6 years ago

Hmm, I see. The claim would be that using it to prioritize matches to fast clients would not be a "legitimate interest" because it could be achieved by alternative approaches? (This is different than showing the last generated game which can only reasonably be done that way!)

I am not entirely sure here. The server could record a token (as we do with matches) and store the send time of the token in the DB. If the client replies with the token, we can measure the delay, and we don't need the IP to associate speed with a client. Because the token is random and immediately discarded it doesn't count as personal information? I'm not entirely sure of the latter, but this does seem to have better privacy properties than IP addresses. (If you need to track over multiple games to get a good speed indication, you need to keep the token longer and this doesn't seem any different from an IP to me)
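
The token idea above can be sketched roughly as follows. This is only an illustration in Python, not the server's actual code: `TokenTimer`, `issue` and `complete` are hypothetical names, and an in-memory dict stands in for the DB.

```python
import secrets
import time


class TokenTimer:
    """Hypothetical helper: times a task via a one-shot random token, not an IP."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._sent = {}  # token -> time the task was handed out

    def issue(self):
        """Create a random token and remember when it was sent."""
        token = secrets.token_hex(16)
        self._sent[token] = self._clock()
        return token

    def complete(self, token):
        """Return the elapsed seconds for this token, or None if it is unknown.

        The token is removed on first use, so it cannot be replayed and
        nothing identifying the client is kept after the task is done.
        """
        sent_at = self._sent.pop(token, None)
        if sent_at is None:
            return None
        return self._clock() - sent_at
```

The server would call `issue()` when handing out a task and `complete(token)` when the result comes back. Tracking speed over several games would mean keeping the token longer, which is exactly the caveat raised above.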

marcocalignano commented 6 years ago

In the case of server logs you have the "legitimate interest" claim, so you are OK! (Even so, you should really encrypt them!) For both of the other uses you can store a hash of the IP address. The hash is not reversible, so there is no personal information attached to it. Showing the last generated game also means that my IP is saved in a database, which is also not "legitimate interest". If you store only the hash in the database, and no other personal information is connected to that hash, we are fine. Also, the autogtp patch I suggested a few months ago would avoid any use of IP addresses in the profiling process.

gcp commented 6 years ago

In the case of server logs you have the "legitimate interest" claim, so you are OK!

That has got nothing to do with "legitimate interest". It's specifically allowed for "The processing of personal data to the extent strictly necessary and proportionate for the purposes of ensuring network and information security" which is a different section from legitimate interest provisions.

For both of the other uses you can store a hash of the IP address. The hash is not reversible

This does not work at all. Most users are on IPv4 and a hash of IPv4 can be trivially reversed. A keyed/salted hash cannot, but if the server must match up IPs with clients that connect, it needs to keep the key alive, which makes the hash reversible again. So hashing would achieve exactly nothing here.
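
To illustrate why an unkeyed hash of an IPv4 address offers little protection, here is a small Python sketch. It assumes SHA-256 over the dotted-quad string, which is only a guess at how such a hash might be stored, and it searches a single subnet to keep the run short. The full IPv4 space is only 2^32 addresses, so the same loop over everything is well within reach of ordinary hardware. All function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def sha256_hex(ip):
    """Unkeyed hash of an IP address string (assumed storage format)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def reverse_hash(target_hex, network):
    """Brute-force the address in `network` whose hash equals target_hex."""
    for addr in ipaddress.ip_network(network):
        if sha256_hex(str(addr)) == target_hex:
            return str(addr)
    return None


def keyed_hash_hex(ip, key):
    """Keyed variant (HMAC-SHA256): not guessable without the key."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
```

For example, `reverse_hash(sha256_hex("192.0.2.77"), "192.0.2.0/24")` recovers the address by enumeration. The keyed variant resists this only for someone without the key; a server that has to keep the key in order to match returning clients can run the same loop itself.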

Showing the last generated game also means that my IP is saved in a database, also not "legitimate interest".

It's allowed for abuse prevention, and we've used it to cull spammers from the DB.

Using it for the last generated game is reasonably a legitimate interest because that feature can't be implemented in a way that doesn't store similar or more PII. I would say that it's also reasonable under "whether a data subject can reasonably expect at the time and in the context of the collection of the personal data that processing for that purpose may take place."

I think there's a fair argument that the match scheduling does not pass those, though, if only because that could be implemented with random keys. The relevant difference that makes one work and the other not is that for showing the game you submitted, we need to make a link between your browser and the client you're running, and the connecting IP is the only thing that works for that.

Mardak commented 6 years ago

Would a non-IP-address-based token similar to normal web cookies require a disclaimer or other special handling?

Also, anything not based on IP address will require users to somehow identify / associate with the autogtp that has been talking to the server before it can "show the last game I generated", e.g., autogtp would need to print out the token (it already prints out the task json, so no special changes are required), and the server page could have an input box for the token.

marcocalignano commented 6 years ago

But then I guess the autogtp patch is mandatory because in that case you do not need to save the IP of the client.

gcp commented 6 years ago

Would a non-IP-address-based token similar to normal web cookies require a disclaimer or other special handling?

I don't think so. If we're the only ones to ever handle it, it can't become linked to a person. The thing that made IPs work or not work was that websites can store IP->data, and the ISP can store the IP->person mapping, so they could be combined to form a data->person map.

Obviously you can't store a map from the token to an IP anywhere.

(This makes me realize that using the token over multiple games is probably OK anyway)

Also, anything not based on IP address will require users to somehow identify / associate with the autogtp that has been talking to the server before it can "show the last game I generated"

That also works, I guess. AutoGTP could generate the token on startup.
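
A rough Python sketch of what a client-generated token could look like, purely as an illustration: `new_client_token` and `LastGameIndex` are hypothetical names, not part of AutoGTP or the server, and a dict stands in for the database.

```python
import secrets


def new_client_token():
    """Client side: generate a random, unguessable token once at startup."""
    return secrets.token_urlsafe(16)


class LastGameIndex:
    """Server side: maps a client token (never an IP) to its latest game."""

    def __init__(self):
        self._last = {}  # token -> id of the most recent game

    def record(self, token, game_id):
        """Remember the latest game submitted under this token."""
        self._last[token] = game_id

    def lookup(self, token):
        """Return the latest game for the token, or None if there is none."""
        return self._last.get(token)

    def forget(self, token):
        """Drop the entry, e.g. when the retention period expires."""
        self._last.pop(token, None)
```

The user would paste the token printed by the client into the web page, and the page would call `lookup(token)`. No token-to-IP mapping is stored anywhere in this sketch.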

marcocalignano commented 6 years ago

Autogtp would tell the server the last game duration AT JOB REQUEST TIME. That means the server can store only the game duration data and statistically calculate a threshold to establish, in real time, whether the client requesting the job is a fast or a slow client, and give this client the appropriate job. NO need to save IPs or to search the DB for IPs.
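
One way the threshold could be computed, sketched in Python under stated assumptions: the window size, the "fastest 25%" cut-off and the name `SpeedClassifier` are all made up for illustration, and the duration is whatever the client reports, with nothing verifying it.

```python
from collections import deque


class SpeedClassifier:
    """Hypothetical: classifies clients from self-reported game durations only."""

    def __init__(self, window=1000, fast_fraction=0.25):
        self._durations = deque(maxlen=window)  # recent durations, no IPs
        self._fast_fraction = fast_fraction

    def record(self, seconds):
        """Store a reported game duration (seconds)."""
        self._durations.append(seconds)

    def threshold(self):
        """Duration at the `fast_fraction` position of the sorted window."""
        if not self._durations:
            return None
        ordered = sorted(self._durations)
        index = int(len(ordered) * self._fast_fraction)
        return ordered[min(index, len(ordered) - 1)]

    def is_fast(self, reported_seconds):
        """True if the reported duration is at or below the current threshold."""
        cutoff = self.threshold()
        return cutoff is not None and reported_seconds <= cutoff
```

At job request time the server would call `is_fast(duration)` to pick the job type and then `record(duration)` to update the statistics.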

gcp commented 6 years ago

Autogtp would tell the server the last game duration AT JOB REQUEST TIME.

The client can lie though. Maybe we don't care about that?

If we want to keep showing the latest game then we might as well use a token.

marcocalignano commented 6 years ago

What would be the client gain from lying?

gcp commented 6 years ago

You would be able to force matches towards your client(s), and can then lie about the results, which is exactly what has at least happened in the past.

(Does this not exactly make it an argument that throttling matches via IPs is permissible for data integrity/DDOS prevention, thus allowing their use?)

marcocalignano commented 6 years ago

I guess you can always do that, even now. Just generate 100 games without uploading, then upload them all at the same time: you then look like a really fast client and you get the match. Or generate random games really fast.

MartinVingerhoets commented 6 years ago

Instead of using IP addresses, why doesn't the server assign a random id to a client on first connection and store that somewhere? This way, if you want to recheck the speed of a client, you can expire the id and force the client to take another one.

gcp commented 6 years ago

Instead of using IP addresses, why doesn't the server assign a random id to a client on first connection and store that somewhere?

The client needs to remember the ID, so it might as well generate it itself - that's what is being proposed.

Mardak commented 6 years ago

From some quick reading, it sounds like cookies / tokens / ids (any "identifiers") are "personal data" as it is possible to relate it back to an individual. I suppose at a high level, anything the server can do to show "yours" will mean the server has "personal data."

So assuming that's true (i.e., the server will have "personal data"), that just means the data needs to be handled in compliance with GDPR -- it's not wrong to have personal data. I don't really know what it means to comply ;) but I guess part of it is what gcp has already stated about data retention policies, and probably also giving users consent / choice, e.g., "I want to be able to see my submitted games" or "I want to participate in matches"

gcp commented 6 years ago

From some quick reading, it sounds like cookies / tokens / ids (any "identifiers") are "personal data" as it is possible to relate it back to an individual.

I don't think so? They are a problem if they are shared across sites, because then you can construct a profile of who the person is. But that does not apply to per-site IDs that aren't linked to any other personal data. For IPs, it's a problem even with a single site because someone (the ISP) has the database with the mapping to a person. There was an explicit legal decision setting out that reasoning.

that just means the data needs to be handled in compliance with GDPR -- it's not wrong to have personal data.

The problem is that "handling in compliance with GDPR" is quite a nuisance so it's better to have nothing at all (which is a philosophy I certainly like...). For one, you need explicit user consent first and foremost.

Mardak commented 6 years ago

https://www.privacy-regulation.eu/en/recital-30-GDPR.htm

(30) Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags.

This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.

The first line basically treats IP address and cookies equally as identifiers. The second line in our case relates to submitted game data associated to these identifiers. If the server can show a user "your" data, then the server has "personal data."

gcp commented 6 years ago

The second paragraph is critical, no? "leave traces which, combined ... may be used to create profiles and identify them".

The games don't help you towards identifying a person. If they would, we would have bugs :-)

IP addresses do, which is why they are under discussion. A site-unique identifier that is not linked to other data does not, as far as I can tell.

Mardak commented 6 years ago

A simple example is BRII cluster generating significantly more games than others, so it's fairly easy to identify the individual associated to that data (whether the games were associated to each other via IP address or cookie or other identifier).

gcp commented 6 years ago

If that reasoning holds (it could be anyone with access to a lot of hardware!) I see no other solution than to either significantly expand the site features (an option to download all of your games that are in the current DB, an option to delete them, an option to disable "see last game" and match assignment) and have AutoGTP present an opt-in at startup, or remove those features altogether.

gcp commented 6 years ago

I mean, if you modify AutoGTP so that it scribbles "Mardak" in the comments of every game, are you then able to claim that I'm storing personal information about you even after anonymizing the IPs?

Or modify the moves such that they spell out MARDAK on the board? Now suddenly the training data has personal information too.

I now need to provide you with all training data where that happened?

That doesn't seem right.

Mardak commented 6 years ago

A similar issue arises with encrypted data: say a file-sending service allows anonymous uploads of data, which the server doesn't understand, and keeps them for a limited time for others to download. Does that service need to be able to produce "your data" if it doesn't even realize it has anything from you? It would be interesting to see how they deal with GDPR. My guess is that even if the law doesn't have a "best effort" type clause, a judge would maybe favor the service if it truly didn't know.

I suppose other services to compare with or inquire about are anonymous / account-less services. I'm just guessing that GDPR was primarily written for the common case of services that have the usual login / identifier because they want a strong / persistent connection to their users.

marcocalignano commented 6 years ago

We could put a disclaimer at the beginning of autogtp that says, 'by using the program you agree that your IP address is saved and used for statistical purposes inherent to the project', and then you really have to explain what you do with their IP. You could also state in the disclaimer that, if the user modifies the code, he/she agrees to the storage of any additional personal data that the modified program sends to the server.

@Mardak if the files are encrypted and the server cannot decrypt them, then they are not considered personal data.

gcp commented 6 years ago

The problem with this is that even though it allows you to store the personal data, you are then subject to all the other obligations, such as data takeaway and deletion.