toltec-dev / organization

Documents, policies and meeting minutes of the Toltec organization.

Define privacy policy #5

Open matteodelabre opened 4 years ago

matteodelabre commented 4 years ago

The published repository runs nginx, which keeps logs about who accesses the files. Collected information is (following default nginx rules):

We should probably define a retention time for this information, and maybe reduce the scope of collected information (e.g. anonymise IP addresses after a given time). The logs could be used anonymously to publish stats about which packages are used.
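As a rough illustration of what "anonymising" an IP address could mean, here is a minimal sketch that zeroes the host part of an address and keeps only the network prefix. The function name `anonymise_ip` and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions chosen for the example, not something decided in this issue.

```python
import ipaddress


def anonymise_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an IP address, keeping only the network prefix.

    The default prefix lengths are illustrative assumptions.
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets ip_network accept an address with host bits set
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


# Documentation-range addresses, used here purely as examples
print(anonymise_ip("203.0.113.57"))
print(anonymise_ip("2001:db8:abcd:12:3456::1"))
```

Truncation keeps coarse information such as the network a request came from, so whether it is anonymous enough would still need to be settled by the policy.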

Eeems commented 4 years ago

Is this a prerequisite for toltec-dev/toltec#15?

matteodelabre commented 4 years ago

I think it is. I’ll add it to the milestone. Thanks!

LinusCDE commented 4 years ago

You could probably use fluentd to redirect the logs into a database. From there, it would be easier to auto-delete old entries, or to not store the IP in the database at all.

Here is a fluentd.conf I made and used: https://gist.github.com/LinusCDE/9ba8b79f115272dcbe2371cacb815288

There is also a docker-compose file you can use to spin up a database along with a simple interface for browsing it.
The Elasticsearch part can be removed, though you can also go down the rabbit hole of pairing it with Kibana and get a lot of statistics very easily.

The cool thing about fluentd is that it can ingest logs from many sources (Docker natively supports it as a logging driver) and emit nicely formatted JSON per log entry that can be sent basically anywhere.

Here is a sample entry from my MongoDB that fluentd put there (it was an nginx log entry with server_name added, and machine_id was added by fluentd):

{
    _id: ObjectId('5ec42b813395ae000fde7593'),
    remote: 'xxx.xxx.xxx.xxx',  // IP removed
    host: '-',
    user: '-',
    method: 'POST',
    path: '/api/v4/jobs/request',
    code: '204',
    size: '0',
    referer: '-',
    agent: 'gitlab-runner 12.10.1 (12-10-stable; go1.13.8; linux/amd64)',
    machine_id: 'ozelo',
    time: ISODate('2020-05-19T18:54:33.000Z')
}

Whether nginx never logs the IP, fluentd strips it, or a client connected to the database removes it periodically is up to you.
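For the third option (a client that periodically removes the IP), the logic could look roughly like the sketch below. It works on plain dictionaries shaped like the sample entry above rather than on a live database, so it stays independent of any driver. The name `scrub_old_entries` and the 30-day `RETENTION` value are hypothetical; no retention time has been agreed on here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period, chosen only for the example
RETENTION = timedelta(days=30)


def scrub_old_entries(entries, now, retention=RETENTION):
    """Return copies of the entries, masking `remote` on those older than `retention`."""
    cutoff = now - retention
    scrubbed = []
    for entry in entries:
        entry = dict(entry)  # copy, so the caller's data is left untouched
        if entry["time"] < cutoff:
            entry["remote"] = "-"  # same placeholder nginx uses for empty fields
        scrubbed.append(entry)
    return scrubbed


# Made-up entries using documentation-range IP addresses
entries = [
    {"remote": "203.0.113.57", "path": "/a",
     "time": datetime(2020, 5, 19, 18, 54, 33, tzinfo=timezone.utc)},
    {"remote": "198.51.100.7", "path": "/b",
     "time": datetime(2020, 6, 15, 12, 0, 0, tzinfo=timezone.utc)},
]
now = datetime(2020, 6, 20, tzinfo=timezone.utc)
for entry in scrub_old_entries(entries, now):
    print(entry["remote"], entry["path"])
```

A real job would run the same comparison as an update query against whatever backend is chosen, on a schedule.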

One could probably also use a Grafana server to show statistics from the data in MongoDB (or whatever backend you choose).
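Independently of the dashboard tool, the kind of anonymous statistic discussed in this issue boils down to counting requests per path without looking at who made them. Here is a minimal sketch over entries shaped like the sample above; `package_counts` and the example paths are made up for illustration, and treating "GET with code 200" as a download is an assumption.

```python
from collections import Counter


def package_counts(entries):
    """Count successful GET requests per path, ignoring the requester entirely."""
    return Counter(
        entry["path"]
        for entry in entries
        # fluentd stored the status code as a string in the sample entry
        if entry["method"] == "GET" and entry["code"] == "200"
    )


# Hypothetical log entries; the paths do not refer to real packages
sample = [
    {"method": "GET", "path": "/example/foo.ipk", "code": "200"},
    {"method": "GET", "path": "/example/foo.ipk", "code": "200"},
    {"method": "GET", "path": "/example/bar.ipk", "code": "404"},
    {"method": "GET", "path": "/example/baz.ipk", "code": "200"},
]
for path, count in package_counts(sample).most_common():
    print(path, count)
```

Since only `method`, `path` and `code` are read, such counts can still be computed after the IP has been removed from the stored entries.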

If you need help with the fluentd, MongoDB or Grafana setup, feel free to ask.

matteodelabre commented 3 years ago

Here’s the relevant section of the GDPR regarding whether it is necessary to obtain user consent before collecting and processing user information. In particular, consent is not required when the processing is necessary for compliance with a legal obligation or for “legitimate interests”. I would say that keeping a log containing IP addresses and user agents, at least for a set amount of time, is necessary for security purposes. French law actually mandates that such logs be kept for one year (not sure about other countries).