openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Retention of personal information in logs #219

Open pnorman opened 6 years ago

pnorman commented 6 years ago

This is one of my action items from LWG privacy policy matters.

We need a defined duration for how long we retain personal information in logs. The precise duration doesn't matter, but we need to state how long in the privacy policy, and have some justification.

The main personal information normally found in logs is IP addresses, user-agents, referers, and what the user requested.

I propose splitting logs into four groups:

  1. Logs from read-only secondary services (e.g. tile servers, nominatim)
  2. API and website logs for normal requests
  3. Signup-related information, which may not be stored in conventional logs (e.g. IP used for account creation on website, wiki, etc.)
  4. Other (dev.osm.org, trac.osm, git.osm, etc)

As a starting point for consideration, how about these times?

  1. 180 days. Beyond this, I only see value in the logs in aggregate, e.g. user-agent usage tracking or service-based analysis like tile request logs. If we had a good automatic system for aggregating all the logs, a shorter time would be better for that service.
  2. 1 year. We've occasionally had to look at past activity when debugging, and presumably will in the future, so individual requests in the log have more value.
  3. For as long as they have an account on that service, plus two years past that. This is essential for spam and abuse investigation.
  4. One year?

If we have a reason to retain a specific log for longer like an ongoing investigation, court order, etc, we could do so. The goal of a log retention policy is to establish defaults when there's not some special case.
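
As a rough illustration of the defaults proposed above, the retention periods could be written down as a small lookup plus an expiry check. This is only a sketch: the group keys, function names, and the `on_hold` flag (standing in for an ongoing investigation or court order) are hypothetical and do not describe any existing OSMF tooling.

```python
from datetime import datetime, timedelta
from typing import Optional

# Proposed default retention per log group. The keys are hypothetical labels.
RETENTION = {
    "secondary": timedelta(days=180),    # group 1: tile servers, nominatim
    "api_website": timedelta(days=365),  # group 2: API and website requests
    "other": timedelta(days=365),        # group 4: dev, trac, git (tentative)
}

# Group 3: signup information is kept while the account exists,
# then for two more years (approximated here as 2 * 365 days).
ACCOUNT_GRACE = timedelta(days=2 * 365)


def log_expired(group: str, logged_at: datetime, now: datetime,
                on_hold: bool = False) -> bool:
    """Return True if a log entry is past the default retention for its group."""
    if on_hold:  # special case: investigation, court order, etc.
        return False
    return now - logged_at > RETENTION[group]


def signup_info_expired(account_closed_at: Optional[datetime], now: datetime,
                        on_hold: bool = False) -> bool:
    """Return True if signup information is past the grace period after closure."""
    if on_hold or account_closed_at is None:  # account still open: keep
        return False
    return now - account_closed_at > ACCOUNT_GRACE
```

A cleanup job would call `log_expired` per file or per entry and delete or aggregate whatever comes back `True`.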

cc @simonpoole

pnorman commented 6 years ago

Edit: Changed (1) to 180 days, which agrees with the existing Piwik setup.

Firefishy commented 5 years ago

Current web access logs are as follows:

simonpoole commented 5 years ago

Current web access logs are as follows:

* Tile Cached: 1/1/2016 onward. All Logs kept. (Stored in Archived)

I can see why we would want to keep this "forever" but anything older than a couple of months could have the IP addresses truncated or otherwise anonymised without impacting the use for stats.

* Planet: 2009 onward. All Logs kept. (Stored in Archived)

Same as above.

* www: 2010 onward. All logs kept. (Stored in Archived)

Same as above.

* wiki: Approximately last 2 weeks rolling. (local logrotate)

Unproblematic.

* nominatim: Approximately last 2 months rolling. (local logrotate)

Unproblematic.

* dev: ~8 years, but varies on popularity of site (local logrotate)

As we have stuff on dev that amounts to public services, I would suggest reducing this to two months or so.

* lists, svn, git: ~2 months (local logrotate)

Unproblematic.
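
For reference, the IP truncation suggested above for the archived logs could look roughly like the following. It is a minimal sketch that assumes the client IP is the first space-separated field of each line (as in the common Apache/nginx access log formats); the /24 and /48 prefix lengths and the function names are illustrative assumptions, not an agreed policy.

```python
import ipaddress


def truncate_ip(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an address, keeping only the network prefix."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def anonymise_line(line: str) -> str:
    """Truncate the leading client IP of one access log line.

    Lines whose first field is not a valid IP address are returned unchanged.
    """
    ip, sep, rest = line.partition(" ")
    try:
        return truncate_ip(ip) + sep + rest
    except ValueError:
        return line
```

Aggregate statistics (requests per tile, per user-agent, per network) still work on the truncated lines, which is the point made above about not impacting the use for stats.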

tomhughes commented 5 years ago

I'm not sure I believe that dev number, to be honest - we don't have any logrotate that would do anything like that as far as I know.

If @Firefishy meant the rails logs in the logs directory of each checkout then those aren't being rotated at all as far as I know.

There's no real reason to keep the archived logs so long; we've just never set up anything to clean them out. Anonymising them would be a huge amount of work for little gain.

nemobis commented 3 years ago

> Anonymising them would be a huge amount of work for little gain.

Probably true in general, but for many kinds of logs there's already cryptolog, used for instance by the Internet Archive.
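
To make the idea concrete, one general approach to this kind of log anonymisation is to replace each IP with a keyed hash and throw the key away periodically, so entries stay correlatable within a period but cannot be mapped back to addresses afterwards. The sketch below shows that generic technique only; it is not cryptolog's actual code or interface, and the function names and the truncated digest length are assumptions.

```python
import hashlib
import hmac
import secrets


def new_salt() -> bytes:
    """Generate a fresh random key, e.g. once per day, and never store it long-term."""
    return secrets.token_bytes(32)


def pseudonymise_ip(ip: str, salt: bytes, length: int = 12) -> str:
    """Replace an IP address with a truncated HMAC-SHA256 of it under the current salt."""
    digest = hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:length]
```

Within one salt period the same client always maps to the same token, so per-client counting still works; once the salt is discarded the tokens can no longer be linked to addresses or to tokens from other periods.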