pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Establish retention policy and retent old data #3532

Open htgoebel opened 6 years ago

htgoebel commented 6 years ago

I found that PyPI stores IP addresses and exact action dates for several years, e.g.:

create | Nov 24, 2008, 11:33:29 AM | htgoebel from 79.207.178.171

According to the privacy policy there are four reasons to store the data. FMPOV none of these reasons requires storing this exact data for almost 10 years. The day and user might be of interest, but the exact time and IP address for sure is not.

Please establish a retention policy for deleting old data and then delete this old data. Thanks!

Background: As you might know, the European General Data Protection Regulation (GDPR) requires all services offered to the European market to have a retention policy. Also, a European court has decided that IP (v4) addresses are personal data too.

dstufft commented 6 years ago

I just want to ack this request, and say that I don't know the answer to this yet, will have to do some internal research to figure out what applies here.

brainwane commented 6 years ago

@dstufft has done some research and we're waiting for him to update this issue with what he learned. :) Thanks for the report @htgoebel.

brainwane commented 5 years ago

@dstufft could you please reply to this thread with your data? cc @ewdurbin with his PSF hat on.

Per our meeting today @nlhkabu is going to do some research on this toward #5863, understanding best practices & prior art in other similar sites. Simply Secure may have some good resources on this.

nlhkabu commented 5 years ago

I couldn't find anything on Simply Secure, but I did manage to find a couple of other sources:

GDPR

Recital 39 of the GDPR states that the period for which the personal data is stored should be limited to a strict minimum and that time limits should be established by the data controller for deletion of the records (referred to as erasure in the GDPR) or for a periodic review.

Organisations must therefore ensure personal data is securely disposed of when no longer needed. This will reduce the risk that it will become inaccurate, out of date or irrelevant.

National Cyber Security Centre (UK Gov)

This is a very useful guide: https://www.ncsc.gov.uk/guidance/introduction-logging-security-purposes

Are logs held for long enough to answer incident questions? For each log source you hold, you need to decide how long to store the data. This will depend on a number of factors including the cost and availability of storage, and the volume and usefulness of different data types (see Logging source section below). In general, we recommend that you hold logs which allow you to answer the incident questions from step 2 for a minimum of 6 months. The M-Trends 2018 report suggests that the average time to detect a cyber attack is 101 days and it's not uncommon for this figure to be significantly longer, so you may wish to store for longer if budget allows. Review and fine-tune as necessary.

Prior art

htgoebel commented 5 years ago

Please note that the retention policy must not only include server log files, but also the action log for each of the packages.

woodruffw commented 5 years ago

> FMPOV none of these reasons requires storing this exact data for almost 10 years. The day and user might be of interest, but the exact time and IP address for sure is not.

FWIW, exact time and IP address do serve a forensic purpose: they make it easier to triage and establish provenance when doing a postmortem. As an example:

Project Foo has had 50 releases, 45 of which came from an IP range publicly associated with a hosting provider (probably CI) and published within 5 minutes of midnight at timezone X (probably a cronjob). The last 5 releases came from varying IPs, some of which show up in blacklists, and upload times indicate timezone Y.

In terms of policy, it might make sense to research (if any research exists?) the average time between package breach and discovery/triage and use that (with a sufficient window) as the baseline for removing IPs and exact timestamps.
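To make the idea concrete, here is a minimal sketch of what such a baseline could look like in code. Everything in it is an assumption for illustration: the `JournalEntry` shape, the `redact` helper, and the 180-day `RETENTION` window are hypothetical and are not taken from Warehouse or from any agreed policy. It keeps the day and user but drops the IP and the exact time once an entry is older than the window.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical retention window; the thread does not settle on a value.
RETENTION = timedelta(days=180)


@dataclass(frozen=True)
class JournalEntry:
    # Hypothetical record shape, loosely modeled on the entry quoted in the issue.
    action: str
    user: str
    when: datetime
    ip: Optional[str]


def redact(entry: JournalEntry, now: datetime) -> JournalEntry:
    """Return recent entries unchanged; for older ones, drop the IP and keep only the day."""
    if now - entry.when <= RETENTION:
        return entry
    day_only = entry.when.replace(hour=0, minute=0, second=0, microsecond=0)
    return replace(entry, when=day_only, ip=None)


if __name__ == "__main__":
    now = datetime(2019, 1, 1, tzinfo=timezone.utc)
    old = JournalEntry(
        "create",
        "exampleuser",
        datetime(2008, 11, 24, 11, 33, 29, tzinfo=timezone.utc),
        "203.0.113.7",  # documentation-range address, not a real user IP
    )
    print(redact(old, now))
```

The window length is the actual policy question; the code only shows that truncating to the day and dropping the IP are cheap once a window has been chosen.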

htgoebel commented 5 years ago

> FWIW, exact time and IP address do serve a forensic purpose:

Keep in mind: Privacy is a Human Right, but there is no right to forensics.

> if any research exists?

Obviously there has been none for the last 10 years. Thus there is no need to keep this data.

Keeping data just for the vague case that someone, at some point, might eventually be interested in it is not a reason; it is data retention without a legal basis.

According to the EU GDPR, neither forensics nor research is a reason to give data retention precedence over the person's legal rights.