nlesc-sigs / data-sig

Linked data, data & modeling SIG

Even timestamps can be considered person identifiable data #34

Closed vincentvanhees closed 5 years ago

vincentvanhees commented 5 years ago

See the e-mail I am now sending to the Data-SIG, because the issue probably requires internal rather than public discussion.

jiskattema commented 5 years ago

Some comments here (public, could be useful for everyone).

I'm not sure whether timestamps + movement sensor data count as personally identifiable data, but timestamp + GPS location would probably be an issue. Anyway, it is always good to consider these things before releasing anything. An easy improvement is to subtract t0 (the first timestamp) from each time series, so that they are all in 'seconds since start' (but let's discuss offline whether this is sufficient in your case).
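A minimal sketch of the t0 subtraction, assuming each recording is available as a list of Python `datetime` objects. The function name `to_seconds_since_start` and the sample recording are made up for illustration and do not come from any SIG project.

```python
from datetime import datetime


def to_seconds_since_start(timestamps):
    """Replace absolute timestamps with offsets in seconds from the earliest sample."""
    if not timestamps:
        return []
    t0 = min(timestamps)  # earliest sample in this recording
    return [(t - t0).total_seconds() for t in timestamps]


# Hypothetical recording: three samples, five seconds apart.
recording = [
    datetime(2018, 3, 12, 8, 0, 0),
    datetime(2018, 3, 12, 8, 0, 5),
    datetime(2018, 3, 12, 8, 0, 10),
]

relative = to_seconds_since_start(recording)
print(relative)  # offsets only; the calendar date and time of day are gone
```

After this step the released series no longer carries the calendar date, the time of day, or the day of the week.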

Some general pointers:

Have a look at 'Privacy and Data Protection by Design – from policy to engineering', a report for the EU on privacy.

vincentvanhees commented 5 years ago

Thanks Jisk. The combination 'id + movement sensor data including timestamps' is indeed not sensitive in itself, but when the same data archive also holds 'id + salary + medical history', a researcher could request 'id + movement sensor data including timestamps + salary + medical history'. If the researcher has personal knowledge of who took part in the study and when, that may allow them to identify that person from the timestamps and retrieve all the sensitive information for one or more individuals. Therefore, having a consistent id-number and timestamp of measurement in an archive with human data is a potential risk.
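To make the risk concrete, here is a small sketch of such a linkage. All IDs, dates, and values are invented, and the two dictionaries stand in for two tables of a hypothetical archive that share one stable participant ID.

```python
from datetime import date, datetime

# Hypothetical archive tables keyed by the same participant ID.
movement_start = {
    "P001": datetime(2018, 3, 12, 8, 0),
    "P002": datetime(2018, 6, 4, 9, 30),
}
sensitive = {
    "P001": {"salary": 41000, "medical_history": "asthma"},
    "P002": {"salary": 56000, "medical_history": "none"},
}


def candidates_measured_on(day, recording_starts):
    """Return the IDs whose recording started on the given calendar day."""
    return [pid for pid, start in recording_starts.items() if start.date() == day]


# The researcher happens to know that an acquaintance was measured on 4 June 2018.
known_day = date(2018, 6, 4)
matches = candidates_measured_on(known_day, movement_start)

# A unique match ties the acquaintance to one ID, and through it to the sensitive record.
leaked = sensitive[matches[0]] if len(matches) == 1 else None
```

The fewer participants measured on the same day, the more likely the match is unique.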

I know that some data owners partially address this by generating a new participant ID with every data request. This means that you can only link data types within the same release, and cannot combine data obtained from different data requests. In this case the data owner keeps an ID conversion key for every data release. This limits the problem to those researchers who request both the movement sensor data with timestamps included and the sensitive salary or medical history data in a single data request.
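A rough sketch of the per-release ID scheme described above, under the assumption that the data owner draws a fresh random ID for every participant in every release and keeps the conversion key private. The function names and the example data are hypothetical, not taken from any actual archive.

```python
import secrets


def make_release_key(participant_ids):
    """Build a conversion key (archive ID -> random release ID) for one release."""
    key = {}
    used = set()
    for pid in participant_ids:
        new_id = secrets.token_hex(8)
        while new_id in used:  # guard against an (extremely unlikely) repeat
            new_id = secrets.token_hex(8)
        used.add(new_id)
        key[pid] = new_id
    return key


def pseudonymise(records, key):
    """Re-key the records with the release-specific IDs."""
    return {key[pid]: value for pid, value in records.items()}


# Made-up archive content.
movement = {"P001": [0.1, 0.3], "P002": [0.2, 0.0]}
salary = {"P001": 41000, "P002": 56000}

# Two separate data requests, each with its own conversion key.
release_a = pseudonymise(movement, make_release_key(movement))
release_b = pseudonymise(salary, make_release_key(salary))

# With independent random IDs, the two releases have (almost certainly) no ID in common.
shared_ids = set(release_a) & set(release_b)
```

Only the data owner, who holds both conversion keys, can map the two releases back to the same participants. A single request that covers both data types still gets one shared key, which is the remaining gap mentioned above.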

Subtracting t0 means that information on seasonality and day of week is lost, and that information is often needed for this kind of analysis. Therefore, the solution these data owners typically use is to grant secured access only after an 'expert' panel has critically reviewed and approved a data request. I am highlighting this because it is an example of where the points you raised above are not necessarily sufficient. I suppose there are similarities here with the challenges of sharing data with facial pictures or DNA information.

I am re-opening the issue to make sure at least one of the SIG leads sees it (and then closes it).

c-martinez commented 5 years ago

Hi @vincentvanhees, @jiskattema,

Thanks for starting this discussion. Very complex issue indeed, and a fine balance between keeping the data anonymous and having useful data. For instance @jiskattema's suggestion of subtracting t0 from every series is a good step to anonymize data but -- as @vincentvanhees rightly points out -- makes it impossible to correlate with seasonal patterns.

The take-home message here is that you always need to think hard about how sensitive your data potentially is and what you want to use it for.

I'm closing this issue, but please re-open it if you would like to discuss this further (perhaps during the next SIG meeting).