Data preservation - Githubissues

c-martinez commented 6 years ago

As suggested by @vincentvanhees

What solutions are available for long-term storage of data? Can we make an inventory of archive options (and add this to the guide)?

vincentvanhees commented 6 years ago

Thanks, and I expanded this to short-term storage while working with the data too. For example, processing data that is being synced in parallel with Dropbox / OneDrive / Google Drive may be a bad idea, but it is often done by domain scientists. What are the risks, and what advice can we give to minimize those risks?

vincentvanhees commented 6 years ago

Actually, I think the discussion drifted off a bit too much into the direction of data archiving. What I would be interested in is to have an overview of how to prevent data corruption in between data collection and data archiving. A lot of researchers temporarily store their data locally before uploading it to some kind of data archive. This is a vulnerability that could lead to corrupt data being uploaded to a data archive.

For example, the NLeSC dementia project will collect data and then store it on an external hard disk, and may inspect it with the Elan software to check its integrity. Both the software and the hard disk could hypothetically influence the integrity of the data.

I know similar examples from other researchers I interacted with. In one case the file headers were corrupted, most likely because they were checking the data integrity with Notepad and accidentally saved the file before closing it.

So, a checklist for data preservation may need to include:

If possible first backup your data before checking its integrity
Backup your data on a certified archive
Only process your data before the backup occurs if it is strictly necessary
Check the integrity of your data handling pipeline before you collect all your data (run a pilot)
Never process files that are also being synced with a cloud platform (like dropbox or onedrive)

c-martinez commented 6 years ago

In this sense, you could also do a checksum of your data when it is collected and again when it is processed, to ensure nothing has changed. I wonder, though, whether this is too prescriptive? Maybe I would not make it a checklist of steps you must take, but more a tips-and-tricks summary of tools / practices you could use if there are concerns about data integrity.

vincentvanhees commented 6 years ago

Thanks Carlos. Expanding my list of 'tips' on how to avoid getting your data corrupted before it arrives in a secure storage location:

Do a checksum on your files to check preservation of integrity. This means you will need to store the checksum somewhere.
If you do not plan to change the raw data then set file access permissions to read only.
If possible, first backup your data before checking its scientific integrity (as in: does the content look plausible?).
Backup your data as soon as possible after collection.
Check the integrity of your data handling pipeline before you collect all your data (run a pilot)
Try to avoid processing files that are also being synced with a cloud platform (like Dropbox or OneDrive).
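
As a minimal sketch of the checksum tip in the list above, assuming Python 3 is available on the machine that holds the data: the helper below hashes a file in chunks with SHA-256, so that large recordings do not have to fit in memory. The function name `sha256_of_file` and the chunk size are illustrative choices, not something prescribed in this thread.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Computing the digest right after collection and again before upload, and comparing the two strings, shows whether the bytes changed in between.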

Next, it may also be good to add some tips for secure certified data storage via Surfsara/Dans:

Can we give examples of what certifications the scientist needs to look out for, e.g. a link to a SURFsara or Zenodo URL with a description of what standards their systems meet? I searched a bit but could not find any details.

c-martinez commented 6 years ago

On storing checksums -- yes, you need to store them somewhere. But usually they are tiny, so they can be provided along with the data. In fact, some Linux distributions provide the checksum of the iso image so you can check your image when you download it.
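
To make the idea of shipping the checksum along with the data concrete, here is a rough sketch in Python 3. It writes a small sidecar file next to the data file, using the same "digest, two spaces, filename" layout that `sha256sum` prints, and later checks the data against it. The `.sha256` suffix and the function names are assumptions made for this example only.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_path):
    """Store the digest next to the data file as '<name>.sha256' (assumed naming)."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".sha256")
    sidecar.write_text(f"{file_digest(data_path)}  {data_path.name}\n")
    return sidecar


def verify_sidecar(data_path):
    """Return True if the file still matches the digest stored in its sidecar."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".sha256")
    stored_digest = sidecar.read_text().split()[0]
    return stored_digest == file_digest(data_path)
```

A sidecar only detects accidental changes. Anyone who can modify the data can also rewrite the sidecar, which is where signing comes in.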

Some links which might be nice to include to the list of tips: https://www.computerhope.com/unix/ucksum.htm https://www.rekha.com/how-to-verify-md5-sha1-and-sha256-checksum-on-mac.html https://linux.die.net/man/1/sha256sum

Just as an idea, would it be nice to write a blog post about data integrity, with kind of a 'story' of why, what and how you should handle your data?

Bob is a biologist, he just got data from his machine. Bob makes a checksum of his data to make sure it does not get corrupted. ... Bob does his analysis... Bob puts his data in zenodo, and refers to it when he publishes his results.

jiskattema commented 6 years ago

Can checksums be stored in the filename? This was common practice for large video files for a while.. ('90s).

rvanharen commented 6 years ago

Hi,

Storing the checksum in the filename is not common practice anymore. Also the short md5sums/sha1 are not considered safe anymore, so you would end up with a gigantic filename going for sha256 or sha512.

Ideally you would also want to sign the archives (which would require us to set up a 'ring of trust'). Only when the archive is both GPG-signed and checksummed can you be sure that it has not been altered.

I can provide more details if needed.

Ronald
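
The filename-length point above can be made concrete with a small Python 3 sketch that reports how many hexadecimal characters each digest would add to a filename. The function name is made up for this example, and MD5 may be unavailable in some restricted (FIPS) Python builds.

```python
import hashlib


def hex_digest_lengths(algorithms=("md5", "sha1", "sha256", "sha512")):
    """Map each algorithm name to the length of its hexadecimal digest."""
    lengths = {}
    for name in algorithms:
        # digest_size is in bytes; every byte becomes two hex characters.
        lengths[name] = hashlib.new(name).digest_size * 2
    return lengths


for name, length in hex_digest_lengths().items():
    print(f"{name}: {length} hex characters")
```

A SHA-256 digest is 64 hex characters and a SHA-512 digest is 128, against 32 for MD5.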

vincentvanhees commented 6 years ago

in addition to point 4, mention: https://www.surf.nl/en/services-and-products/surfdrive/surfdrive.html and www.re3data.org and zenodo and http://rd-alliance.github.io/metadata-directory/standards/

romulogoncalves commented 6 years ago

Links of interest: https://www.re3data.org/ http://rd-alliance.github.io/metadata-directory/standards/

We will add a section to the guide about this topic.

vincentvanhees commented 6 years ago

Let's try to make this action point a bit more specific:

1 - Where in the guide should this go? Option A: Chapter 6 'Projects', with a new paragraph 'project data storage'. Option B: Chapter 4 'Publishing Scientific Results', with a new paragraph on 'publishing data' and a subsection on data preservation in general.
2 - Who and where? I could make a first draft here in this issue, we can discuss that in the next meeting, and after that we could migrate it to a branch of the guide for further editing.

c-martinez commented 6 years ago

1 - I think Option B would make more sense.
2 - Yes, please -- a first draft would be very much appreciated, then we can iterate over it a few times.

vincentvanhees commented 6 years ago

First crude sketch of the paragraph, I will continue editing this.

Question: Do we need to make a distinction between database versus file storage?

4.2 Data storage and preservation

We strongly advise storing your research data in a secure location where regular back-ups are made before you start working with the data. If it is logistically impossible to store the data in a secure location immediately after data collection, here are some tips on how to improve data preservation in the time window between data collection and arrival of the data at a secure location. An example is when you collect data on humans in an environment without a (secure) internet connection and need to temporarily store your data offline on a laptop before being able to upload it to a data archive.

4.2.1 Tips for short term storage

Checksum and sign your data archive:

Do a checksum on your files to check preservation of integrity. This means you will need to store the checksum somewhere; checksums are usually tiny, so they can be provided along with the data. In fact, some Linux distributions provide the checksum of the ISO image so you can check your image when you download it. Storing checksums within the filename is not common practice anymore. A lot of data formats allow storing the checksum in the file, i.e. the metadata part contains the checksum of the data part; examples are NetCDF and FITS. Here are tutorials on how to do a checksum for Linux (second link), Windows, and Mac OS X.
Sign your archives with GnuPG (download), which would require you to set up a 'ring of trust'.

Only when the archive is both GPG-signed and checksummed can you be sure that it has not been altered.

File permissions and location:

If you need to work with your data, but do not plan to change it then set file access permissions to read only.
Try to avoid processing files that are also being synced with a cloud platform (like dropbox or onedrive).
Try to make a back-up if possible and store this back-up at a different physical location.
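
A minimal sketch of the read-only tip above, assuming Python 3: the helper clears the write bits on a file and leaves the other permission bits as they are. The function name `make_read_only` is chosen for this example only.

```python
import os
import stat


def make_read_only(path):
    """Clear every write bit on `path`, leaving the other mode bits untouched."""
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, current_mode & ~write_bits)
```

This guards against accidental saves, such as an editor overwriting a file that was only opened for inspection. It does not stop a privileged user, and on POSIX systems a file can still be deleted if its directory is writable, so it complements a back-up and does not replace one.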

Specific remarks on human data:

Avoid storing person identifiable information with your data where possible, but use person identifiers instead with the key being stored in a secure location.
If person identifiable information needs to be stored as part of the dataset then make sure the data and data carrier (e.g. hard-drive) is encrypted and the storage procedure complies with a data management plan approved by an ethics committee.
For all human data make sure that only data is stored for which consent was given by the participant or their guardian following the protocol approved by an ethics committee.

4.2.2 Tips for long term storage

For long term storage we advise researchers based in the Netherlands to explore the services of SURFsara (website), the collaborative organization for ICT in Dutch education and research, including but not limited to:

SURFdrive for secure data sharing up to 250 GB.
Data archive for long term storage of extremely large datasets.

For researchers outside the Netherlands alternative data storing platforms include:

www.re3data.org
https://zenodo.org/
http://rd-alliance.github.io/metadata-directory/standards/

Data storage certificates:

https://serverius.net/about/datacenter-and-network-certification/

...TO BE IMPROVED WITH MORE INFO ON DATA STORAGE VOLUME, FAIR COMPLIANCE, AND COSTS

jiskattema commented 6 years ago

link to surfdrive is broken for me https://www.surf.nl/en/service-and-products/surfdrive/surfdrive.html

jiskattema commented 6 years ago

Also, a lot of data formats allow storing the checksum in the file, i.e. the metadata part contains the checksum of the data part. Examples are NetCDF and FITS.

jiskattema commented 6 years ago

Specific remarks on storing human data: Don't do that ;) Also:

1 - This is really a big NO. We cannot guarantee all privacy rights etc. when storing it long term. Also, in all cases clear agreements on storage and removal of this type of data must be made. They should in almost all cases include the complete deletion of all personal data at the end of the project.
2 - Even if project partners want to store this data themselves, we should discuss the case with the relevant privacy experts (at NLeSC and the partners).
3 - "human data" should be "personal data" as used in the GDPR, or in Dutch "persoonsgegevens" as used in the AVG.

vincentvanhees commented 6 years ago

1 and 2. Note that this text also needs to account for research where storing person identifiable data is unavoidable, as in some branches of medical research. So, we cannot state that person identifiable data cannot be stored and needs to be deleted at the end of a project. That is just not realistic and would cause a huge loss of capital investment. Instead, we may need to combine discouraging storage where possible with providing guidance when storage is essential.

3. I intended this to be broader than the context of GDPR. I was referring to any data collected on humans for which informed consent is required based on approval by medical ethics committees, which also includes data that is not person identifiable. Probably the text needs to clarify which recommendations are based on GDPR and which recommendations are driven by guidelines on research ethics.

jiskattema commented 6 years ago

@vincentvanhees when dealing with personal data we will always and completely follow the GDPR. Period. If that means some research becomes impossible, that cannot be helped. We will not in any way help, facilitate, or advise people or projects that wilfully go against the GDPR. Also note that these are not 'recommendations' based on the GDPR; they are hard (legal) requirements.

In your case, option 2 would apply. This involves partners that are routinely dealing with personal data, and have all facilities set up for handling and storing it etc. Any changes to their policy should be made in discussion with 'Data protection officers' etc. and should never be decided on the basis of our guide alone.

For your point 3, you are confusing things. Broader than GDPR means not personal data. There is no personal data where the GDPR does not apply. The requirement for consent etc. is, as far as i know, now defined in the GDPR (AVG). Again, what you describe falls under my point 2.

Your last point, i'm not sure i can realistically define cases where research ethics would be less strict than GDPR. It is formulated very broadly, and applies automatically in cases where there is, or can be, doubt about it applying ;)

jiskattema commented 6 years ago

@vincentvanhees To add, if by broader you mean data that has other usage restrictions (licenses, contractual, ...), then there is (should be) a contract defining what we can and must do.

LourensVeen commented 6 years ago

So does that mean that epidemiology as a field is now extinct? My grandfather would be turning in his grave...

Lourens

jiskattema commented 6 years ago

@LourensVeen i hope not! you can work with personal data, but you should not depend on a page on the internets to prevent issues with data privacy ;)

vincentvanhees commented 6 years ago

@jiskattema I am not suggesting to violate the law.

GDPR provides freedom for storing personal data within a research context: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5243137/. Also mentioned in this blog post.

I think it is important that we provide guidance in the guide or at least have internal expertise on this issue, because we as NLeSC need to be able to work on personal research data within the limits of GDPR just like hundreds of research groups around NL need to.

Where I wrote "we cannot state that person identifiable data cannot be stored and needs to be deleted at the end of a project. That is just not realistic ...". I meant that if we make such statements then we also need to clarify under what conditions personal data can be stored (within GDPR limits of course).

jiskattema commented 6 years ago

Taking a step back from this discussion: linking to the paper you cited, and noting that if you are working with personal data you really should get expert advice, seems the best option to me. Especially because we are dealing with a default prohibition, with exemptions only under very specific conditions, involving trained privacy experts.

To illustrate how hard it is to write something useful and valid, let's look at your three remarks: "Avoid storing person identifiable information with your data where possible, but use person identifiers instead with the key being stored in a secure location." 'Avoid' is too weak; the paper you linked to insists on it.

"If person identifiable information needs to be stored as part of the dataset then make sure the data and data carrier (e.g. hard-drive) is encrypted and the storage procedure complies with a data management plan approved by an ethics committee." The document containing this kind of information is typically called a 'verwerkingsovereenkomst' (data processing agreement) in Dutch (for processing data that you did not collect yourself). Searching for data management plans will confuse people here. Also, the opinion of a data protection officer is needed, not only that of an ethics committee.

"For all human data make sure that only data is stored for which consent was given by the participant or their guardian following the protocol approved by an ethics committee." The paper you linked to mentions exceptions for storing without consent.

vincentvanhees commented 6 years ago

Thanks Jisk, I agree. In the data-sig meeting a month ago we agreed that I would sketch a draft for this paragraph and that the sig as a whole would then help to optimize. I am still on a learning curve for most of these topics, so it is great to have your input.

jiskattema commented 6 years ago

@vincentvanhees i know i am not qualified to work as a data protection officer ;)

c-martinez commented 6 years ago

@vincentvanhees, @jiskattema -- I've created a PR to add this section to the guide. Do you have any concrete suggestions on what/how we should update this section before merging it into the guide?

vincentvanhees commented 6 years ago

How about we try to sit down with two or three people to go over it and make some decisions about what to leave out and what to improve? I missed Monday's data-sig meeting because of the last minute talk by Aletta which messed up my timetable, otherwise we could have done this as part of the sig.

c-martinez commented 6 years ago

Sitting down with two/three people sounds like a good idea to make a first proposal -- afterwards the rest of the sig can add comments/make suggestions as required (nothing we put in the guide is set in stone anyway).

Could I let you and @jiskattema do this first proposal?

romulogoncalves commented 6 years ago

See https://github.com/NLeSC/guide/pull/135

nlesc-sigs / data-sig

Data preservation #15
