nlesc-sigs / data-sig

Linked data, data & modeling SIG

Practical checklist for making data FAIR #13

Closed · vincentvanhees closed this issue 6 years ago

vincentvanhees commented 6 years ago

I would like to make a dataset FAIR, but am unsure how FAIR my planned approach is.

My plan:

- Accessible: I can zip the raw data files and upload them to Zenodo, which will produce a DOI.
- Interpretable: there will be a paper that cites the data, and in that way the paper will help to interpret the data. Once the paper is published, I can cite it in the Zenodo metadata.
- Reproducible: the description in the paper should make the data reproducible.
- Findable: by adding enough keywords to the Zenodo metadata, I hope the data will also be findable.
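As a concrete illustration of the Accessible step, here is a minimal sketch against Zenodo's REST deposition API (which is what mints the DOI). The access token, archive name, and all metadata values below are hypothetical placeholders, not details from this thread:

```python
import os
import requests

API = "https://zenodo.org/api/deposit/depositions"
TOKEN = os.environ["ZENODO_TOKEN"]  # hypothetical: a personal access token

# Create an empty deposition.
r = requests.post(API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# Upload the zipped raw data through the deposition's file bucket.
bucket = deposition["links"]["bucket"]
with open("dataset.zip", "rb") as fp:  # hypothetical archive name
    requests.put(f"{bucket}/dataset.zip", data=fp,
                 params={"access_token": TOKEN}).raise_for_status()

# Attach metadata; the keywords are what supports findability.
metadata = {"metadata": {
    "title": "Example sleep study raw data",      # placeholder
    "upload_type": "dataset",
    "description": "Raw data files; see the data description in the archive.",
    "creators": [{"name": "Van Hees, Vincent"}],  # placeholder
    "keywords": ["polysomnography", "accelerometry", "FAIR"],
}}
requests.put(f"{API}/{deposition['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()

# Publishing mints the DOI and makes the record citable; keep this
# commented out while testing.
# requests.post(f"{API}/{deposition['id']}/actions/publish",
#               params={"access_token": TOKEN}).raise_for_status()
```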

Does the above make my data sufficiently FAIR? For example: should I, in addition to writing a paper about the data, also include a detailed data description in the dataset itself? If so, is there a generic template for data descriptions that I can adhere to, or should I follow my own intuition when writing this description?

On a related note: is there a practical 'How to FAIR' overview paper or blog post I should read?

jspaaks commented 6 years ago

Without reading any actual content:

https://fair-dom.org/knowledgehub/data-management-checklist/

The first link there seems relevant, maybe.

-Jurriaan



LourensVeen commented 6 years ago

In my experience from the field of ecology, papers never contain enough detail to reproduce a data set, or to understand exactly what the data mean. The paper describes your analysis of the data, and will typically omit details that are not relevant to your research but may be very relevant to the research that a reuser wants to do.

Keep in mind that the reader of the paper does not have the context in which the data were created and analysed, so they're not going to get the exact meaning of column names and categories unless each of them is described in some kind of metadata. That same information is needed to judge data quality and suitability for some other research question. And if a reuser wants to combine data from many data sets, the metadata should be machine-readable somehow, because otherwise they'll have to read hundreds of metadata descriptions and manually map everything to a single semantic standard.
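One way to make such column-level metadata machine-readable (an illustration only; nothing in this thread prescribes it) is a schema.org `Dataset` description in JSON-LD, here generated from Python; every name and value is an invented placeholder:

```python
import json

# Hedged sketch: schema.org "Dataset" metadata with per-column semantics
# in "variableMeasured". All values are invented placeholders.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example sleep study dataset",                   # placeholder
    "identifier": "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
    "keywords": ["polysomnography", "time series"],
    "variableMeasured": [
        {"@type": "PropertyValue", "name": "timestamp",
         "description": "Start of the 30-second epoch, ISO 8601"},
        {"@type": "PropertyValue", "name": "sleep_stage",
         "description": "Manually scored sleep stage per epoch"},
    ],
}
print(json.dumps(dataset_metadata, indent=2))
```

Embedded in a dataset's landing page, a block like this lets a harvester map columns to shared terms without reading the paper.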

Making data generally reusable is not an easy problem, so I think your question of how FAIR is FAIR enough is very apt.

To make it reproducible, you need to publish the complete software/script that was used to produce it, I think, plus all the source data. That's the only way to ensure that all the details are in there.

vincentvanhees commented 6 years ago

Thanks both, I think I now know what to do. I will focus on creating a description that at least allows a human to work with the data without having to look up all the associated papers; I have included a sketch of this description below. The step of making it machine-readable is more complicated, as there are no formal semantic standards in this research field. However, considering that this type of data is widely collected, I hope fusing it with other datasets will not be too complicated.

Sketch of description:

This data set contains 56 bin files, 28 txt files, and one csv file collected in a study to [insert brief description of aim and context of study].

- The bin files were collected by sensor brand [insert link], serial numbers xx - xxx, configured with [insert configuration settings], and can be read with open source software [insert details].
- The txt files were collected by polysomnography [insert explanation].
- The csv file contains a dictionary of all filenames, participant identification numbers, and the age, gender and diagnosis of each participant.
- A description of the experimental protocol can be found in open access paper [cite paper]; in short, [insert brief description of protocol].
- The data in the bin files are time series with columns A, B, C... [explain what those columns mean and the unit of measurement].
- The data in the txt files are time series with columns A, B, C... [explain what those columns mean and the unit of measurement].
- For questions about this data set please contact [insert email address].
- Cite ... when using this data set.
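A machine-readable companion to that human-readable sketch could be a Frictionless Data `datapackage.json` shipped alongside the files. Here is a sketch under that assumption; only the csv columns are taken from the description above, the resource names and field types are hypothetical:

```python
import json

# Hedged sketch of a Frictionless Data package descriptor; the csv
# columns follow the description above, everything else is hypothetical.
datapackage = {
    "name": "sleep-study-raw-data",
    "resources": [
        {
            "name": "participants",
            "path": "participants.csv",
            "schema": {"fields": [
                {"name": "filename", "type": "string"},
                {"name": "participant_id", "type": "string"},
                {"name": "age", "type": "integer"},
                {"name": "gender", "type": "string"},
                {"name": "diagnosis", "type": "string"},
            ]},
        },
        # ...plus one resource entry per bin/txt time series file,
        # each with its own column schema and units...
    ],
}
with open("datapackage.json", "w") as fp:
    json.dump(datapackage, fp, indent=2)
```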

arnikz commented 6 years ago

My suggestion would be to look at (some of) these links:

Let's stop here... Of note is that the list is FAIRly incomplete, and also a bit biased towards the Life Sciences. Nevertheless, it may help in making our checklist, with both the generics and the (domain) specifics, increasingly practical.

vincentvanhees commented 6 years ago

Thanks Arnold, the talk by Romulo and Carlos was also great today.

I think it will be important to emphasize in the communication with domain scientists that FAIR does not equal standardisation of data or software.

There is an inevitable diversity in how data are collected, stored, and processed. Efforts to harmonize these are useful, but I think diversity should also be encouraged, as that is also how innovation comes about in science. I know various examples from my own field where standardisation killed scientific progress. This was also the message I tried to bring across in this paper a few years ago.

I think FAIR should be about making the diversity more manageable, not about addressing the diversity itself. In other words, if domain scientists like to store their data in 5 different data formats, then that should not be the focus of the FAIR discussion. Instead, the FAIR discussion should be about encouraging scientists to at least make the data they have FAIR: making it interoperable with one suggested standard (e.g. by providing instructions on how to convert the data to that standard), rather than enforcing a certain data format.
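To make the "convert, don't enforce" idea concrete, here is a minimal sketch of the kind of converter that could accompany a data set; the native parser, column names, and unit are all hypothetical:

```python
import csv

def read_native(path):
    # Stand-in for a vendor-specific parser; a real one would decode the
    # bin/txt files mentioned earlier in this thread.
    with open(path) as fp:
        for line in fp:
            timestamp, value = line.split()
            yield {"timestamp": timestamp,
                   "value": float(value),
                   "unit": "g"}  # hypothetical unit

def to_suggested_standard(native_path, csv_path):
    """Write one native file out in the agreed exchange format (CSV here)."""
    with open(csv_path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["timestamp", "value", "unit"])
        writer.writeheader()
        writer.writerows(read_native(native_path))
```

The point is that the native format stays whatever the lab prefers; only the converter needs to target the shared standard.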

c-martinez commented 6 years ago

@vincentvanhees -- does using the FAIR metrics give you enough of a checklist for the moment? We would like to add such a checklist to the eScience guide (probably based on the FAIR metrics). So if this works for you for now, I will close this issue.

vincentvanhees commented 6 years ago

Yes, you can close the issue. It would probably be good to merge in parts of your recent lunch talk (maybe put the slides here in the repo?). Thanks

c-martinez commented 6 years ago

Good idea -- @romulogoncalves , can you add a link to the slides?