swcarpentry / good-enough-practices-in-scientific-computing

Minimalist alternatives to "best practices" paper
https://swcarpentry.github.io/good-enough-practices-in-scientific-computing/

Metadata #3

Closed noamross closed 8 years ago

noamross commented 9 years ago

Not explicitly in the list so far is metadata, explaining the source of data, meaning of fields, who/where/when collected, etc. There are lots of metadata standards, but 'good enough' practice in this area probably includes an accompanying data-specific README with prose explaining this information.

PBarmby commented 9 years ago

+1. Have seen too many text data files with no labels, references or anything else.

gvwilson commented 9 years ago

Added explicit requirement for metadata in 5c95d5fc - please let me know what you think.

jduckles commented 9 years ago

The Dublin Core elements might be a good place to start as "good enough"

gvwilson commented 9 years ago

I can't think of a single project I've touched in the last two years that used Dublin Core, but maybe I just wasn't looking.

jduckles commented 9 years ago

@gvwilson Which of these projects you speak of had any metadata standard associated with them?

gvwilson commented 9 years ago

Nothing standard-compliant, but column headers with meaningful names and units, and a separate CSV file documenting the origin of each actual data file (URL, date downloaded, a couple of other things I never cared about).
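
A minimal sketch of the kind of provenance file described above, not the actual files from those projects. The thread mentions only the URL and download date, so the column names, file name, and URL below are assumptions made for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical column names; the thread only mentions URL and download date.
PROVENANCE_FIELDS = ["data_file", "source_url", "date_downloaded"]


def provenance_row(data_file, source_url, downloaded):
    """Build one provenance record for a single data file."""
    return {
        "data_file": data_file,
        "source_url": source_url,
        "date_downloaded": downloaded.isoformat(),
    }


def write_provenance(rows, handle):
    """Write provenance records as CSV, header first, to an open text handle."""
    writer = csv.DictWriter(
        handle, fieldnames=PROVENANCE_FIELDS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Example with placeholder values (file name and URL are made up).
rows = [
    provenance_row(
        "rainfall_mm_2014.csv",
        "https://example.org/rainfall.csv",
        date(2015, 3, 1),
    ),
]
buffer = io.StringIO()
write_provenance(rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer would write the same records to disk next to the data files they describe.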

gvwilson commented 9 years ago

Comment about Dublin Core and other standards included at line 178 - thoughts?

elliewix commented 9 years ago

tl;dr: metadata != readme != codebook. Be clear about what type of metadata you are talking about. Formal metadata usually covers just the dataset as a singleton item and is great, but not the complete picture that reusers need. Formal metadata will not replace a good human readable readme file or a good codebook (if applicable). Follow your research community's standards for how to write a good readme. Look into depositing your data to a repository to get help with providing this information. There are domain (ICPSR), non-domain (figshare), and institutional repositories (usually university hosted). Many domain and institutional repositories offer curation services to help you create the more detailed readme files. Even if they don't have personal service, self-deposit forms will have you fill out the metadata for ingest.

Now, for the wordy version...

There are two types of metadata that often get conflated in dataset discussions: metadata about the dataset as a whole and metadata about the content within the dataset. Most metadata schemas you'll encounter are for the former use: they describe the dataset as a unit (e.g. author, funder, relevant papers). Pretty much every schema except DDI lacks elements that explicitly hold codebook-like information.
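
To make the distinction concrete, here is a rough sketch of the two kinds of metadata as separate records. It follows no formal schema; the class names, the codebook fields, and the `undocumented_columns` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Describes the dataset as a single unit (what most formal schemas cover)."""
    title: str
    authors: list
    funder: str
    related_papers: list = field(default_factory=list)


@dataclass
class CodebookEntry:
    """Describes one variable inside the dataset (codebook-style information)."""
    column: str
    description: str
    unit: str


def undocumented_columns(header, codebook):
    """Return the columns in a data file header that have no codebook entry."""
    documented = {entry.column for entry in codebook}
    return [name for name in header if name not in documented]


# Placeholder values, made up for this example.
dataset = DatasetMetadata(
    title="Stream temperature survey",
    authors=["A. Researcher"],
    funder="Example Funder",
)
codebook = [
    CodebookEntry("site", "Sampling site identifier", "none"),
    CodebookEntry("temp_c", "Water temperature", "degrees Celsius"),
]
print(undocumented_columns(["site", "temp_c", "notes"], codebook))
```

The dataset-level record says nothing about what `temp_c` means, which is why it cannot stand in for a codebook or readme.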

Dublin Core is one of the most generic schemas around, and it is so generic when it comes to data and code that even a hastily written readme file will cover more ground. Qualified Dublin Core might be better, but the elements are so poorly aligned with data or scientific computing that you won't find a good place to put everything you know you should be describing.

Formally structured metadata is often wasted effort if the dataset will be stored on its own rather than in a formal repository. There are some great domain-specific metadata schemas out there, and you can certainly use one as a guideline for writing your readme. Beautifully filled-out metadata XML files are for ingestion into a repository and/or directory. If the audience is humans, write it for humans. If the audience includes metadata harvesters, fill out the formal metadata and do a readme for the humans.

A formal data repository will already be using these metadata schemas in some capacity, so adding your dataset to one will usually generate that metadata automatically. Better yet, repositories usually forward it on to a harvester so your data will show up in data search engines (examples: DataCite DOIs, Google Scholar, etc.). Many repositories have curators who can help prepare more detailed and formal metadata (sometimes at a price), and other repositories offer self-deposit, where you fill out a form based on that metadata.

gvwilson commented 8 years ago

See #29 - @elliewix @jduckles @PBarmby comments welcome.