Defining a datapackage.json for Movebank data

peterdesmet commented 4 years ago

I'm attempting to convert a published Movebank dataset into a Frictionless data package, by describing the files and their structure in a datapackage.json file. I created a use case for this here. This issue is to capture feedback.

Link to file: datapackage.json

[ ] Add rdfType to terms with link to term URL
[ ] Should term URL be the versioned one?
[ ] Should we include the definition?

peterdesmet commented 4 years ago

Feedback by @sarahcd:

First thoughts: Could/should we include the URIs with the attribute info? Is it appropriate to add package-level info like license and citations, and if so how? Could we write a script to create the files with R (easiest option for me unless that is a terrible idea for some reason)? Can this be written in a way that facilitates ingestion to GBIF? Feel free to start a new thread or move to GitHub...

peterdesmet commented 4 years ago

My answers to @sarahcd:

Yes, we should include the URIs with attribute info. I think that is done with "rdfType". I will add those.
I've purposely not included any package-level metadata like license and citations, because 1) I didn't want to repeat information that is already available at the (Zenodo) repository (and then have to maintain both) and 2) I want the flexibility to update the dataset metadata without creating a new version (updates to files on Zenodo trigger a new version, while updates/corrections to metadata do not). So I just used the datapackage.json for the technical description of the data. It is possible, however, to include metadata: I tried that in an earlier version of the file.
Yes, we could create the file with an R script. I'm not sure if datapackage-r would be of use here, but what we need is:

1. provide file names of csv files + their headers
2. get information about those headers from the movebank attribute dictionary (is it available as xml/json?)
3. write that information as a datapackage.json
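
The three steps above can be sketched in a few lines. The example below uses Python purely for illustration (the thread itself proposes R). The `TERMS` lookup is a hypothetical stand-in for the Movebank Attribute Dictionary, the `rdfType` value is a placeholder URL rather than a real term URI, and the resource name `gps` and file name `gps.csv` are made up.

```python
import csv
import io
import json

# Hypothetical stand-in for the Movebank Attribute Dictionary lookup (step 2).
# The URL is a placeholder, not a real term URI.
TERMS = {
    "tag-id": {
        "title": "tag ID",
        "type": "string",
        "rdfType": "https://example.org/terms/tag-id",
    },
}


def read_header(csv_file):
    """Step 1: return the column names from the first row of a CSV file object."""
    return next(csv.reader(csv_file))


def build_resource(name, path, header, terms=TERMS):
    """Steps 1-2: describe one CSV file as a Data Package resource."""
    fields = []
    for column in header:
        # Fall back to a bare string field when the term is unknown.
        field = {"name": column, "type": "string"}
        field.update(terms.get(column, {}))
        fields.append(field)
    return {"name": name, "path": path, "schema": {"fields": fields}}


def build_package(resources):
    """Step 3: serialise the resources as the text of a datapackage.json."""
    return json.dumps({"resources": resources}, indent=2)


# Usage with an in-memory CSV; a real script would open the files on disk.
sample = io.StringIO("tag-id,timestamp\n2342,2020-01-01 00:00:00\n")
resource = build_resource("gps", "gps.csv", read_header(sample))
package_json = build_package([resource])
```

Columns that are missing from the lookup are kept as plain string fields, so the output stays usable while the dictionary is incomplete.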

This would not be ingested by GBIF directly. The datapackage.json would be used to understand the structure so it can be converted to Darwin Core.

peterdesmet commented 4 years ago

@sarahcd why do tag id and tag local indentifier (sic) both exist as concepts in the Movebank Attribute Dictionary? They are the same concept and have the same definition. It is unclear which one I should refer to (I think tag id). Same for animal id vs animal local identifier.

sarahcd commented 4 years ago

The two versions of tag/animal identifier labels were created a long time ago, and I'm not sure what the rationale was. tag-id/animal-id are the names used in reference data downloads, and tag-local-identifier/animal-local-identifier are the names used in event data downloads. As you see in the vocabulary, I use the alternative label to show they are the same thing. I could delete one set of the entries so that there is just tag-local-identifier/animal-local-identifier with the alt labels. When I make readme files I use tag/animal id, which is an arbitrary decision; fyi "local-identifier" indicates it is the user-defined ID rather than a numeric identifier assigned by the database.
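
For anyone scripting against both download types, the duplicate labels can be folded onto a single name before terms are looked up. A minimal Python sketch, assuming the two pairs named above are the only synonyms; the `SYNONYMS` mapping and the `canonical_name` helper are illustrative and not part of Movebank or any library:

```python
# Synonym pairs described above: event data label -> reference data label.
# Picking the reference data label as canonical is an arbitrary choice.
SYNONYMS = {
    "tag-local-identifier": "tag-id",
    "animal-local-identifier": "animal-id",
}


def canonical_name(label):
    """Return one canonical attribute name for a Movebank column label."""
    # Accept both "tag local identifier" and "tag-local-identifier" spellings.
    key = label.strip().lower().replace(" ", "-")
    return SYNONYMS.get(key, key)
```

Labels without a known synonym pass through unchanged, apart from the lowercase and hyphen normalisation.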

Extra metadata in readmes: I see your point and certainly have cases where I update the DataCite metadata and of course cannot update the readme.txt. I hesitate to eliminate the text readme entirely from the repository, because readmes are human readable and easy to store locally, which reduces provenance loss, especially if files get passed around after download and exist without internet access. Does Zenodo offer a human-readable download option for the updatable metadata?

peggynewman commented 4 years ago

Re 2. get information about those headers from the movebank attribute dictionary (is it available as xml/json?)

Can the machine readable terms be harvested from the NERC vocabulary service?

peterdesmet commented 4 years ago

@peggynewman this is something @sarahcd will ask NERC about, as well as adding data type and format for terms. That way, a datapackage.json could be built from the NERC vocab.

sarahcd commented 3 years ago

Note: If you go to a NERC vocab, e.g. vocab.nerc.ac.uk/collection/MVB, see Alternate Profiles in the upper right for several JSON formats. I haven't explored them yet but they look potentially useful for harvesting.

peterdesmet commented 2 years ago

A datapackage.json file as described above can now be generated automatically with the movepub R package (see tutorial).

This is done by making use of the general purpose frictionless R package and looking up the definition and URL for every field in the Movebank Attribute Dictionary (using the get_mvb_term() function). The resulting data in datapackage.json looks as follows:

{
  "name": "tag-id",
  "title": "tag ID",
  "description": "A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown. Example: '2342'; Units: none; Entity described: tag",
  "type": "string",
  "format": "default",
  "skos:exactMatch": "http://vocab.nerc.ac.uk/collection/MVB/current/MVB000181/2/"
}

skos:exactMatch was chosen over rdfType. Full example at https://github.com/inbo/bird-tracking/blob/master/data/processed/O_ASSEN/datapackage.json
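
Because each field carries its term URL, a consumer can recover the link between columns and the Movebank Attribute Dictionary without movepub. A rough Python sketch, assuming field schemas are stored inline under `resources[].schema.fields` (a schema given as a path or URL is not handled); the resource name `gps` and the `comments` field are made up for the example:

```python
import json

# A trimmed package containing the field shown above; a real script would
# load the datapackage.json from disk instead of using an inline string.
PACKAGE_TEXT = """
{
  "resources": [
    {
      "name": "gps",
      "schema": {
        "fields": [
          {
            "name": "tag-id",
            "type": "string",
            "skos:exactMatch": "http://vocab.nerc.ac.uk/collection/MVB/current/MVB000181/2/"
          },
          {"name": "comments", "type": "string"}
        ]
      }
    }
  ]
}
"""


def term_urls(package):
    """Map (resource name, field name) to its skos:exactMatch URL, if any."""
    urls = {}
    for resource in package.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            if "skos:exactMatch" in field:
                urls[(resource["name"], field["name"])] = field["skos:exactMatch"]
    return urls


mapping = term_urls(json.loads(PACKAGE_TEXT))
```

Fields without a `skos:exactMatch` property are simply left out of the mapping.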

The issues regarding synonyms or improvements to be made to the attribute dictionary are discussed in the movepub repo.

Closing this issue.

tdwg / dwc-for-biologging
