hapi-server / data-specification

HAPI Data Access Specification
https://hapi-server.org

Needed elements for FAIR #217

Open jvandegriff opened 3 months ago

jvandegriff commented 3 months ago

To be FAIR, HAPI needs to have these computer-readable elements in the citation:

rweigel commented 3 months ago
Dataset name -> in /catalog response, this is the required "id".
Publisher -> in /about response, this is the optional "citation" and is the citation for the data provider
Publication year -> in /info response, this is optional "creationDate"
Author(s) -> in /info response, this is the optional "citation" and is the citation for the data producer
PID -> we do not have this
Data usage license (e.g. CC0) -> we do not have this
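
The mapping above can be sketched as a small function that gathers the citation elements from already-parsed responses. This is only an illustration: the function name, the sample values, and the output keys are assumptions, and "pid" and "license" are left empty because HAPI does not yet define them.

```python
from typing import Any, Optional


def citation_elements(
    dataset_id: str, about: dict[str, Any], info: dict[str, Any]
) -> dict[str, Optional[str]]:
    """Collect FAIR citation elements from parsed HAPI responses.

    dataset_id is the "id" of an entry in the /catalog response.
    """
    creation = info.get("creationDate")
    return {
        "dataset_name": dataset_id,
        "publisher": about.get("citation"),  # optional in /about
        # HAPI times start with a four-digit year
        "publication_year": creation[:4] if creation else None,
        "authors": info.get("citation"),  # optional in /info
        "pid": None,  # no HAPI field yet
        "license": None,  # no HAPI field yet
    }


# Hypothetical sample responses, trimmed to the relevant fields.
about = {"id": "ExampleServer", "citation": "Example Data Center"}
info = {"creationDate": "2019-05-01T00:00:00Z", "citation": "Doe, J., Example Dataset"}

elements = citation_elements("dataset1", about, info)
missing = sorted(key for key, value in elements.items() if value is None)
print(missing)  # ['license', 'pid']
```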
rweigel commented 3 months ago

For PID, will add "pid" in info response.

https://becker.wustl.edu/news/introduction-to-pids-what-they-are-and-how-to-use-them/

If there are multiple PIDs for a dataset (for example, if the data response is backed by files, each with its own PID), we recommend serving a separate dataset that lists the PIDs.

rweigel commented 2 months ago

We need to modify /about to include resourceID, and redefine citation (which can include a DOI for a paper) as the fallback for when resourceID does not exist.

We need to add a licence attribute; verifier should warn if missing

We need to add a provenance attribute (or maybe modify description to tell people to mention provenance); the verifier should warn if it is missing. We should also think about how to say "sameAs" or "relatedTo" for other HAPI datasets.
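
A minimal sketch of the verifier behavior described above, assuming the new attributes end up being named "license" and "provenance" in the /info response. Neither name is in the spec yet, and the function name is made up for illustration.

```python
def metadata_warnings(info: dict) -> list[str]:
    """Return one warning per recommended attribute that is missing or empty."""
    warnings = []
    # Attribute names are assumptions taken from this discussion.
    for attr in ("license", "provenance"):
        if not info.get(attr):
            warnings.append(f"info response has no '{attr}' attribute")
    return warnings


# Hypothetical /info response that has a license but no provenance.
example_info = {"HAPI": "3.2", "license": "CC0-1.0"}
for message in metadata_warnings(example_info):
    print("WARNING:", message)
```

These are warnings rather than errors, so existing servers that lack the attributes would still pass verification.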

jvandegriff commented 2 months ago

We need an example of serving a dataset that is a listing of DOIs in the case where the provider has made one DOI per file. The resulting DOI dataset should have a string column with stringType of "doi".
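
As a rough sketch of what such a DOI-listing dataset could look like, the following builds a trimmed info response and the matching CSV records. The DOIs, dates, and parameter name are placeholders, and the stringType value is an assumption, not something the spec currently defines.

```python
import json

# Hypothetical records: one DOI per file, keyed by the file's start time.
records = [
    ("2020-01-01T00:00:00Z", "10.0000/example.0001"),
    ("2020-01-02T00:00:00Z", "10.0000/example.0002"),
]

info = {
    "startDate": "2020-01-01T00:00:00Z",
    "stopDate": "2020-01-03T00:00:00Z",
    "parameters": [
        {"name": "Time", "type": "isotime", "units": "UTC", "fill": None,
         "length": len(records[0][0])},
        {"name": "doi", "type": "string", "units": None, "fill": None,
         # string parameters need a length; use the longest value
         "length": max(len(doi) for _, doi in records),
         "stringType": "doi"},  # assumed value; not defined by the spec
    ],
}

# CSV data response: one line per record.
csv_lines = [f"{time},{doi}" for time, doi in records]

print(json.dumps(info, indent=2))
print("\n".join(csv_lines))
```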

jvandegriff commented 2 months ago

[edited after talking to Jeremy]

We need a provenance attribute in each dataset's info/ response to capture the static details related to the origin of a dataset.

We also could use a separate provenance/ endpoint that would describe the specific upstream source data associated with a data request for a specific time range (often a list of files). This would be a JSON response with its own schema.

This reflects the fact that there are (at least) two kinds of provenance:

  1. the general info about how the data was made, revision number, processing level, etc
  2. the list of specific resources that were used in fulfilling a data query over a specific time range (like a list of files used to generate a HAPI data/ request)

We should re-use any existing provenance definitions or standard expressions that are already out there.

I suggest the provenance attribute in an info/ response and the provenance/ endpoint have 3 allowed types:

  1. free text
  2. a listing of files from which the HAPI response came (this is very common so I'd like to support it)
  3. use a TBD schema from an existing provenance standard (which may end up including 1 and/or 2)

The main argument for having a separate endpoint is that any time-range-specific provenance details really go outside the scope of what the info/ response is supposed to contain. info/ responses should not depend on the time range - it's really just about the dataset as a whole. Plus, having a variable info/ response would really mess up caching.
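
To make the file-listing case concrete, here is a sketch of how a provenance/ response might be assembled for a requested time range. The response layout, key names, and file names are all hypothetical, since no schema has been chosen.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Hypothetical upstream files: (name, coverage start, coverage stop).
FILES = [
    ("data/example_20200101.cdf", "2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z"),
    ("data/example_20200102.cdf", "2020-01-02T00:00:00Z", "2020-01-03T00:00:00Z"),
    ("data/example_20200103.cdf", "2020-01-03T00:00:00Z", "2020-01-04T00:00:00Z"),
]


def parse(time_string: str) -> datetime:
    return datetime.strptime(time_string, TIME_FORMAT)


def provenance_response(dataset: str, start: str, stop: str) -> dict:
    """List the files whose coverage overlaps the half-open range [start, stop)."""
    t0, t1 = parse(start), parse(stop)
    used = [
        name
        for name, file_start, file_stop in FILES
        if parse(file_start) < t1 and parse(file_stop) > t0
    ]
    # Assumed layout: a type tag plus the file listing.
    return {
        "dataset": dataset,
        "start": start,
        "stop": stop,
        "provenance": {"type": "files", "files": used},
    }


response = provenance_response("dataset1", "2020-01-01T12:00:00Z", "2020-01-02T12:00:00Z")
print(response["provenance"]["files"])
```

Because the listing depends on the requested range, it lives in its own response, and the info/ response stays the same for every request and remains cacheable.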

jvandegriff commented 2 months ago

For the license attribute, we should use an existing specification for how to exactly and efficiently communicate the license.

SPASE is about to add a LicenseIdentifier field based on the SPDX standard.

There's even a spec for parsing the license string if you want to:

This advice is blatantly stolen from Bobby Candey's ideas from this week's SPASE telecon, since this looks like exactly what we need in HAPI too.

jvandegriff commented 1 month ago

Rebecca was recommending to reference vocabulary-based items using this approach:

https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/subject/#subject

SPASE is considering this approach:

rights = Creative Commons Zero v1.0 Universal

rightsURL = https://spdx.org/licenses/CC0-1.0.html

rightsIdentifier = CC0-1.0

rightsIdentifierScheme = SPDX

schemeURI = https://spdx.org/licenses/
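
The fields above follow a regular pattern, so they could be generated from the SPDX identifier. A minimal sketch, assuming the rightsURL is always the scheme URI plus the identifier plus ".html" as in the CC0 example (the function name is hypothetical):

```python
def rights_fields(spdx_id: str, name: str) -> dict[str, str]:
    """Build DataCite-style rights fields from an SPDX license identifier."""
    scheme_uri = "https://spdx.org/licenses/"
    return {
        "rights": name,
        # Assumes the per-license page follows the pattern in the example above.
        "rightsURL": f"{scheme_uri}{spdx_id}.html",
        "rightsIdentifier": spdx_id,
        "rightsIdentifierScheme": "SPDX",
        "schemeURI": scheme_uri,
    }


fields = rights_fields("CC0-1.0", "Creative Commons Zero v1.0 Universal")
for key, value in fields.items():
    print(f"{key} = {value}")
```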
rweigel commented 1 month ago

Two proposals for handling licence consistent with how HAPI metadata works:

However, we could do https://github.com/hapi-server/data-specification/blob/master/hapi-dev/HAPI-data-access-spec-dev.md#365-additional-metadata-object

name - identifier (e.g., 0BSD)
content - full license text (optional)
contentURL - HTML version
schemaURL - SPDX schema link
aboutURL - SPDX info link