Needed elements for FAIR

jvandegriff commented 3 months ago

To be FAIR, HAPI needs to have these computer-readable elements in the citation:

dataset name (done)
dataset ID (done)
author
publication date
license
(See email from Rebecca)

rweigel commented 3 months ago

Dataset name -> in /catalog response, this is the required "id".
Publisher -> in /about response, this is the optional "citation" and is the citation for the data provider
Publication year -> in /info response, this is optional "creationDate"
Author(s) -> in /info response, this is the optional "citation" and is the citation for the data producer
PID -> we do not have this
Data usage license (e.g. CC0) -> we do not have this
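The mapping above could be checked mechanically, in the spirit of a verifier warning. A minimal sketch, not verifier code: the function name `missing_fair_elements` and the sample dictionaries are made up for illustration, and the "pid" and "license" keys are only the names being discussed in this thread, not fields in the current spec.

```python
# Hypothetical sketch: report which FAIR citation elements are missing
# from HAPI metadata. Field locations follow the mapping in this comment;
# "pid" and "license" are assumed names, not part of the current spec.

FAIR_ELEMENTS = {
    "dataset name": ("catalog", "id"),
    "publisher": ("about", "citation"),
    "publication year": ("info", "creationDate"),
    "author(s)": ("info", "citation"),
    "PID": ("info", "pid"),
    "license": ("info", "license"),
}


def missing_fair_elements(catalog_entry, about, info):
    """Return the names of FAIR elements that are absent or empty."""
    responses = {"catalog": catalog_entry, "about": about, "info": info}
    return [
        name
        for name, (response, key) in FAIR_ELEMENTS.items()
        if not responses[response].get(key)
    ]


# Made-up sample metadata (already parsed from JSON).
catalog_entry = {"id": "example_dataset"}
about = {"citation": "Example Data Provider"}
info = {"creationDate": "2020-01-01T00:00:00Z"}

print(missing_fair_elements(catalog_entry, about, info))
```

With the sample metadata, the elements reported as missing are the author citation, the PID, and the license.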

rweigel commented 3 months ago

For PID, will add "pid" in info response.

https://becker.wustl.edu/news/introduction-to-pids-what-they-are-and-how-to-use-them/

If there are multiple PIDs for a dataset (for example, if the data response is backed by files, each with a PID), we recommend having a separate dataset that lists the PIDs.

rweigel commented 2 months ago

Need to modify /about to have resourceID and modify citation (which can include doi to paper) to be for when resourceID does not exist.

We need to add a license attribute; verifier should warn if missing.

We need to add a provenance attribute, or maybe modify description to tell people to mention provenance; verifier should warn if missing. Think about how to say "sameAs" or "relatedTo" other HAPI datasets.

jvandegriff commented 2 months ago

We need an example of serving a dataset that is a listing of DOIs in the case where the provider has made one DOI per file. The resulting DOI dataset should have a string column with stringType of "doi".
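A rough sketch of what such a DOI-listing dataset might look like, assuming the intended stringType value is "doi" (this value is a proposal, not something in the current spec). The helper names, the parameter name `doi`, the string length, and the DOIs themselves are placeholders, and the dictionary is only a fragment of an info response, not a complete one.

```python
import json


def doi_listing_info(start, stop, max_len=64):
    """Build a fragment of an info response for a one-DOI-per-file listing."""
    return {
        "startDate": start,
        "stopDate": stop,
        "parameters": [
            {"name": "Time", "type": "isotime", "units": "UTC",
             "fill": None, "length": 20},
            # "stringType": "doi" is the proposed (hypothetical) value.
            {"name": "doi", "type": "string", "units": None,
             "fill": None, "length": max_len, "stringType": "doi"},
        ],
    }


def doi_listing_csv(records):
    """Format (file start time, DOI) pairs as CSV rows, one per file."""
    return "\n".join(f"{time},{doi}" for time, doi in records)


# Placeholder records; these DOIs are not real.
records = [
    ("2020-01-01T00:00:00Z", "10.0000/example.0001"),
    ("2020-01-02T00:00:00Z", "10.0000/example.0002"),
]

print(json.dumps(doi_listing_info("2020-01-01T00:00:00Z", "2020-01-03T00:00:00Z"), indent=2))
print(doi_listing_csv(records))
```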

jvandegriff commented 2 months ago

[edited after talking to Jeremy]

We need a provenance attribute in each dataset's info/ response to capture the static details related to the origin of a dataset.

We could also use a separate provenance/ endpoint that would describe the specific upstream source data associated with a data request for a specific time range (often a list of files). This would be a JSON response with its own schema.

This reflects the fact that there are (at least) two kinds of provenance:

the general info about how the data was made, revision number, processing level, etc
the list of specific resources that were used in fulfilling a data query over a specific time range (like a list of files used to generate a HAPI data/ request)

We should re-use any existing provenance definitions or standard expressions that are already out there.

I suggest the provenance attribute in an info/ response and the provenance/ endpoint have 3 allowed types:

1. free text
2. a listing of files from which the HAPI response came (this is very common so I'd like to support it)
3. a TBD schema from an existing provenance standard (which may end up including 1 and/or 2)

The main argument for having a separate endpoint is that any time-range-specific provenance details really go outside the scope of what the info/ response is supposed to contain. info/ responses should not depend on the time range - it's really just about the dataset as a whole. Plus, having a variable info/ response would really mess up caching.
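To make the time-range-specific case concrete, here is a minimal sketch of what a provenance/ endpoint could compute for the file-listing type. Everything in it is an assumption for illustration: the function name `provenance_for_range`, the file records, and the idea that the server keeps a start/stop time per file.

```python
from datetime import datetime


def parse_time(value):
    """Parse a restricted ISO 8601 time such as 2020-01-02T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def provenance_for_range(files, start, stop):
    """Return names of files whose [start, stop) coverage overlaps the request."""
    t0, t1 = parse_time(start), parse_time(stop)
    return [
        f["name"]
        for f in files
        if parse_time(f["start"]) < t1 and parse_time(f["stop"]) > t0
    ]


# Hypothetical upstream files backing one dataset, one file per day.
files = [
    {"name": "data_20200101.cdf", "start": "2020-01-01T00:00:00Z", "stop": "2020-01-02T00:00:00Z"},
    {"name": "data_20200102.cdf", "start": "2020-01-02T00:00:00Z", "stop": "2020-01-03T00:00:00Z"},
    {"name": "data_20200103.cdf", "start": "2020-01-03T00:00:00Z", "stop": "2020-01-04T00:00:00Z"},
]

print(provenance_for_range(files, "2020-01-02T12:00:00Z", "2020-01-03T12:00:00Z"))
```

The result changes with the requested time range, which is exactly why it does not belong in a cacheable info/ response.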

jvandegriff commented 2 months ago

For the license attribute, we should use an existing specification for how to exactly and efficiently communicate the license.

SPASE is about to add a LicenseIdentifier field based on the SPDX standard.

list of licenses in computer readable form: https://spdx.org/licenses/
you can/should use a short string for the license: https://spdx.github.io/spdx-spec/v3.0/annexes/using-SPDX-short-identifiers-in-source-files/

There's even a spec for parsing the license string if you want to:

https://spdx.github.io/spdx-spec/v3.0/annexes/SPDX-license-expressions/

It allows multiple licenses to be joined via AND, OR, etc.
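For a sense of what client-side handling of such a license string might involve, here is a small sketch that pulls the license identifiers out of an expression. It is not a full implementation of the SPDX expression grammar: it assumes upper-case AND, OR, and WITH operators, ignores parentheses, skips the exception name that follows WITH, and does not validate identifiers against the SPDX list. The function name `license_ids` is made up.

```python
import re

OPERATORS = {"AND", "OR", "WITH"}


def license_ids(expression):
    """Return the license identifiers in an SPDX-style expression, in order."""
    ids = []
    skip_next = False
    # Tokens are either a single parenthesis or a run of other characters.
    for token in re.findall(r"[()]|[^\s()]+", expression):
        if token in ("(", ")"):
            continue
        if token in OPERATORS:
            # The token after WITH names an exception, not a license.
            skip_next = token == "WITH"
            continue
        if skip_next:
            skip_next = False
            continue
        ids.append(token)
    return ids


print(license_ids("CC0-1.0"))
print(license_ids("(MIT OR Apache-2.0) AND CC-BY-4.0"))
print(license_ids("GPL-2.0-only WITH Classpath-exception-2.0"))
```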

This advice is blatantly stolen from Bobby Candey's ideas from this week's SPASE telecon, since this looks like exactly what we need in HAPI too.

jvandegriff commented 1 month ago

Rebecca was recommending to reference vocabulary-based items using this approach:

https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/subject/#subject

SPASE is considering this approach:

rights = Creative Commons Zero v1.0 Universal

rightsURL = https://spdx.org/licenses/CC0-1.0.html

rightsIdentifier = CC0-1.0

rightsIdentifierScheme = SPDX

schemeURI = https://spdx.org/licenses/

rweigel commented 1 month ago

Two proposals for handling the license consistent with how HAPI metadata works:

license - free text or special syntax, and
licenseScheme - e.g., SPDX. If this field exists, the license string is interpreted as following this scheme. (and/or)
licenseURL - A direct link to the license

However, we could do https://github.com/hapi-server/data-specification/blob/master/hapi-dev/HAPI-data-access-spec-dev.md#365-additional-metadata-object

name - identifier (e.g., 0BSD)
content - full license text (optional)
contentURL - HTML version
schemaURL - SPDX schema link
aboutURL - SPDX info link
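To compare the two proposals side by side, here is a sketch that builds each form from an SPDX short identifier. The helper names are made up, the key names follow the lists above (with "licenseScheme" assumed as the spelling), and the license URL pattern is the one used for CC0-1.0 earlier in this thread. The schemaURL entry is left out because no concrete link has been proposed for it.

```python
SPDX_LIST_URL = "https://spdx.org/licenses/"


def license_attributes(spdx_id):
    """First proposal: flat attributes in the dataset metadata."""
    return {
        "license": spdx_id,
        "licenseScheme": "SPDX",
        "licenseURL": f"{SPDX_LIST_URL}{spdx_id}.html",
    }


def license_metadata_object(spdx_id):
    """Second proposal: an entry shaped like an additional-metadata object."""
    return {
        "name": spdx_id,
        "contentURL": f"{SPDX_LIST_URL}{spdx_id}.html",
        "aboutURL": SPDX_LIST_URL,
    }


print(license_attributes("CC0-1.0"))
print(license_metadata_object("0BSD"))
```

The flat form is easier for a verifier to check for presence; the object form keeps all license details grouped and leaves room for the optional full text.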

hapi-server / data-specification

Needed elements for FAIR #217