Open jvandegriff opened 3 months ago
Dataset name -> in /catalog response, this is the required "id".
Publisher -> in /about response, this is the optional "citation" and is the citation for the data provider
Publication year -> in /info response, this is optional "creationDate"
Author(s) -> in /info response, this is the optional "citation" and is the citation for the data producer
PID -> we do not have this
Data usage license (e.g. CC0) -> we do not have this
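As a rough illustration of the mapping above, here is a minimal sketch that collects the citation fields from already-parsed /catalog, /about, and /info responses. The function name `collect_citation_fields`, the output key names, and the sample values are hypothetical; only "id", "citation", and "creationDate" come from the list above. PID and license come back as None because we do not have them yet.

```python
from typing import Any, Optional


def collect_citation_fields(
    catalog_entry: dict[str, Any],
    about: dict[str, Any],
    info: dict[str, Any],
) -> dict[str, Optional[str]]:
    """Map parsed HAPI responses onto the citation fields listed above."""
    creation_date = info.get("creationDate") or ""
    return {
        "dataset_name": catalog_entry["id"],        # required "id" in /catalog
        "publisher": about.get("citation"),         # optional in /about
        "publication_year": creation_date[:4] or None,  # year prefix of the date
        "authors": info.get("citation"),            # optional in /info
        "pid": None,                                # not available yet
        "license": None,                            # not available yet
    }


# Made-up sample values, for illustration only.
fields = collect_citation_fields(
    {"id": "example_dataset"},
    {"citation": "Example Data Provider"},
    {"creationDate": "2020-05-01T00:00:00Z", "citation": "Doe, J. (2020)"},
)
print(fields)
```

The optional fields use `.get()` so that a server that omits them yields None rather than an error.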
For PID, will add "pid" in info response.
https://becker.wustl.edu/news/introduction-to-pids-what-they-are-and-how-to-use-them/
If there are multiple PIDs for a dataset (for example, if the data response is backed by files, each with its own PID), we recommend having a separate dataset that lists the PIDs.
Need to modify /about to have resourceID and modify citation (which can include a DOI to a paper) to be for when resourceID does not exist.
We need to add a license attribute; verifier should warn if missing.
We need to add a provenance attribute; maybe modify description to tell people to mention provenance; verifier should warn if missing; think about how to say "sameAs" or "relatedTo" other HAPI datasets.
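The verifier check could look something like the sketch below. It assumes the new attributes end up being named "license" and "provenance" and that they live in the info response; neither is in the spec yet, and the function name `metadata_warnings` is made up for illustration.

```python
def metadata_warnings(info: dict) -> list[str]:
    """Return one warning per recommended attribute missing from an info response."""
    # Proposed attribute names; assumptions, not part of the current spec.
    recommended = ("license", "provenance")
    return [
        f'info response has no "{name}" attribute'
        for name in recommended
        if not info.get(name)  # missing, None, or empty string all count as missing
    ]


# Made-up info fragment that has a license but no provenance.
for warning in metadata_warnings({"license": "CC0-1.0"}):
    print("WARNING:", warning)
```

These are warnings rather than errors, so existing servers would keep passing verification.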
We need an example of serving a dataset that is a listing of DOIs in the case where the provider has made one DOI per file. The resulting DOI dataset should have a string column with stringType of "doi".
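Something like the following could be a starting point for that example. It assumes the intended stringType value is "doi" (not part of the current spec), and the dataset dates, the parameter name, and the `10.1234/...` DOIs are all made up.

```python
import json

# Hypothetical info response for a dataset that lists one DOI per file.
doi_listing_info = {
    "HAPI": "3.2",
    "status": {"code": 1200, "message": "OK"},
    "startDate": "2020-01-01T00:00:00Z",
    "stopDate": "2020-01-03T00:00:00Z",
    "parameters": [
        {"name": "Time", "type": "isotime", "units": "UTC", "fill": None, "length": 24},
        {
            "name": "doi",
            "type": "string",
            "length": 64,
            "units": None,
            "fill": None,
            "stringType": "doi",  # assumed value; not in the current spec
        },
    ],
}

# Made-up records: start time of each file and the DOI assigned to that file.
records = [
    ("2020-01-01T00:00:00.000Z", "10.1234/example.20200101"),
    ("2020-01-02T00:00:00.000Z", "10.1234/example.20200102"),
]

# CSV body of the corresponding data response: one row per file.
csv_body = "\n".join(f"{time},{doi}" for time, doi in records)

print(json.dumps(doi_listing_info, indent=2))
print(csv_body)
```

A client asking for a time range would then get back exactly the DOIs of the files covering that range.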
[edited after talking to Jeremy]
We need a provenance attribute in each dataset's info/ response to capture the static details related to the origin of a dataset.
We also could use a separate provenance/ endpoint that would describe the specific upstream source data associated with a data request for a specific time range (often a list of files). This would be a JSON response with its own schema.
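This reflects the fact that there are (at least) two kinds of provenance: the static origin of a dataset as a whole, and the upstream sources behind a specific time range (i.e., a specific data/ request).
We should re-use any existing provenance definitions or standard expressions that are already out there.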
I suggest the provenance attribute in an info/ response and the provenance/ endpoint have 3 allowed types:
The main argument for having a separate endpoint is that any time-range-specific provenance details really go outside the scope of what the info/ response is supposed to contain. info/ responses should not depend on the time range - it's really just about the dataset as a whole. Plus, having a variable info/ response would really mess up caching.
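To make the time-range dependence concrete, here is a minimal sketch of what a provenance/ handler might compute. Everything in it is an assumption: the file index, the file names, the function name `provenance_for_request`, and the response keys are placeholders, since the schema for this endpoint has not been defined.

```python
from datetime import datetime

# Hypothetical index of upstream files: (start, stop, file name).
FILE_INDEX = [
    (datetime(2020, 1, 1), datetime(2020, 1, 2), "example_20200101.cdf"),
    (datetime(2020, 1, 2), datetime(2020, 1, 3), "example_20200102.cdf"),
    (datetime(2020, 1, 3), datetime(2020, 1, 4), "example_20200103.cdf"),
]


def provenance_for_request(dataset: str, start: datetime, stop: datetime) -> dict:
    """Build a request-specific provenance record for the interval [start, stop)."""
    sources = [
        name
        for file_start, file_stop, name in FILE_INDEX
        if file_start < stop and file_stop > start  # half-open interval overlap
    ]
    return {
        "dataset": dataset,
        "start": start.isoformat() + "Z",
        "stop": stop.isoformat() + "Z",
        "sources": sources,  # changes with the requested time range
    }


result = provenance_for_request(
    "example_dataset", datetime(2020, 1, 1, 12), datetime(2020, 1, 2, 12)
)
print(result["sources"])
```

The "sources" list changes with every request, which is exactly the part that would not fit in a cacheable info/ response.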
For the license attribute, we should use an existing specification for how to exactly and efficiently communicate the license.
SPASE is about to add a LicenseIdentifier field based on the SPDX standard.
There's even a spec for parsing the license string if you want to:
This advice is blatantly stolen from Bobby Candey's ideas from this week's SPASE telecon, since this looks like exactly what we need in HAPI too.
Rebecca was recommending to reference vocabulary-based items using this approach:
https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/subject/#subject
SPASE is considering this approach:
rights = Creative Commons Zero v1.0 Universal
rightsURL = https://spdx.org/licenses/CC0-1.0.html
rightsIdentifier = CC0-1.0
rightsIdentifierScheme = SPDX
schemeURI = https://spdx.org/licenses/
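For reference, the five fields above can be derived from just the SPDX identifier plus a name lookup. A minimal sketch, assuming a small hand-maintained name table (the function name `rights_block` and the table are hypothetical; the URL pattern follows the CC0-1.0 example above):

```python
# Small lookup of SPDX identifiers to full license names (not exhaustive).
SPDX_NAMES = {
    "CC0-1.0": "Creative Commons Zero v1.0 Universal",
    "CC-BY-4.0": "Creative Commons Attribution 4.0 International",
}

SPDX_SCHEME_URI = "https://spdx.org/licenses/"


def rights_block(spdx_id: str) -> dict[str, str]:
    """Build the rights fields listed above from an SPDX license identifier."""
    if spdx_id not in SPDX_NAMES:
        raise KeyError(f"unknown SPDX identifier in this sketch: {spdx_id}")
    return {
        "rights": SPDX_NAMES[spdx_id],
        "rightsURL": f"{SPDX_SCHEME_URI}{spdx_id}.html",
        "rightsIdentifier": spdx_id,
        "rightsIdentifierScheme": "SPDX",
        "schemeURI": SPDX_SCHEME_URI,
    }


print(rights_block("CC0-1.0"))
```

This suggests the identifier and scheme are the only pieces a provider really has to type in; the rest is redundant but convenient for humans.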
Two proposals for handling the license, consistent with how HAPI metadata works:
license - free text or special syntax.
and licenseScheme - e.g., SPDX. If this field exists, the license string is interpreted as following this scheme.
(and/or) licenseURL - A direct link to the license.
However, we could instead use the additional metadata object (https://github.com/hapi-server/data-specification/blob/master/hapi-dev/HAPI-data-access-spec-dev.md#365-additional-metadata-object) with:
name - identifier (e.g., 0BSD)
content - full license text (optional)
contentURL - HTML version
schemaURL - SPDX schema link
aboutURL - SPDX info link
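A sketch of what such an entry could look like for an SPDX license, using only the five fields listed above. The function name `license_metadata_entry` is made up, the schema link is left as a caller-supplied parameter because no specific SPDX schema URL is given here, and using the SPDX license list page for aboutURL is an assumption.

```python
from typing import Optional


def license_metadata_entry(
    spdx_id: str,
    full_text: Optional[str] = None,
    schema_url: Optional[str] = None,
) -> dict[str, str]:
    """Build a license entry shaped like an additional metadata object."""
    entry = {
        "name": spdx_id,                                            # identifier
        "content": full_text,                                       # optional full text
        "contentURL": f"https://spdx.org/licenses/{spdx_id}.html",  # HTML version
        "schemaURL": schema_url,                                    # supplied by caller
        "aboutURL": "https://spdx.org/licenses/",                   # assumed info link
    }
    # Leave out anything that was not supplied instead of emitting nulls.
    return {key: value for key, value in entry.items() if value is not None}


entry = license_metadata_entry("0BSD")
print(entry)
```

Compared with separate license/licenseScheme/licenseURL attributes, this reuses an existing structure at the cost of clients having to search a list for the license entry.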
To be FAIR, HAPI needs to have these computer-readable elements in the citation:
(See email from Rebecca)