Open rweigel opened 1 year ago
This applies to all issues tagged "association". It seems that RDF was invented to address associations. So I am planning on studying https://www.w3.org/TR/rdf11-primer/.
Hi Rebecca,
We are about to attempt to tackle the problem of connecting datasets in the HAPI specification. A few of the things we want to be able to express:
Another version of this dataset is at a different cadence or quality control level (e.g., L0, L1, ... in NASA terminology and preliminary, quasi-definitive, and definitive in magnetometer speak).
This dataset contains only data when in burst mode. The nominal mode data are in another dataset.
This dataset was used to generate an event list dataset.
The files used for this dataset over a given time range are available by querying a HAPI server using a different dataset ID with the same time range.
This dataset is from a satellite that is part of a constellation (e.g., RBSP-A, RBSP-B) (SPASE does this, so we may not need it).
This seems to be an RDF use case. Do you have any suggestions on how we should proceed or know anyone with experience with this that could help?
I am not experienced in RDF, but Ryan (cc’d) is. Catherine, our digital librarian, is also experienced in metadata. In general, I recommend imitating or copying the DataCite schema (https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf), especially for datasets, and then mapping/copying that to other schema (e.g. HAPI).
Time series data are also a current topic in the ESIP SOSO group (second link). Some Earth science groups have been using the approach linked below to get time series data into schema.org, too. Likely another good link for HAPI.
Rebecca
From Baptiste:
Yes, indeed that is a nice use case.
The first step would be to check what relations are needed (e.g.: build an information model / schema), and build a "graph" with nodes and relations.
E.g. (not formally in any language, but just to propose something to start with :-):
dataset from_observatory RBSP .
dataset has_distribution distrib0 .
dataset has_distribution distrib1 .
dataset is nominal_mode .
dataset see_also dataset_L0 .
distrib0 has_resolution 1 sec .
distrib0 has_hapi_server https://...hapiurl... .
distrib0 has_hapi_dataset hapi_dataset_id0 .
distrib0 other_resolution distrib1 .
distrib1 has_resolution 10 sec .
distrib1 has_hapi_server https://...hapiurl… .
distrib1 has_hapi_dataset hapi_dataset_id1 .
distrib1 other_resolution distrib0 .
dataset_b is burst_mode .
dataset_b is_supplement_to dataset .
Then, see if there are existing terms/relations already available in other schemas/ontologies.
For instance, the concept of "dataset" is rather well defined in DCAT (https://www.w3.org/TR/vocab-dcat-3/), which also allows one to describe the "distribution" of a dataset.
In particular, see this: https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution (which already includes a temporal resolution property).
The ESIP SOSO link I sent has a link to their living agenda, which has several useful links on how other sciences are approaching this using the schema.org structure.
Before Wednesday's meeting, review info about RDF and the DCAT definitions. See if Doug L has any input.
The building blocks of RDF are subject-predicate-object triples that each represent a single fact, as exemplified by Baptiste above. You can link the same subject (e.g. a dataset) to multiple objects via a meaningful predicate. This is similar to defining properties for an object but more loosely coupled. An object from one triple can be used as a subject for another triple thus allowing you to build a graph. If you only want to hang properties off of a dataset without deeper linking, RDF might be overkill. Though you could still take inspiration from various ontologies for naming things.
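A minimal Python sketch (dataset and predicate names are invented, echoing Baptiste's example above) of how subject-predicate-object triples compose into a graph that can be walked:

```python
# Minimal illustration of RDF-style triples forming a graph.
# All names here (datasets, predicates) are hypothetical examples.
triples = [
    ("dataset", "has_distribution", "distrib0"),
    ("dataset", "has_distribution", "distrib1"),
    ("distrib0", "has_resolution", "PT1S"),
    ("distrib1", "has_resolution", "PT10S"),
    ("dataset_b", "is_supplement_to", "dataset"),
]

def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# An object of one triple can be the subject of another,
# so we can follow links: dataset -> distributions -> resolutions.
resolutions = [objects(d, "has_resolution")[0]
               for d in objects("dataset", "has_distribution")]
```

The loose coupling is the point: nothing about `distrib0` has to live "inside" the dataset record; the link itself is a standalone fact.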
The "R" is for "resource" so think of each of those three triple components as resources with unique identifiers and well defined semantics. There is no limit on how you name these things but it is clearly more useful if you adopt a preexisting ontology (think schema). Schema.org and DCAT seem to be the most popular for dataset related metadata. Google dataset search claims to support both, though it seems like the emphasis is on schema.org. DataCite also seems like a reasonable way to link related resources. Maybe even SPASE? At LASP, we take most of our inspiration from DCAT. We've added our own concepts to better capture our needs. We then strive to be able to crosswalk our metadata to other ontologies/schema.
Another important part of RDF is to be able to share your metadata in a standard format. JSON-LD (for "linked data") seems to be a common option. If we embrace RDF here, we might want to rethink the "info" response.
Ideas on dataset relationships
Cadence (this one is special so that it is machine interpretable). This needs work: it must be specific enough to be machine usable without getting tangled in the weeds. It needs some isomorphism among parameter names so that plotting tools can easily switch between datasets and still plot the same parameters. Key point: we want to support Eelco's use case of auto-selection of cadence by a timeline viewer (a client-side plotting tool). The right kind of descriptor could specify a linkage at the dataset level (dataset A is linked to dataset B by cadence), and if the datasets are not similar enough for this (i.e., the parameters differ only by cadence but have different names), then the linkage descriptor could specify connections between specific parameters.
Maybe have this in a separate endpoint for `linkages` or `relationships`? Linkages are a kind of overlay, but it would also be nice to see it in the `info` response. Argument for external: while having it in `info` is convenient, we likely need an external place to manage the complexity of the different kinds of linkages.
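As a sketch of the auto-selection use case, a timeline viewer could pick, from a set of cadence-linked datasets, the finest cadence that keeps the point count manageable for the requested span. All names, cadences, and the `max_points` heuristic below are hypothetical:

```python
import re

def parse_iso_duration(s):
    """Parse a simple ISO 8601 duration like 'PT10S' or 'PT1H' to seconds.
    (Handles only the H/M/S time components; enough for this sketch.)"""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", s)
    h, mi, sec = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mi * 60 + sec

def pick_cadence(linked_datasets, span_seconds, max_points=2000):
    """Among cadence-linked datasets, pick the finest cadence that keeps
    the number of points in the plotted span below max_points."""
    ok = [d for d in linked_datasets
          if span_seconds / parse_iso_duration(d["cadence"]) <= max_points]
    if not ok:  # span too long for all of them: fall back to the coarsest
        return max(linked_datasets, key=lambda d: parse_iso_duration(d["cadence"]))
    return min(ok, key=lambda d: parse_iso_duration(d["cadence"]))

datasets = [{"id": "ds_1s", "cadence": "PT1S"},
            {"id": "ds_1m", "cadence": "PT1M"},
            {"id": "ds_1h", "cadence": "PT1H"}]
```

The linkage metadata only needs to tell the client which datasets participate; the selection policy stays client-side.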
Others could be defined, but they are not necessarily machine interpretable (up to people to use as needed):
Calibration level (NASA Level 0, Level 1)
Processing version
Quality level
Transform (FFT, coordinate, statistical (min/max), background removal)
`/relations` returns this information in some JSON form similar to:
[
[server:dataset1, hasRelatedCadence, server:dataset2],
[server:dataset1, sameMissionName, server:dataset2],
[server:dataset1:parameter1, isATransformedVersionOf, server:dataset1:parameter2],
[server1:dataset1:parameter1, isCoordinateTransformOf, server2:dataset1:dataset3],
[server1:dataset1, isDifferentCalibrationLevelOf, server1:dataset4],
[server1:dataset5, isFileListingOf, server1:dataset1],
[dataset1, x_SameReviewerAs, dataset4]
]
We define a list of predicates. No need to specify reverse relationships. Look into RDF predicates for dataset relationships.
Next task: Come up with JSON schema for above.
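As one hedged starting point for that task, a draft JSON Schema for the triple array above might look like the following (the predicate list and the `x_` extension rule are assumptions carried over from the example, not agreed spec):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "HAPI /relations response (draft sketch)",
  "type": "array",
  "items": {
    "type": "array",
    "minItems": 3,
    "maxItems": 3,
    "prefixItems": [
      { "type": "string",
        "description": "subject, e.g. server:dataset or server:dataset:parameter" },
      { "type": "string",
        "description": "predicate from the agreed list, or an x_-prefixed extension",
        "pattern": "^(hasRelatedCadence|sameMissionName|isATransformedVersionOf|isCoordinateTransformOf|isDifferentCalibrationLevelOf|isFileListingOf|x_.+)$" },
      { "type": "string",
        "description": "object, same identifier forms as the subject" }
    ]
  }
}
```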
To make full use of associations between datasets for interactive plotting, the association between datasets is a first step. But we need to also be able to have a meaningful mapping on the parameter level. For example, I currently work with high rate satellite datasets, where it is useful to also have a low rate dataset with the per orbit (or lower cadence) minimum, mean and maximum of some (but not all) of these parameters. The OMNI datasets also have different parameter names for the same observable at the 1min, 5min and 1hr cadences. Is this something that can be accomplished with RDF, schema.org and the like? I’ll have to look into it.
As mentioned by @dlindhol, and since we use JSON in the HAPI headers, opting for JSON-LD (or other linked-data flavour) is important for interoperability (as usual). I hope we don't reinvent yet another linked-data format.
We also should reuse predicates from existing ontologies so that our links are understandable by generic tools.
As a by-product, we would have a better FAIR score when assessing our products/services with FAIR assessment tools.
@BaptisteCecconi We decided to use a very basic schema like the one above. The motivation for keeping the schema minimal is so that it will get used. If server developers need to learn something like RDF, JSON-LD, etc. to communicate the linking information, the information is unlikely to get provided. As we develop the schema, we'll develop, in parallel, software and/or a service that crawls all HAPI servers and provides what is needed for interoperability.
@eelcodoornbos
Our thinking is that you would take the response from `/relations`, which could have
[server:dataset1, hasRelatedCadence, server:dataset2]
and inspect the metadata for `dataset1` and `dataset2` to determine the available cadences. We started discussing how to provide more details in response to a `/relations` request and realized that RDF is the solution, but it is too complex (see my response to @BaptisteCecconi).
It seems like some of these relationships have properties that could be associated with them.
So instead of this: [server:dataset1, hasRelatedCadence, server:dataset2]
You can add the list of parameter mappings too, with the mappings going from dataset1 to dataset2:
[server:dataset1, hasRelatedCadence, server:dataset2,
param1_in_dataset1:param1_name_in_dataset2,
param2_in_dataset1:param2_name_in_dataset2,
param3_in_dataset1:param3_name_in_dataset2,
param4_in_dataset1:param4_name_in_dataset2,
param5_in_dataset1:param5_name_in_dataset2]
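A client could use such a mapping to switch datasets while continuing to plot the "same" parameter. A sketch (the record layout and all names are illustrative, not a proposed format):

```python
# Hypothetical relation record: a triple plus a dataset1 -> dataset2
# parameter-name mapping, as proposed above.
relation = {
    "subject": "server:dataset1",
    "predicate": "hasRelatedCadence",
    "object": "server:dataset2",
    "parameterMap": {"B_1min": "B_1hr", "Np_1min": "Np_1hr"},
}

def translate(relation, dataset, parameter):
    """Map a parameter name across the relation, in either direction."""
    fwd = relation["parameterMap"]
    if dataset == relation["subject"]:
        return relation["object"], fwd[parameter]
    rev = {v: k for k, v in fwd.items()}  # the reverse mapping is implied
    return relation["subject"], rev[parameter]
```

Note that only one direction needs to be stated; the reverse mapping can always be derived, consistent with not specifying reverse relationships.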
But then the statistics info (min, max, mean in the averaging interval) in the longer-cadence dataset are actually additional parameters, and they have specific meanings. Both Eelco and Jeremy wanted these kinds of summary stats for averaged parameters.
Are these kinds of averaging stats common enough that they belong in the relationship mapping language? Seems like they might be, especially if there are already terms for this in one of the standard sets of relationship names that Baptiste mentioned.
We should look at the existing, standard sets of RDF relationships and relationship terms and try to use them since we are ultimately looking to map to them anyway (with the standardizing layer that Bob mentioned).
I have to support Baptiste's argument here about not making up our own syntax for linking data. I have now looked a bit into JSON-LD and it does not seem so complicated and it looks to be quite well supported for programmers.
I also prefer the idea behind it that relations/links are defined where the data items (in our case the datasets and parameters) are defined, that is, as additional items under the `/hapi/info` endpoint, instead of, for example, having a separate 'relations configuration document' under a `/hapi/relations` endpoint, which would then duplicate some of the structure we already have in the `/hapi/catalog` and `/hapi/info` endpoints. This would also add the burden of keeping the duplicated structure consistent. To me, it seems much easier to give HAPI server developers the option to extend the `/hapi/info` endpoints with some JSON-LD elements instead.
It looks like the JSON-LD libraries would be helpful for crawling HAPI servers, to create the relations graph that can then be used in applications like the timeline viewer.
Thanks @eelcodoornbos :-)
Just as an example: I recently looked up the W3C Annotation standard, which proposes JSON-LD as its preferred serialisation. They have prepared a specific `context` JSON-LD file, so that the JSON-LD instances are not cluttered with namespaces and prefixes.
So if we prepare a dedicated HAPI JSON-LD context file, then the JSON-LD section of the HAPI response could be rather straightforward to write (and validate).
@BaptisteCecconi—perhaps a simple example would help clarify things. Suppose we wanted to say dataset1:parameter1 is the same as dataset2:parameter1 except for cadence. What would that look like in JSON-LD? I've reviewed these documents many times and have concluded I'd need much more time to understand them enough to use them.
I found this useful: https://developers.google.com/search/docs/appearance/structured-data/dataset
I recall discussing the fact that we should create JSON-LD for HAPI servers. It would be something an external resource builds based on HAPI JSON responses.
In terms of syntax, the choices from https://schema.org/Dataset are limited: hasPart, isPartOf, isBasedOn.
The first step is to build the information model (the predicates). So far I have seen:
cadence
calibrationLevel
processingVersion
qualityLevel
transform
parameter
missionName
sameMissionName
isATransformedVersionOf
isCoordinateTransformOf
isDifferentCalibrationLevelOf
isFileListingOf
reviewer
sameReviewerAs
{
"@context": "https://github.com/hapi-server/rdf/hapi-context.json",
"@id": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI&parameters=BGSMc",
"type": "Dataset",
"conformsTo": "HAPI",
"cadence": "PT92S",
"otherCadences": [
{
"@id": "uri_to_dataset_with_cadence_300s",
"cadence": "PT300S"
},
{
"@id": "uri_to_dataset_with_cadence_10s",
"cadence": "PT10S"
}
],
"calibrationLevel": "Calibrated",
"processingVersion": "K0",
"parameter": "BGSMc",
"missionName": "wind",
"instrumentName": "mfi",
"sameMissionName": [
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WIND_3DP_ECHSFITS_E0-YR",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_AT_DEF",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_AT_PRE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EHPD_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EHSP_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELM2_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELPD_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_ELSP_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EMFITS_E0_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EM_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIFFERENTIAL-ION-FLUX-1HR",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-CNO-FLUX-10MIN",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-FE-FLUX-10MIN",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-H-FLUX-10MIN",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_EPACT_STEP-DIRECTIONAL-DIFF-HE-FLUX-10MIN",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@0",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@1",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_MFI@2",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H0_WAV",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_WAV@0",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H1_WAV@1",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@0",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@1",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3-RTN_MFI@2",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H3_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H4-RTN_MFI",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H4_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H5_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_EPA",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SMS",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SPHA",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_K0_WAV",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-1HOUR-SEP_EPACT-APE_B",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-1HOUR-SEP_EPACT-LEMT",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-AFM-MAGNETOSPHERE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-AFM-SOLARWIND",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-ERPA-MAGNETOSPHERE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-30MIN_SMS-STICS-ERPA-SOLARWIND",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-3MIN_SMS-STICS-VDF-MAGNETOSPHERE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-3MIN_SMS-STICS-VDF-SOLARWIND",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2-5MIN-SEP_EPACT-LEMT",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2_3MIN_SMS-STICS-NVT-MAGNETOSPHERE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L2_3MIN_SMS-STICS-NVT-SOLARWIND",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_L3-DUSTIMPACT_WAVES",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_M0_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_M2_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_OR_DEF",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_OR_PRE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_PLSP_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_PM_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SFPD_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SFSP_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SOPD_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SOSP_3DP",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_STRAHL0_SWE",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_SW-ION-DIST_SWE-FARADAY",
"https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_WA_RAD1_L3_DF"
],
"isATransformedVersionOf": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI",
"isCoordinateTransformOf": "https://cdaweb.gsfc.nasa.gov/hapi/info?id=WI_H2_MFI#BGSEc"
}
The `@context` file (https://github.com/hapi-server/rdf/hapi-context.json) would have the information to link the predicates to an information schema (to be written).
{
"@context": {
"hapi": "http://hapi-server/rdf/hapi-schema#",
"dcterms": "http://purl.org/dc/terms/",
"dctypes": "http://purl.org/dc/dcmitype/",
"foaf": "http://xmlns.com/foaf/0.1/",
"type": {"@type": "@id", "@id": "@type"},
"Dataset": "dctypes:Dataset",
"cadence": "hapi:cadence",
"calibrationLevel": "hapi:calibrationLevel",
"processingVersion": "hapi:processingVersion",
"qualityLevel": "hapi:qualityLevel",
"transform": "hapi:transform",
"parameter": "hapi:parameter",
"missionName": "foaf:Project",
"instrumentName": "foaf:Project",
"sameMissionName": "hapi:sameMissionName",
"isATransformedVersionOf": "hapi:isATransformedVersionOf",
"isCoordinateTransformOf": "hapi:isCoordinateTransformOf",
"isDifferentCalibrationLevelOf": "hapi:isDifferentCalibrationLevelOf",
"isFileListingOf": "hapi:isFileListingOf",
"reviewer": "foaf:Person",
"sameReviewerAs": "hapi:sameReviewerAs",
"conformsTo": {"@type": "@id", "@id": "dcterms:conformsTo"}
}
}
Note: `@id` in JSON-LD is the linked-data graph node identifier (so it has to be unique in the local context). If you want an actual PID, you would have to include an extra "identifier" predicate (from "dcterms" or "schema.org").
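For illustration, here is roughly what a JSON-LD processor does with such a context: it expands short term names into full IRIs. This toy Python version (not a real JSON-LD library, and covering only a fragment of the context) shows the idea:

```python
# Toy illustration of JSON-LD term expansion; a real processor
# (a JSON-LD 1.1 library) also handles @id, @type, nesting, etc.
context = {
    "hapi": "http://hapi-server/rdf/hapi-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "cadence": "hapi:cadence",
    "missionName": "foaf:Project",
}

def expand_term(term):
    """Resolve a term through the context, then expand its prefix."""
    value = context.get(term, term)
    if ":" in value:
        prefix, local = value.split(":", 1)
        if prefix in context:
            return context[prefix] + local
    return value
```

This is why the instance documents can stay clean: only the context file carries the namespace plumbing.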
Of course, this is very rudimentary, and we need to explore it in more detail. However, from this first example, I would say that it looks rather non-RDF-ic to list the "same[...]As" predicates in the record. That is the job of a graph database ingesting the records, so that it can be queried and kept up to date. It is the job of the SPASE (or any future name) registry to list, e.g., what other HAPI datasets contain data from the same mission. Same for the datasets with different cadences: it seems more efficient to have a registry manage such queries.
When building such linked-data resources, the underlying assumption should be that you hard-code links only to your own resources (same server), since you don't control the URLs of the other servers.
(of course, this is a quick and dirty example)
This is very helpful.
Based on what you wrote, I think we have to address another issue. I see that we've identified two types of predicates:
Those that are redundant because this metadata already exists; they are not required for automatic processing and are more needed for search and discovery. For example, `missionName` exists elsewhere: if I wanted to determine the `missionName` associated with a HAPI dataset, I could do a SPASE query. Do we want to include this information if it already exists? Some servers may not have SPASE metadata, in which case having it would be useful. However, if we do include it, we are going down the path of developing metadata that goes beyond our stopping point, which is primarily (a) metadata needed for a machine to produce a scientifically sensible plot automatically and (b) metadata needed for science use (contact name, citation info, etc.). In this case, someone who wants to build a drop-down menu for a server that does not have SPASE (for example, INTERMAGNET or a non-helio data server) would need to develop menu logic for each server, either by querying another SPASE-like database if one exists or by inferring relationships from dataset names. (For example, if datasets were named `a/b` and `a/c`, the menu could have a top level of `a` and children of `b` and `c`.)
Those that cannot be determined from existing metadata (and are unlikely to exist in the future) and are required for automatic processing to create a scientifically sensible plot. An example: "`parameterMin` is the min of `parameter` in a window given by the cadence of `parameterMin`". This information is needed to produce a sensible (in this case, correct) plot automatically when a plot of a long time range is requested.
I suggest that we constrain ourselves to case 2, because we've always tried to avoid building an overarching metadata model and have decided to use existing metadata instead. (All of the issues tagged "association" fall into these two categories.) Before proceeding, we should probably clarify our statement in the standard that "the HAPI metadata standard is not intended for complex search and discovery" so we can more easily categorize metadata additions that are out of scope. (In particular, we should explain what we mean by "complex".)
The case 2 instances are:

`otherCadences` (assuming the automatic processor wants to overlay the two related datasets, it is case 2; otherwise, it is case 1).

"`parameter{Min,Max,Ave,Std}` is the `{min,max,ave,std}` of `parameter` in a window given by the cadence of `parameter{Min,Max,Ave,Std}`." An automatic processing algorithm would need to verify that the windows for the `parameter{Min,Max}` calculations are aligned such that `parameter{Min,Max}` can be used in the way Eelco uses it (so if `parameter` has a cadence of `PT1H`, `timeStampLocation=center`, and time stamps at `T00:30`, `T01:30`, then `parameterMin` must have `PT24H` with `timeStampLocation=center` and time stamps at `T12:00`).
"`parameterFiles` are the files from which data for `parameter` were drawn." This could also work at the dataset level.

"`parameterBurst` was measured by the same instrument but in a different sampling mode (more channels, for example)." I'd argue it is a stretch that this would be used for automatic plot generation. For example, would plotting software have an option that says "plot all related datasets"? More likely, the user would need to discover this from a "complex search and discovery" database, and we should avoid developing a metadata model that captures this.
Do we want one server with `datasetID` that is numerical data and `datasetID/files` that is URLs, and another that uses the convention `datasetID` and `FilesForDatasetID`? Should we have a recommendation? Or would this be addressed by grouping/linking as discussed in #118?