Open jbfaden opened 5 years ago
Linked to #106 and #118
We talked about this a little at today's telecon, and here are some options we discussed. These are suggestions only at this point - it's not clear how best to introduce something like this, since we are still always trying to keep things as simple as possible.
Resolution or cadence is one way datasets could be related, but others are pretty common:
This is not to say HAPI should support all of these, especially the iffy ones. Cadence is fundamental to time series data, and so are the positions of in-situ measurements, but if we ever wanted to support other relationships, it would make sense to use a consistent approach for describing those relationships.
We could consider a "relatedDatasets" block, with different ways for datasets to be related.
{
  "name": "solar_wind_V1_PT1M",
  // lots of other required keywords at the dataset level
  "relatedDatasets": [
    { "name": "solar_wind_V1_P1D",
      "server": "URL of other HAPI server?", // would we consider allowing the dataset to live on another server?
      "relationship": { "differentCadence": "P1D" }
    },
    { "name": "solar_wind_V1_P10D",
      "relationship": { "differentCadence": "P10D" }
    },
    { "name": "spacecraft_ephemeris",
      "relationship": { "measurementLocation": "sc_pos_cartesian" } // name of position variables in other dataset?
    },
    { "name": "solar_wind_V2_PT1M",
      "relationship": { "version": "2.0.0" } // worms exit the can here...
    },
    { "name": "solar_wind_V1_PT1M",
      "relationship": { "version": "2.0.0" } // I would not want HAPI to put any constraints on the version numbering
    },
    { "name": "solar_wind_other_spacecraft_V1_PT1M",
      "relationship": { "sameMeasurementDifferentLocation": "MMS2_SC" } // name of other SC in constellation
    }
  ]
}
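As a rough illustration of how a client might consume such a block, here is a minimal Python sketch. It assumes the "relatedDatasets" structure proposed above (which is not part of HAPI), and the helper name related_by is made up for this example.

```python
import json

# Hypothetical info response using the proposed (not adopted) relatedDatasets block.
INFO = json.loads("""
{
  "name": "solar_wind_V1_PT1M",
  "relatedDatasets": [
    {"name": "solar_wind_V1_P1D", "relationship": {"differentCadence": "P1D"}},
    {"name": "solar_wind_V1_P10D", "relationship": {"differentCadence": "P10D"}},
    {"name": "spacecraft_ephemeris",
     "relationship": {"measurementLocation": "sc_pos_cartesian"}}
  ]
}
""")


def related_by(info, kind):
    """Return (dataset name, relationship value) pairs for one relationship type."""
    pairs = []
    for entry in info.get("relatedDatasets", []):
        relationship = entry.get("relationship", {})
        if kind in relationship:
            pairs.append((entry["name"], relationship[kind]))
    return pairs


print(related_by(INFO, "differentCadence"))
print(related_by(INFO, "measurementLocation"))
```

A client that does not understand a given relationship type can simply ignore it, which is one argument for keeping the type as a key rather than baking each type into its own keyword.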
Here are possible semantic relationships: https://www.w3.org/TR/vocab-dcat-2/#Class:Relationship
The relationship types could be more complex, so that they could have more than just one value associated with the type:
{ "name": "spacecraft_ephemeris",
  "relationship": { "measurementLocation": { "positionParameter": "sc_pos_cartesian" } }
  // a full object here is more descriptive and also expandable
}
The DCAT set of relationships is very generic, and doesn't have cadence, for example. Here's a copy of the dcat:Relationship entity:
dcat:Relationship
Usage note: Used to link to another resource where the nature of the relationship is known but does not match one of the standard [DCTERMS] properties (dct:hasPart, dct:isPartOf, dct:conformsTo, dct:isFormatOf, dct:hasFormat, dct:isVersionOf, dct:hasVersion, dct:replaces, dct:isReplacedBy, dct:references, dct:isReferencedBy, dct:requires, dct:isRequiredBy) or [PROV-O] properties (prov:wasDerivedFrom, prov:wasInfluencedBy, prov:wasQuotedFrom, prov:wasRevisionOf, prov:hadPrimarySource, prov:alternateOf, prov:specializationOf).
Here are some relevant notes from the 2022-11-07 telecon:
For the discussion about linking datasets of different cadences, here's my partial summary, starting with a high level list of possible approaches:
- Indicate in the info response of a single dataset that there are other versions available; this would involve some kind of otherCadences block that lists the other datasets; complications quickly arise if the other cadence datasets ...
- Use the capabilities response; averaging is so fundamental for time series data that allowing servers to optionally support averaging could be reasonable.
If datasets are related by different cadences, then in the info block for the high-res DSNAME dataset:
"otherCadences": [
  { "server": "URL", "dataset": "DSNAME_PT1M" },
  { "server": "URL", "dataset": "DSNAME_P1D" }
]
Note that you should not list the cadence of the other dataset, since that is available in the info response for that dataset. There are other problems with this approach: the averaged datasets will have extra parameters (avg, min, max, std_dev, maybe some uncertainties), and this block has to be replicated in multiple info responses.
What we realized eventually is that these linkages really do not belong in the info response for a particular dataset, since they are introducing dependencies into that info response from outside. The info response should only be about the data it is describing. The linkages belong at a higher level, perhaps through another endpoint. This semantics endpoint would be responsible for capturing the meanings of datasets and parameters, as well as connections between datasets, both at the full dataset level (different cadences with the same exact parameters), and potentially also at the parameter level (these parameters are statistical values from some higher time resolution parameters). This would need some thought. It might be possible to also start including our science data interfaces concepts at this level: this collection of parameters identifies a magnetic field vector; this set represents an energetic particle spectrum; this set is a plasma wave spectrum; this set is a spacecraft ephemeris.
Things we don't want to forget:
- The cadence keyword: we could add a fixedCadence keyword, or a minCadence and maxCadence, and if those are the same, then clients can assume that the cadence is fixed.
Here is the next thing to try: what would this extra endpoint look like for a known case, ground magnetometer data at three cadences (1 sec, 1 min, 1 hour)?
http://server.org/hapi/semantics
"relationships": [
  { "alternateCadences" :
    { "highestResolution": { "server": "URL", "dataset": "DSNAME_PT1S" },
      "otherCadences": [ { "server": "URL", "dataset": "DSNAME_PT1M" },
                         { "server": "URL", "dataset": "DSNAME_P1D" } ],
      "parameterLinkages": {
        // maybe have ways to indicate that parameters in this dataset are averages of the highest resolution?
      }
    }
  }
]
Or just have a way to indicate that parameters in one dataset are linked to other datasets through various enumerated relationships, like isAverage, isMin, isMax, or isStdDev.
"relationships": [
{ "isAverage" : { "source": { "server": "URL", "dataset": "DSNAME"}, "derived": {"server": "URL", "dataset": "DSNAME_PT1M"} } },
{ "isAverage" : { "source": { "server": "URL", "dataset": "DSNAME"}, "derived": {"server": "URL", "dataset": "DSNAME_P1D"} } }
]
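To see how a client could use a flat list like this, here is a small Python sketch that indexes the entries by source dataset. It assumes the isAverage structure sketched above; the function name derived_index is hypothetical and "URL" is just the placeholder from the example.

```python
from collections import defaultdict

# Hypothetical "relationships" list following the isAverage sketch above.
RELATIONSHIPS = [
    {"isAverage": {"source": {"server": "URL", "dataset": "DSNAME"},
                   "derived": {"server": "URL", "dataset": "DSNAME_PT1M"}}},
    {"isAverage": {"source": {"server": "URL", "dataset": "DSNAME"},
                   "derived": {"server": "URL", "dataset": "DSNAME_P1D"}}},
]


def derived_index(relationships):
    """Map (server, source dataset) -> list of (relationship type, derived dataset)."""
    index = defaultdict(list)
    for entry in relationships:
        for kind, link in entry.items():
            key = (link["source"]["server"], link["source"]["dataset"])
            index[key].append((kind, link["derived"]["dataset"]))
    return dict(index)


print(derived_index(RELATIONSHIPS)[("URL", "DSNAME")])
```

The flat form is easy to index this way, at the cost of repeating the source object in every entry.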
This is what I was just playing with:
{
"x_see": "https://github.com/hapi-server/data-specification/issues/78",
"x_notes": "This is not a feature of HAPI, but a proposed new feature.",
"relatedDatasets": [
{
"relationId": "voyager_lowrate_cadence",
"relationship": "cadence",
"parameter": "Amplitude",
"relations": [
{
"id": "project/voyager/lowrate/PT1H",
"cadence": "PT1H"
},
{
"id": "project/voyager/lowrate/PT4S",
"cadence": "PT4S"
}
]
}, {
"relationId": "voyager_lowrate_stats",
"relationship": "intervalStatistics",
"relations": [
{
"id": "project/voyager/lowrate/PT1H",
"parameter": "Amplitude",
"statistic": "average"
},
{
"id": "project/voyager/lowrate/PT1H",
"parameter": "PeakAmplitude",
"statistic": "maximum"
}
]
}
]
}
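One thing a client might do with a "cadence" relation like the one above is pick which dataset to request for a given time span. The following Python sketch is only an illustration under stated assumptions: the relations list mirrors the example above, the duration parser handles only a small subset of ISO 8601 (days, hours, minutes, seconds), and the names duration_seconds and pick_dataset are made up here.

```python
import re

# Subset of ISO 8601 durations (days, hours, minutes, seconds only);
# enough for cadences such as PT4S, PT1H, or P1D.
_DURATION = re.compile(
    r"^P(?:(?P<d>\d+)D)?"
    r"(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?)?$"
)


def duration_seconds(text):
    """Convert a simple ISO 8601 duration string to seconds."""
    match = _DURATION.match(text)
    if match is None or text in ("P", "PT"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = match.groupdict(default="0")
    return (int(parts["d"]) * 86400 + int(parts["h"]) * 3600
            + int(parts["m"]) * 60 + float(parts["s"]))


def pick_dataset(relations, span_seconds, max_points):
    """Pick the finest-cadence dataset that stays within max_points records.

    Falls back to the coarsest dataset if none fits.
    """
    ordered = sorted(relations, key=lambda r: duration_seconds(r["cadence"]))
    for relation in ordered:  # finest cadence first
        if span_seconds / duration_seconds(relation["cadence"]) <= max_points:
            return relation["id"]
    return ordered[-1]["id"]


RELATIONS = [
    {"id": "project/voyager/lowrate/PT1H", "cadence": "PT1H"},
    {"id": "project/voyager/lowrate/PT4S", "cadence": "PT4S"},
]

# One day of data with a budget of 5000 records.
print(pick_dataset(RELATIONS, span_seconds=86400, max_points=5000))
```

Here the cadence is listed next to each id for convenience; if it were left out, the client would have to fetch the info response of each related dataset to get it.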
Here's a potential hapi/semantics response to inform about datasets related by cadence:
{
"HAPI": "3.2",
"status": {"code": 1200, "message": "OK"},
"cadenceVariants": [
[ "ACE_MAG_PT1S", "ACE_MAG_PT1M", "ACE_MAG_P1D" ], // parameter names should be the same!
[ "ACE_PLASMA_PT1S", "ACE_PLASMA_PT1M", "ACE_PLASMA_P1D" ]
],
"cadenceVariants": [ // this flavor of cadence linkages includes interval statistics linkages
{ "groupId": "unique name at this server that refers to this group of cadence-linked data",
"derivedDatasetId": ["ACE_MAG_PT1M", "ACE_MAG_P1D"], // source names map to the same derived names in each variant
// if they don't all match, you would need a separate cadenceVariants structure for each set of names
"sourceId": "ACE_MAG_PT1S",
"intervalStatistics" :
{ "mean" : { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector" : "vp_avg"},
"median": {},
"mode": {},
"min" : { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min"},
"max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max"},
"stddev": {"Bx": "Bx_stddev" }
}
},
// if your data is on another server, the derived id is not a string, but an object, that includes server URL
{ "derivedId": [ {"server": "server.org/hapi", "id": "ACE_MAG_PT1M" } ], // is derivedId always a list?
// if they don't all match, you would need a separate intervalStatistics structure for each set of names
"sourceId": "ACE_MAG_PT1S",
"intervalStatistics" :
{ "mean" : { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector" : "vp_avg"},
"median": {},
"mode": {},
"min" : { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min"},
"max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max"},
"stddev": {"Bx": "Bx_stddev" }
}
},
]
}
Here are some refinements on the above concepts.
If you use a new dataset ID for the group of cadence-linked datasets (which should be unique on the server), it should be done for all cadence-linked groups. So instead of having two types of entries, just have the intervalStatistics be an optional part. Maybe use parameterLinkages instead of intervalStatistics? This needs more thought / discussion.
There are lots of examples of statistical values at this page about ACE MAG data: http://www.ssg.sr.unh.edu/mag/ace/HourlyParms/HourlyParms.html (way more than what we would want to support). This highlights the potential futility of including specific statistical quantities in related datasets! We will never be able to capture what any particular science team really wants. There is so much variety - are we even capturing 80%? If not, then maybe we don't even try to do parameter linkages?
This shows two types of cadence linkages: one where the params are assumed to be the same (this should be trivial for server owners to do), and one with parameter re-mappings (there's no way to make this simple).
{
"HAPI": "3.2",
"status": {"code": 1200, "message": "OK"},
"cadenceGroups": [
{ "groupId": "ACE_MAG",
"sourceId": "ACE_MAG_PT1S", // this is the highest cadence dataset
"relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D" ] // other cadences; assumes same param names
},
// if the parameter names are not all the same, then you include something to describe the parameter linkages
{ "groupId": "ACE_MAG",
"sourceId": "ACE_MAG_PT1S", // highest cadence dataset
"relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D" ], // other cadences; param mapping is below
"parameterLinkages" : {
"mean" : { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector" : "vp_avg"},
"median": {}, "mode": {},
"min" : { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min"},
"max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max"},
"stddev": {"Bx": "Bx_stddev"},
"other": {}
}
},
{ "groupId": "ACE_SOLAR_WIND",
"sourceId": "ACE_SWIND_PT1S",
"relatedIds": ["ACE_SWIND_PT1M", "ACE_SWIND_P1D" ] // parameter names should be the same!
},
// if your data is on another server, the derived id is not a string, but an object that includes the server URL
{ "groupId": "ACE_MAG",
// provide one of "sourceId" (for data on this server) or "source" (for data on another server)
"sourceId": "ACE_MAG_PT1S",
"source": { "server": "http://cdaweb.nasa.gov/hapi", "id": "ACE_MAG_PT1S" },
// provide one of "relatedIds" (for data on this server) or "relatedDatasets" (for data on one other server)
"relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D" ], // other cadences
"relatedDatasets": { "server": "http://tsds.org/hapi", "ids": [ "ACE_MAG_PT1M", "ACE_MAG_P1D"] }
}
]
}
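To make the two flavors concrete from the client side, here is a minimal Python sketch of a parameter-name lookup. It assumes the cadenceGroups layout above, with parameterLinkages as an optional part; the function name derived_parameter is hypothetical, and the rule "no linkages means identical names" is this sketch's reading of the proposal, not settled behavior.

```python
# Hypothetical cadence group following the cadenceGroups sketch above.
GROUP = {
    "groupId": "ACE_MAG",
    "sourceId": "ACE_MAG_PT1S",
    "relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D"],
    "parameterLinkages": {
        "mean": {"Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg"},
        "min": {"Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min"},
        "max": {"Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max"},
        "stddev": {"Bx": "Bx_stddev"},
    },
}


def derived_parameter(group, source_parameter, statistic):
    """Return the derived-dataset parameter name for a source parameter.

    Without parameterLinkages the names are assumed identical; with them,
    a missing statistic or parameter yields None.
    """
    linkages = group.get("parameterLinkages")
    if linkages is None:
        return source_parameter
    return linkages.get(statistic, {}).get(source_parameter)


print(derived_parameter(GROUP, "Bx", "mean"))
print(derived_parameter(GROUP, "By", "stddev"))
```

Even this small lookup shows the cost of the re-mapping flavor: the client has to know which statistic it wants before it can find a parameter name.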
Ephemeris linkages could be done too. These do not need a group name. (I'm starting to think the cadence ones don't either - maybe that's the job of the client to name the groups?)
{
"HAPI": "3.2",
"status": {"code": 1200, "message": "OK"},
"locationGroups": [
{ // this is the dataset with position info
"source": { "server": "optional", "id" : "ACE_SC_POSITION", "param": "id_of_position_vector" },
// these datasets use the indicated source for position data
"relatedIds": ["ACE_MAG_PT1S", "ACE_MAG_PT1M", "ACE_MAG_P1D", "ACE_SWIND_PT1S",
"ACE_SWIND_PT1M", "ACE_SWIND_P1D" ]
// would we want to allow wildcards here? Maybe not at first, but consider it if needed later?
// "relatedIds": ["ACE_*"]
}
]
}
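If wildcards were allowed, a client would need some matching rule. As a sketch only, the Python below uses shell-style matching from the standard library fnmatch module; that choice of pattern syntax is an assumption (the proposal does not define one), and position_source is a made-up helper name.

```python
from fnmatch import fnmatchcase

# Hypothetical location group using the wildcard form of relatedIds.
LOCATION_GROUPS = [
    {
        "source": {"id": "ACE_SC_POSITION", "param": "id_of_position_vector"},
        "relatedIds": ["ACE_*"],
    }
]


def position_source(dataset_id, location_groups):
    """Return the position source for a dataset id, or None if no group matches."""
    for group in location_groups:
        # fnmatchcase is case-sensitive and treats *, ?, and [] as special.
        if any(fnmatchcase(dataset_id, pattern) for pattern in group["relatedIds"]):
            return group["source"]
    return None


print(position_source("ACE_MAG_PT1M", LOCATION_GROUPS))
print(position_source("WIND_MAG_PT1M", LOCATION_GROUPS))
```

Note that "ACE_*" also matches ACE_SC_POSITION itself, so a wildcard rule would probably need to say whether the source dataset is excluded.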
There are other ways to do this too. The type of relationship could be specified as a (presumably enumerated) type within a more generic relationship entry:
{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "relationships": [
    { "cadence": { "sourceId": "A", "relatedIds": ["B", "C", "D"] } },
    { "location": { "source": { SRC_OBJECT }, "relatedIds": [ LIST_OF_DATASET_IDS_USING_THIS_LOCATION_SRC ] } }
  ]
}
What about doing it on a per-dataset basis (terrible JSON, but the concept is there):
{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "relationships": [
    { "linkage": { "ace-mag-10sec": {
        "filelisting": {},
        "availability / GTIs": { "server": "", "id": "" },
        "otherCadence": {},
        "ephemeris / location": {}
      } }
    },
    { "linkage": {} }
  ]
}
Problem with this: redundancy if you have multiple cadences - does each one have to list all the different linkages?
Another option: naming convention to relate datasets:
myData myData-10sec myData-PT10S myData-ephem
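A naming convention like this could be applied by clients with no server support at all. The Python sketch below groups dataset ids by the text before the first hyphen; the separator, the assumption that base names never contain a hyphen, and the function name group_by_base are all illustrative choices, not part of any agreed convention.

```python
from collections import defaultdict


def group_by_base(dataset_ids, separator="-"):
    """Group dataset ids by the text before the first separator."""
    groups = defaultdict(dict)
    for dataset_id in dataset_ids:
        base, _, suffix = dataset_id.partition(separator)
        groups[base][suffix] = dataset_id  # suffix "" marks the base dataset
    return dict(groups)


catalog = ["myData", "myData-10sec", "myData-PT10S", "myData-ephem", "other"]
for base, members in group_by_base(catalog).items():
    print(base, sorted(members.values()))
```

The weakness is visible right away: the suffix alone does not say whether "ephem" is a location dataset or "10sec" a cadence, so clients would still need a list of recognized suffixes.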
Need to also have a way of communicating that a dataset has an associated file list dataset.
We've talked in the past about how to show where a coarser or finer resolution of the data can be found, and also where availability might be found. This is to suggest that in the info there would be:
x_time_finer: DataSetId
x_time_coarser: DataSetId
x_availability: DataSetId
Chris at the U. Iowa group has shown that availability is not needed when a very coarse version of the data is available.