datasets related by resolution or availability

jbfaden commented 5 years ago

We've talked in the past about how to show where a coarser or finer resolution of the data can be found, and also where availability might be found. This is to suggest that in the info there would be:

x_time_finer: DataSetId x_time_coarser: DataSetId x_availability: DataSetId

Chris at the U. Iowa group has shown that availability is not needed when a very coarse version of the data is available.

jbfaden commented 3 years ago

sandyfreelance commented 3 years ago

Linked to #106 and #118

jvandegriff commented 2 years ago

We talked about this a little at today's telecon, and here are some options we discussed. These are suggestions only at this point - it's not clear how best to introduce something like this, since we are still always trying to keep things as simple as possible.

Resolution or cadence is one way datasets could be related, but others are pretty common:

this dataset has the associated positions of the measurement (spacecraft ephemeris, ship track, airplane route)
similar or maybe same exact dataset from same instrument on different s/c in a constellation
different versions (new calibration file triggers reprocessing)
(this is getting iffy) different background subtraction technique (SuperMAG has this)
(still iffy) similar dataset but from a follow-on mission (more radiation belt electron measurements)
(also iffy and starting to be something else altogether) part of a collection of datasets that were useful for a particular kind of study (different measurements of the same phenomenon)

This is not to say HAPI should support all of these, especially the iffy ones. Cadence is fundamental to time series data, and so are the positions of in-situ measurements, but if we ever wanted to support other relationships, it would make sense to use a consistent approach for describing those relationships.

We could consider a "relatedDatasets" block, with different ways for datasets to be related.

{
  "name": "solar_wind_V1_PT1M",
  // lots of other required keywords at the dataset level
  "relatedDatasets": [
      { "name": "solar_wind_V1_P1D",
        "server": "URL of other HAPI server?",   // would we consider allowing the dataset to live on another server?
        "relationship": { "differentCadence": "P1D" }
      },
      { "name": "solar_wind_V1_P10D",
        "relationship": { "differentCadence": "P10D" }
      },
      { "name": "spacecraft_ephemeris",
        "relationship": { "measurementLocation": "sc_pos_cartesian" } // name of position variables in other dataset?
      },
      { "name": "solar_wind_V2_PT1M",
        "relationship": { "version": "2.0.0" }   // worms exit the can here...
      },
      { "name": "solar_wind_V1_PT1M",
        "relationship": { "version": "2.0.0" }   // I would not want HAPI to put any constraints on the version numbering
      },
      { "name": "solar_wind_other_spacecraft_V1_PT1M",
        "relationship": { "sameMeasurementDifferentLocation": "MMS2_SC" } // name of other SC in constellation
      }
  ]
}

Here are possible semantic relationships: https://www.w3.org/TR/vocab-dcat-2/#Class:Relationship

The relationship types could be more complex, so that they could have more than just one value associated with the type:

          { "name": "spacecraft_ephemeris",
             "relationship": { "measurementLocation": { "positionParameter": "sc_pos_cartesian" } }
             // a full object here is more descriptive and also expandable
          }

jvandegriff commented 1 year ago

The DCAT set of relationships is very generic, and doesn't have cadence, for example. Here's a copy of the dcat:Relationship entity:

dcat:Relationship

Usage note: Used to link to another resource where the nature of the relationship is known but does not match one of the standard [DCTERMS] properties (dct:hasPart, dct:isPartOf, dct:conformsTo, dct:isFormatOf, dct:hasFormat, dct:isVersionOf, dct:hasVersion, dct:replaces, dct:isReplacedBy, dct:references, dct:isReferencedBy, dct:requires, dct:isRequiredBy) or [PROV-O] properties (prov:wasDerivedFrom, prov:wasInfluencedBy, prov:wasQuotedFrom, prov:wasRevisionOf, prov:hadPrimarySource, prov:alternateOf, prov:specializationOf).

jvandegriff commented 1 year ago

Here are some relevant notes from the 2022-11-07 telecon:

For the discussion about linking datasets of different cadences, here's my partial summary, starting with a high level list of possible approaches:

datasets can be linked via a postfix on the dataset name using an ISO duration; this works now, but is too fragile / difficult to enforce, so no one wants to do just this; we still think datasets should be named this way for clarity
advertising within the info response of a single dataset that there are other versions available; this would involve some kind of otherCadences block that lists the other datasets; complications quickly arise if the other cadence datasets
we could also link individual parameters in a similar way: this parameter has another cadence version available in this other dataset (and it also has another name in that dataset, and by the way, here is what the other cadence is)
we could expose different cadences as possible discrete filters on a standardized averaging filter that is exposed in the capabilities response; averaging is so fundamental for time series data, that allowing servers to optionally support averaging could be reasonable
there was interest by several folks (Jeremy, Doug, and Eelco's software used this) to have averaged datasets also offer min/max/std_dev, etc; so there was interest in being able to express that certain parameters are actually statistical quantities resulting from other parameters in another dataset
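To make the first option concrete, here is a minimal sketch of how a client might group dataset IDs by an ISO 8601 duration postfix. The function names and the sample IDs are only illustrative assumptions, and the parser handles just day/hour/minute/second designators. It also shows the fragility mentioned above: a malformed postfix (`PT1D` is not a valid ISO 8601 duration) silently drops out of its group.

```python
import re
from collections import defaultdict

# Simple ISO 8601 durations only: days, hours, minutes, seconds (an assumption
# of this sketch; weeks, months, and years are not handled).
DURATION = re.compile(
    r"^P(?:(?P<d>\d+)D)?"
    r"(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?)?$"
)


def duration_seconds(text):
    """Return the length in seconds of a simple ISO 8601 duration, or None."""
    match = DURATION.match(text)
    if match is None or not any(match.groupdict().values()):
        return None
    d, h, m, s = (float(v) if v else 0.0 for v in match.group("d", "h", "m", "s"))
    return d * 86400 + h * 3600 + m * 60 + s


def group_by_cadence_suffix(dataset_ids):
    """Group IDs that share a base name and end in _<ISO duration>."""
    groups, unmatched = defaultdict(list), []
    for dataset_id in dataset_ids:
        base, sep, suffix = dataset_id.rpartition("_")
        seconds = duration_seconds(suffix) if sep else None
        if seconds is None:
            unmatched.append(dataset_id)
        else:
            groups[base].append((seconds, dataset_id))
    # Within each group, sort from finest to coarsest cadence.
    return {base: sorted(members) for base, members in groups.items()}, unmatched


ids = ["ACE_MAG_PT1S", "ACE_MAG_P1D", "ACE_MAG_PT1M", "ACE_MAG_PT1D", "ACE_SC_POSITION"]
groups, unmatched = group_by_cadence_suffix(ids)
```

With these IDs, the three well-formed `ACE_MAG` names end up in one group ordered by cadence, while `ACE_MAG_PT1D` and `ACE_SC_POSITION` are left unmatched.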

If datasets are related by different cadences, then in the info block for high-res DSNAME dataset:

"otherCadences": [
  { "server": "URL",  "dataset": "DSNAME_PT1M" },
  { "server": "URL",  "dataset": "DSNAME_P1D" }
]

Note that you should not list the cadence of the other dataset, since that is available in the info response for that dataset. There are other problems with this approach: the averaged datasets will have extra parameters (avg, min, max, std_dev, maybe some uncertainties), and this block has to be replicated in multiple info responses.

What we realized eventually is that these linkages really do not belong in the info response for a particular dataset, since they are introducing dependencies into that info response from outside. The info response should only be about the data it is describing. The linkages belong at a higher level, perhaps through another endpoint. This semantics endpoint would be responsible for capturing the meanings of datasets and parameters, as well as connections between datasets, both at the full dataset level (different cadences with the same exact parameters), and potentially also at the parameter level (these parameters are statistical values from some higher time resolution parameters). This would need some thought. It might be possible to also start including our science data interfaces concepts at this level: this collection of parameters identifies a magnetic field vector; this set represents an energetic particle spectrum; this set is a plasma wave spectrum; this set is a spacecraft ephemeris.

Things we don't want to forget:

we need a way to indicate that a dataset has a fixed, regular cadence (actually, why do we need this? Clients should always code defensively against "phase shifts" in supposedly regular data. I think this is an analysis issue and not a data serving issue)
we already have a cadence keyword, so we could add a fixedCadence keyword, or a minCadence and maxCadence and if those are the same, then clients can assume that the cadence is fixed
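As an illustration of the "code defensively" point, a client-side check could look something like the sketch below. The function name, the tolerance argument, and the sample timestamps are assumptions made for this example; nothing here is part of HAPI.

```python
from datetime import datetime, timedelta


def irregular_steps(times, cadence, tolerance=timedelta(0)):
    """Return (index, step) for each sample whose spacing from the previous
    sample differs from the nominal cadence by more than the tolerance."""
    found = []
    for i in range(1, len(times)):
        step = times[i] - times[i - 1]
        if abs(step - cadence) > tolerance:
            found.append((i, step))
    return found


# Nominal 1-second data with one missing record between :02 and :04.
times = [datetime(2022, 11, 7, 0, 0, s) for s in (0, 1, 2, 4, 5)]
irregular = irregular_steps(times, timedelta(seconds=1))
```

A client that runs a check like this does not need the server to promise a fixed cadence; the advertised cadence is only treated as a hint.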

Here is the next thing to try: what would this extra endpoint look like for a known case, such as ground magnetometer data at three cadences (1 sec, 1 min, 1 hour)?

http://server.org/hapi/semantics

"relationships": [
 { "alternateCadences" :
     { "highestResolution": { "server": "URL", "dataset": "DSNAME_PT1S" },
       "otherCadences": [ { "server": "URL", "dataset": "DSNAME_PT1M"},
                          { "server": "URL", "dataset": "DSNAME_P1D"} ],
       "parameterLinkages": {
            // maybe have ways to indicate that parameters in this dataset are averages of the highest resolution?
        }
     }
  }
]

Or just have a way to indicate that parameters in one dataset are linked to other datasets through various enumerated relationships, like isAverage, isMin, isMax, or isStdDev.

"relationships": [
 { "isAverage" : { "source": { "server": "URL", "dataset": "DSNAME"}, "derived": {"server": "URL", "dataset": "DSNAME_PT1M"} } },
 { "isAverage" : { "source": { "server": "URL", "dataset": "DSNAME"}, "derived": {"server": "URL", "dataset": "DSNAME_P1D"} } }
]
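
For what it's worth, a flat list of enumerated relationships like this is easy for a client to query. The sketch below uses a hypothetical helper name and, as a simplifying assumption, ignores the "server" field and matches on the dataset name only.

```python
relationships = [
    {"isAverage": {"source": {"server": "URL", "dataset": "DSNAME"},
                   "derived": {"server": "URL", "dataset": "DSNAME_PT1M"}}},
    {"isAverage": {"source": {"server": "URL", "dataset": "DSNAME"},
                   "derived": {"server": "URL", "dataset": "DSNAME_P1D"}}},
]


def derived_from(relationships, source_dataset, relation):
    """List datasets linked to source_dataset through the given relation."""
    found = []
    for entry in relationships:
        link = entry.get(relation)
        # Simplification: compare dataset names only, not servers.
        if link and link["source"]["dataset"] == source_dataset:
            found.append(link["derived"]["dataset"])
    return found


averages = derived_from(relationships, "DSNAME", "isAverage")
minima = derived_from(relationships, "DSNAME", "isMin")
```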

jbfaden commented 1 year ago

This is what I was just playing with:

{ 
    "x_see": "https://github.com/hapi-server/data-specification/issues/78",
    "x_notes": "This is not a feature of HAPI, but a proposed new feature.",
    "relatedDatasets": [
        {
            "relationId": "voyager_lowrate_cadence",
            "relationship": "cadence",
            "parameter": "Amplitude",
            "relations": [
                { 
                    "id": "project/voyager/lowrate/PT1H", 
                    "cadence": "PT1H"
                },
                {
                    "id": "project/voyager/lowrate/PT4S", 
                    "cadence": "PT4S"
                }
            ]
        }, {
            "relationId": "voyager_lowrate_stats",
            "relationship": "intervalStatistics",
            "relations": [
                { 
                    "id": "project/voyager/lowrate/PT1H", 
                    "parameter": "Amplitude",
                    "statistic": "average"
                },
                {
                    "id": "project/voyager/lowrate/PT1H", 
                    "parameter": "PeakAmplitude",
                    "statistic": "maximum"
                }
            ]
        }
    ]
}

jvandegriff commented 1 year ago

Here's a potential hapi/semantics response to inform about datasets related by cadence

{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "cadenceVariants": [
      [ "ACE_MAG_PT1S", "ACE_MAG_PT1M", "ACE_MAG_P1D" ],   // parameter names should be the same!
      [ "ACE_PLASMA_PT1S", "ACE_PLASMA_PT1M", "ACE_PLASMA_P1D" ]
  ],
  "cadenceVariants": [ // this flavor of cadence linkages includes interval statistics linkages
      { "groupId": "unique name at this server that refers to this group of cadence-linked data",
        "derivedDatasetId": ["ACE_MAG_PT1M", "ACE_MAG_P1D"], // source names map to the same derived names in each variant
        // if they don't all match, you would need a separate cadenceVariants structure for each set of names
        "sourceId": "ACE_MAG_PT1S",
        "intervalStatistics":
            { "mean": { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector": "vp_avg" },
              "median": {},
              "mode": {},
              "min": { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min" },
              "max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max" },
              "stddev": { "Bx": "Bx_stddev" }
            }
      },
      // if your data is on another server, the derived id is not a string, but an object that includes the server URL
      { "derivedId": [ { "server": "server.org/hapi", "id": "ACE_MAG_PT1M" } ], // is derivedId always a list?
        // if they don't all match, you would need a separate intervalStatistics structure for each set of names
        "sourceId": "ACE_MAG_PT1S",
        "intervalStatistics":
            { "mean": { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector": "vp_avg" },
              "median": {},
              "mode": {},
              "min": { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min" },
              "max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max" },
              "stddev": { "Bx": "Bx_stddev" }
            }
      }
  ]
}

jvandegriff commented 1 year ago

Here are some refinements on the above concepts.

If you use a new dataset ID for the group of cadence-linked datasets (which should be unique on the server), it should be done for all cadence linked groups. So instead of having two types of entries, just have the intervalStatistics be an optional part. Maybe use parameterLinkages instead of intervalStatistics? This needs more thought / discussion.

There are lots of examples of statistical values at this page about ACE MAG data: http://www.ssg.sr.unh.edu/mag/ace/HourlyParms/HourlyParms.html (way more than what we would want to support. This highlights the potential futility of including specific statistical quantities in related datasets! We will never be able to capture what any particular science team really wants. There is so much variety - are we even capturing 80%? If not, then maybe we don't even try to do parameter linkages?)

This shows two types of cadence linkages: one where the params are assumed to be the same (this should be trivial for server owners to do), and one with parameter re-mappings (there's no way to make this simple).

{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "cadenceGroups": [
      { "groupId": "ACE_MAG",
        "sourceId": "ACE_MAG_PT1S", // this is the highest cadence dataset
        "relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D"]   // other cadences; assumes same param names
      },
      // if the parameter names are not all the same, then you include something to describe the parameter linkages
      { "groupId": "ACE_MAG",
        "sourceId": "ACE_MAG_PT1S", // highest cadence dataset
        "relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D"], // other cadences; param mapping is below
        "parameterLinkages": {
            "mean": { "Bx": "Bx_avg", "By": "By_avg", "Bz": "Bz_avg", "proton_velocity_vector": "vp_avg" },
            "median": {},  "mode": {},
            "min": { "Bx": "Bx_min", "By": "By_min", "Bz": "Bz_min", "proton_velocity_vector": "vp_min" },
            "max": { "Bx": "Bx_max", "By": "By_max", "Bz": "Bz_max" },
            "stddev": { "Bx": "Bx_stddev" },
            "other": {}
        }
      },
      { "groupId": "ACE_SOLAR_WIND",
        "sourceId": "ACE_SWIND_PT1S",
        "relatedIds": ["ACE_SWIND_PT1M", "ACE_SWIND_P1D"]   // parameter names should be the same!
      },
      // if your data is on another server, the derived id is not a string, but an object that includes the server URL
      { "groupId": "ACE_MAG",
        // provide one of "sourceId" (for data on this server) or "source" (for data on another server)
        "sourceId": "ACE_MAG_PT1S",
        "source": { "server": "http://cdaweb.nasa.gov/hapi", "id": "ACE_MAG_PT1S" },
        // provide one of "relatedIds" (for data on this server) or "relatedDatasets" (for data on one other server)
        "relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D"], // other cadences; param mapping is below
        "relatedDatasets": { "server": "http://tsds.org/hapi", "ids": ["ACE_MAG_PT1M", "ACE_MAG_P1D"] }
      }
  ]
}
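
Here is a rough sketch of what a client could do with one of these cadence groups: pick the finest dataset that keeps a request under some point budget. The helper name, the `max_points` default, and the `cadence_seconds` lookup are all assumptions for illustration; the cadences would have to come from the info response of each dataset, since the group itself does not list them.

```python
def pick_dataset(group, cadence_seconds, span_seconds, max_points=10_000):
    """Return the finest-cadence dataset in the group whose expected number
    of records over span_seconds does not exceed max_points."""
    candidates = [group["sourceId"], *group["relatedIds"]]
    candidates.sort(key=lambda ds: cadence_seconds[ds])  # finest first
    for ds in candidates:
        if span_seconds / cadence_seconds[ds] <= max_points:
            return ds
    return candidates[-1]  # nothing fits the budget; fall back to the coarsest


group = {
    "groupId": "ACE_MAG",
    "sourceId": "ACE_MAG_PT1S",
    "relatedIds": ["ACE_MAG_PT1M", "ACE_MAG_P1D"],
}
# Assumed to have been read from the "cadence" keyword of each info response.
cadence_seconds = {"ACE_MAG_PT1S": 1, "ACE_MAG_PT1M": 60, "ACE_MAG_P1D": 86400}

one_hour = pick_dataset(group, cadence_seconds, 3600)
one_day = pick_dataset(group, cadence_seconds, 86400)
one_year = pick_dataset(group, cadence_seconds, 365 * 86400)
```

This only works cleanly for the first kind of group, where the parameter names are the same in every dataset; with re-mapped parameters the client would also have to translate names through parameterLinkages.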

Ephemeris linkages could be done too. These do not need a group name. (I'm starting to think the cadence ones don't either - maybe that's the job of the client to name the groups?)

{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "locationGroups": [
      { // this is the dataset with position info
        "source": { "server": "optional", "id": "ACE_SC_POSITION", "param": "id_of_position_vector" },
        // these datasets use the indicated source for position data
        "relatedIds": ["ACE_MAG_PT1S", "ACE_MAG_PT1M", "ACE_MAG_P1D", "ACE_SWIND_PT1S",
                       "ACE_SWIND_PT1M", "ACE_SWIND_P1D"]
        // would we want to allow wildcards here?  Maybe not at first, but consider it if needed later?
        // "relatedIds": ["ACE_*"]
      }
  ]
}
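
On the wildcard question, here is a small sketch of how a client might resolve the position source for a dataset, assuming glob-style matching (that semantics is my assumption, not something we have decided). The helper name is hypothetical. One thing it brings out: a pattern like `ACE_*` also matches the position dataset itself, so that case has to be excluded explicitly.

```python
from fnmatch import fnmatchcase

location_groups = [
    {
        "source": {"id": "ACE_SC_POSITION", "param": "id_of_position_vector"},
        "relatedIds": ["ACE_*"],
    }
]


def position_source(dataset_id, location_groups):
    """Return the source object that holds positions for dataset_id, or None."""
    for group in location_groups:
        source = group["source"]
        if dataset_id == source["id"]:
            continue  # the wildcard would otherwise match the position dataset
        if any(fnmatchcase(dataset_id, pattern) for pattern in group["relatedIds"]):
            return source
    return None


mag_source = position_source("ACE_MAG_PT1M", location_groups)
self_source = position_source("ACE_SC_POSITION", location_groups)
other_source = position_source("WIND_MAG_PT1M", location_groups)
```

Explicit IDs go through the same code path, since a name without wildcard characters only matches itself.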

There are other ways to do this too. The type of relationship could be specified as a (presumably enumerated) type within a more generic relationship entry:

{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "relationships": [
      { "cadence": { "sourceId": "A", "relatedIds": ["B", "C", "D"] } },
      { "location": { "source": { SRC_OBJECT }, "relatedIds": [ LIST_OF_DATASET_IDS_USING_THIS_LOCATION_SRC ] } }
  ]
}

jvandegriff commented 1 year ago

What about doing it on a per-dataset basis:

(terrible JSON, but concept is there):

{
  "HAPI": "3.2",
  "status": {"code": 1200, "message": "OK"},
  "relationships": [
      { "linkage": { "ace-mag-10sec": {
                         "filelisting": {},
                         "availability / GTIs": { "server": "", "id": "" },
                         "otherCadence": {},
                         "ephemeris / location": {}
                     } } },
      { "linkage": {} }
  ]
}

Problem with this: redundancy if you have multiple cadences - does each one have to list all the different linkages?

jvandegriff commented 1 year ago

Another option: naming convention to relate datasets:

myData myData-10sec myData-PT10S myData-ephem

rweigel commented 8 months ago

Need to also have a way of communicating that a dataset has an associated file list dataset.
