Handling raw and manipulated data

urbanobservatory / standards

Standards and schema documentation for the observatories programme


Handling raw and manipulated data #17

Open birminghamurbanobservatory opened 5 years ago

birminghamurbanobservatory commented 5 years ago

Any suggestions on how to handle observations that are manipulated on the server side?

For example:

An anemometer uploads readings in volts, which are converted to mph on the server side.
Raw temperature readings in °C have a small calibration adjustment applied on the server side, e.g. +0.1°C.

Ideally we'd store both the raw and manipulated data so we can go back and apply a different manipulation if required.
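
One way to keep both values is to store the raw reading next to the manipulated one, together with a reference to the manipulation that was applied. The following is only a minimal sketch: the `StoredReading` type, the field names, the manipulation identifier, and the linear volts-to-mph factor are all hypothetical and not part of any agreed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredReading:
    """A raw reading kept alongside the value derived from it."""
    raw_value: float
    raw_unit: str
    value: float
    unit: str
    manipulation: str  # hypothetical identifier of the manipulation applied


# Hypothetical calibration: assumes the anemometer output is linear in wind speed.
MPH_PER_VOLT = 20.0


def volts_to_mph(volts: float) -> StoredReading:
    """Convert a voltage to mph while retaining the original voltage."""
    return StoredReading(
        raw_value=volts,
        raw_unit="volt",
        value=volts * MPH_PER_VOLT,
        unit="mph",
        manipulation="anemometer-volts-to-mph",
    )


reading = volts_to_mph(2.5)
print(reading.raw_value, reading.raw_unit, "->", reading.value, reading.unit)
```

Because the raw value is never overwritten, a different manipulation can be applied later without re-collecting data.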

I suspect some end users will be interested in seeing the raw data in the API response, so we need to decide how to serve this alongside the final value. I.e. where will it fit in the following:

{
  "@type": "ssn-ext:ObservationCollection",
  "madeBySensor": "https://example.com/sensor/a",
  "lastObservation": {
    "@id": "https://example.com/sensor/a/no2-concentration/observations/2019-05-31T19:45:00Z",
    "resultTime": "2019-05-31T19:45:00Z",
    "observedProperty": {
      "@id": "https://example.com/property/no2-concentration"
    },
    "hasResult": {
      "@type": ["sosa:Result", "qudt-1-1:QuantityValue"],
      "unit": "something:MicrogramsPerCubicMetre",
      "numericValue": "39.9"
    }
  }
}

Both the raw and manipulated readings have the same:

Sensor (unless it's derived from multiple sensors, e.g. Dew Point temperature can be derived from combining air temperature and relative humidity readings)
resultTime (unless some server-side averaging is taking place)
Feature Of Interest
Observable Property

But they could have different:

units
value types, e.g. a numeric voltage value of 2.7 could be converted to a string such as "low". We already touched on this slightly here.

This could prove to be tricky....

aarepuu commented 5 years ago

I guess one option would be to add the manipulated observations as sample collections (https://www.w3.org/TR/vocab-ssn/#Sampling). Sampling is meant to be used when "observations cannot be made directly on the ultimate feature of interest, either because the entire feature cannot be observed, or because it is more convenient to use a proxy", but one could also use it for things that have been manipulated. What do others think?

lukeshope commented 5 years ago

I think what you're describing is the intended use-case for Procedure. The conversion from volts to Celsius (or whatever) would be a procedure with its own IRI, and the hasInput and hasOutput could (I think) correspond to two ObservationCollections, one in volts and the other in Celsius.

I would suggest using the value people are most interested in (i.e. Celsius) as the madeObservation relation under the Sensor. That observation would then have a usedProcedure link to the conversion process, which would in turn link to the raw voltage measurement, which would exist in another collection. :-/
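
The chain described above (converted collection, then usedProcedure, then hasInput, then raw collection) can be sketched as plain dictionaries. This is an illustration only: every IRI is a hypothetical placeholder, the flat key names are simplified, and `trace_to_raw` is an invented helper rather than anything defined by SSN or ssn-ext.

```python
# All IRIs are hypothetical placeholders, not real endpoints.
RAW_ID = "https://example.com/sensor/a/voltage/observations"
CONVERTED_ID = "https://example.com/sensor/a/temperature/observations"
PROCEDURE_ID = "https://example.com/procedures/volts-to-celsius"

# The conversion is a procedure with its own IRI, an input and an output.
procedures = {
    PROCEDURE_ID: {
        "@type": "sosa:Procedure",
        "hasInput": RAW_ID,
        "hasOutput": CONVERTED_ID,
    }
}

# Only the converted collection points at a procedure; the raw one does not.
collections = {
    RAW_ID: {"@type": "ssn-ext:ObservationCollection", "unit": "volt"},
    CONVERTED_ID: {
        "@type": "ssn-ext:ObservationCollection",
        "unit": "celsius",
        "usedProcedure": PROCEDURE_ID,
    },
}


def trace_to_raw(collection_id: str) -> list:
    """Follow usedProcedure -> hasInput links until a raw collection is reached."""
    lineage = [collection_id]
    while "usedProcedure" in collections[collection_id]:
        procedure = procedures[collections[collection_id]["usedProcedure"]]
        collection_id = procedure["hasInput"]
        lineage.append(collection_id)
    return lineage


print(trace_to_raw(CONVERTED_ID))
```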

SiBell commented 5 years ago

I'm currently veering towards this approach:

Let's say we have a sensor, a solar panel with an id of solar-panel-15434, whose raw output is a power measurement in kilowatts. However, some smarty-pants works out that you can derive a pretty good solar radiation estimate from it. You'd therefore end up with two ObservationCollections, each with slightly different properties:

Observation Collection 1

{
  "madeBySensor": "solar-panel-15434",
  "inDeployment": "solar-panels",
  "hostedBy": "gregs-roof",
  "observedProperty": "solar-power",
  "featureOfInterest": "utilities",
  "units": "kilowatt",
  "members": [
     {"value": 3.4, "time": "2018-09-23T10:34:00Z"},
     {"value": 3.2, "time": "2018-09-23T10:44:00Z"}
  ]
}

Observation Collection 2

{
  "madeBySensor": "solar-panel-15434",
  "inDeployment": "solar-panels",
  "hostedBy": "gregs-roof",
  "observedProperty": "solar-radiation",
  "featureOfInterest": "weather",
  "units": "watts-per-meter-squared",
  "implements": "solar-radiation-from-solar-power",
  "members": [
     {"value": 3.4, "time": "2018-09-23T10:34:00Z"},
     {"value": 3.2, "time": "2018-09-23T10:44:00Z"}
  ]
}

Key points:

The sensor, deployment, platform all stay the same, which makes sense.
We end up with a new observedProperty, and in this case a new featureOfInterest (won't always be the case).
We document the procedure using the implements property. The procedures should really be documented somewhere.
I doubt "members" is the correct key here, but you get the idea.
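
As a rough sketch of how the second collection could be generated from the first, the function below copies the shared properties, overrides the ones that change, and records the procedure under `implements`. The `derive_collection` helper and the conversion itself (a fixed panel area and efficiency) are assumptions made purely for illustration; the thread does not define how solar radiation is actually estimated.

```python
def derive_collection(source, observed_property, feature_of_interest,
                      units, procedure, convert):
    """Build a derived collection; sensor, deployment and platform are inherited."""
    derived = dict(source)  # shallow copy keeps madeBySensor, inDeployment, hostedBy
    derived.update({
        "observedProperty": observed_property,
        "featureOfInterest": feature_of_interest,
        "units": units,
        "implements": procedure,
    })
    # Build a fresh member list so the source collection is left untouched.
    derived["members"] = [
        {"value": convert(member["value"]), "time": member["time"]}
        for member in source["members"]
    ]
    return derived


# Hypothetical panel characteristics, used only to make the example concrete.
PANEL_AREA_M2 = 20.0
PANEL_EFFICIENCY = 0.2


def kilowatts_to_watts_per_m2(kilowatts):
    return kilowatts * 1000.0 / (PANEL_AREA_M2 * PANEL_EFFICIENCY)


solar_power = {
    "madeBySensor": "solar-panel-15434",
    "inDeployment": "solar-panels",
    "hostedBy": "gregs-roof",
    "observedProperty": "solar-power",
    "featureOfInterest": "utilities",
    "units": "kilowatt",
    "members": [{"value": 3.4, "time": "2018-09-23T10:34:00Z"}],
}

solar_radiation = derive_collection(
    solar_power,
    observed_property="solar-radiation",
    feature_of_interest="weather",
    units="watts-per-meter-squared",
    procedure="solar-radiation-from-solar-power",
    convert=kilowatts_to_watts_per_m2,
)
print(solar_radiation["implements"], solar_radiation["members"])
```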

lukeshope commented 5 years ago

Sounds sensible to me.

Apparently the right term for the Observations in an ObservationCollection is hasMember.

The only thing that isn't clear to me is whether each Observation should have its own Procedure, with the Input and Output pointing to specific Observations, or whether one Procedure could cover the whole timeseries, with the Input potentially being the whole ObservationCollection. Both seem acceptable under the ssn-ext specification.

SiBell commented 5 years ago

Yeah, the ssn-ext approach looks good to me:

"If present, the value of any of sosa:hasFeatureOfInterest, ssn-ext:hasUltimateFeatureOfInterest, sosa:madeBySensor, sosa:observedProperty, sosa:phenomenonTime, sosa:resultTime, or sosa:usedProcedure apply to all member observations, unless overridden by a value attached directly to the member observation."

And thus you'd think you could do the same with the Input: at the ObservationCollection level it could point to another ObservationCollection, but if specified at the Observation level it would most likely point to another Observation.
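
The override rule quoted above can be sketched as a simple merge where member values win over collection values. The `resolve_member` helper is hypothetical, and treating `hasInput` as inheritable is the speculation from this comment rather than something the quoted ssn-ext text states.

```python
# Properties that the quoted ssn-ext text says apply to all members.
INHERITED_KEYS = (
    "hasFeatureOfInterest",
    "hasUltimateFeatureOfInterest",
    "madeBySensor",
    "observedProperty",
    "phenomenonTime",
    "resultTime",
    "usedProcedure",
    "hasInput",  # assumption: not in the quoted list, added to mirror the idea above
)


def resolve_member(collection, member):
    """Return the member with collection-level defaults filled in."""
    resolved = {key: collection[key] for key in INHERITED_KEYS if key in collection}
    resolved.update(member)  # values attached directly to the member take precedence
    return resolved


collection = {
    "madeBySensor": "thermistor-abc",
    "usedProcedure": "bias-correction",
    "hasInput": "raw-collection",
}
plain_member = {"resultTime": "2020-02-28T11:16:22Z"}
overriding_member = {
    "resultTime": "2020-02-28T11:26:22Z",
    "hasInput": "raw-observation-42",
}

print(resolve_member(collection, plain_member))
print(resolve_member(collection, overriding_member))
```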

I'm guessing if you're combining temperature and humidity readings from two different ObservationCollections to derive dew point temperature then your Input field becomes an array?
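
To illustrate what an array-valued input might look like, the sketch below derives a dew point from one temperature and one humidity reading and lists both source observations under `hasInput`. The identifiers and the shape of the output are hypothetical, and the Magnus formula with coefficients 17.62 and 243.12 °C is just one common approximation, not a method chosen in this thread.

```python
import math


def dew_point_celsius(temperature_c, relative_humidity_percent):
    """Approximate dew point using the Magnus formula."""
    a, b = 17.62, 243.12  # commonly used Magnus coefficients (b in degrees Celsius)
    gamma = (math.log(relative_humidity_percent / 100.0)
             + (a * temperature_c) / (b + temperature_c))
    return (b * gamma) / (a - gamma)


# Hypothetical source observations from two different collections.
temperature = {"@id": "obs/air-temperature/123", "value": 20.0}
humidity = {"@id": "obs/relative-humidity/456", "value": 50.0}

dew_point_observation = {
    "observedProperty": "dew-point-temperature",
    "hasResult": {
        "value": dew_point_celsius(temperature["value"], humidity["value"]),
        "unit": "DEG_C",
    },
    "usedProcedure": "dew-point-from-temperature-and-humidity",
    # Two inputs, so the field becomes an array.
    "hasInput": [temperature["@id"], humidity["@id"]],
}
print(dew_point_observation)
```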

Actually recording all these relationships in the database might be such a pain that we never reach this level of complexity!

Joe-Heffer-Shef commented 4 years ago

Every single data set in the observatories will be manipulated in some way between capture and storage. All the raw data needs to be stored, and ideally be available. There may also be intermediate data sets or aggregated data sets that are useful for some researchers.

A record of all the transformations applied to the data also needs to be kept, since each data set will be made up of one or more other data sets. This lets the end user see a transparent data lineage.

Is it appropriate to have one system that describes the data (the meta-data system) that also documents the data lineage? Would separating the two issues make the technical implementation simpler? Would this same system apply to data lineage for the data pipelines across all the observatories? What happens when data is brought in from third parties, do we just redirect users to them so they can make their own enquiries?

SiBell commented 4 years ago

Hi @Joe-Heffer-Shef, thanks for your thoughts. I agree that we should be storing the raw data, and that data lineage is really important.

I still see the usedProcedure property being a vital tool for this. As an example let's say an air temperature reading has had a small bias correction applied, then a mean average taken. The observation might look like this:

{
  "madeBySensor": "thermistor-abc",
  "resultTime": "2020-02-28T11:16:22.043Z",
  "hasResult": {
    "value": 21.2,
    "unit": "DEG_C"
  },
  "usedProcedure": [
    "uo:PointSample", 
    "https://api.urbanobservatory.ac.uk/procedures/bias-correction-48fna38g", 
    "uo:MeanAverage"
  ],
  "observedProperty": "AirTemperature",
  "hasFeatureOfInterest": "EarthAtmosphere",
  "discipline": ["Meteorology"]
}

Given that more than one procedure will often be applied, it makes sense for usedProcedure to be an array, in which the order is reflective of the order the procedures were applied.
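
A minimal sketch of keeping that order honest is to append each procedure's identifier at the moment it is applied. The `apply_procedures` helper is hypothetical, and the +0.2 bias function simply mirrors the example values used in this comment.

```python
def apply_procedures(observation, procedures):
    """Apply (identifier, function) pairs in order and record each identifier."""
    value = observation["hasResult"]["value"]
    used = list(observation.get("usedProcedure", []))  # copy, so the input is untouched
    for identifier, function in procedures:
        value = function(value)
        used.append(identifier)  # array order matches application order
    result = dict(observation)
    result["hasResult"] = {**observation["hasResult"], "value": value}
    result["usedProcedure"] = used
    return result


raw_observation = {
    "madeBySensor": "thermistor-abc",
    "hasResult": {"value": 21.0, "unit": "DEG_C"},
    "usedProcedure": ["uo:PointSample"],
}

BIAS_CORRECTION = "https://api.urbanobservatory.ac.uk/procedures/bias-correction-48fna38g"

corrected_observation = apply_procedures(
    raw_observation,
    [(BIAS_CORRECTION, lambda value: value + 0.2)],
)
print(corrected_observation["usedProcedure"])
```

The raw observation is left unmodified, so both versions can be stored.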

uo:PointSample and uo:MeanAverage would link to a vocabulary of common procedures.

Going to "https://api.urbanobservatory.ac.uk/procedures/bias-correction-48fna38g" will give further details about the specific bias correction being applied, i.e. that it added +0.2°C. God knows how we'll agree on a structure for this procedure object though.

This database of procedures and the microservice that interacts with it could well form the meta-data system you're suggesting.

With regards to the raw data, we'd also end up storing and serving an observation that looks like this:

{
  "madeBySensor": "thermistor-abc",
  "resultTime": "2020-02-28T11:16:22.043Z",
  "hasResult": {
    "value": 21.0,
    "unit": "DEG_C"
  },
  "usedProcedure": [
    "uo:PointSample"
  ],
  "observedProperty": "AirTemperature",
  "hasFeatureOfInterest": "EarthAtmosphere",
  "discipline": ["Meteorology"]
}

N.B. because this is the raw observation, some procedures haven't yet been applied, and thus the value is slightly lower (21.0 rather than 21.2).

For third-party data, we could allow providers to add their own usedProcedure array to any data they send us. We may then append more procedures to that array if we have any corrections of our own to apply.
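
A small sketch of that append step, assuming the third party supplies its procedures in application order. The `append_our_procedures` helper and the identifiers prefixed with `thirdparty:` are invented for illustration.

```python
def append_our_procedures(third_party_observation, our_procedures):
    """Keep the provider's procedures first, then add ours in the order applied."""
    merged = dict(third_party_observation)
    merged["usedProcedure"] = (
        list(third_party_observation.get("usedProcedure", [])) + list(our_procedures)
    )
    return merged


incoming = {
    "hasResult": {"value": 12.3, "unit": "DEG_C"},
    "usedProcedure": ["thirdparty:PointSample", "thirdparty:DriftCorrection"],
}
stored = append_our_procedures(incoming, ["uo:MeanAverage"])
print(stored["usedProcedure"])
```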

SiBell commented 4 years ago

As I start to think about our public-facing front-end, it's clear that we need some way of filtering out raw and uncorrected data.

Let's say, for example, we want to show a map of PM10 readings across a city, but many of our sensors are low quality and require a correction based on the current relative humidity. Internally we'd want to keep a record of both the uncorrected and the corrected data, but the public should only see the corrected data. So our observations need a property that indicates this, and which we can use in a query string parameter to only request corrected data.

Here's what I'm thinking:

Uncorrected observation

{
  "madeBySensor": "arduino-pm-sensor",
  "resultTime": "2020-02-28T11:16:22.043Z",
  "hasResult": {
    "value": 100.0,
    "unit": "PartsPerMillion",
    "flag": ["uncorrected"]
  },
  "usedProcedure": [
    "PointSample"
  ],
  "observedProperty": "ParticularMatter10",
  "hasFeatureOfInterest": "EarthAtmosphere",
  "discipline": ["AtmosphericChemistry"]
}

Corrected Observation

{
  "madeBySensor": "arduino-pm-sensor",
  "resultTime": "2020-02-28T11:16:22.043Z",
  "hasResult": {
    "value": 100.0,
    "unit": "PartsPerMillion"
  },
  "usedProcedure": [
    "PointSample",
    "RelativeHumidityCorrection"
  ],
  "observedProperty": "ParticularMatter10",
  "hasFeatureOfInterest": "EarthAtmosphere",
  "discipline": ["AtmosphericChemistry"]
}

Key points:

The uncorrected observation has been assigned an "uncorrected" flag.
The corrected observation has an extra usedProcedure.

This way we can reuse the flag property, which is already used to filter out data that is dodgy for other reasons, e.g. those listed in this issue.

Then our front end's API request will look something like:

/observations?discipline=AtmosphericChemistry&observedProperty=ParticularMatter10&flag_exists=false

With the flag_exists=false ensuring we only get healthy data that we're happy to show the public.
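
For illustration, the request and the filtering it implies could be sketched as follows. The parameter names come from the URL above; the `is_public` helper and the idea that the server treats a missing or empty `flag` list as "no flags" are assumptions, not a description of an existing API.

```python
from urllib.parse import urlencode

# Build the query string from the parameters shown above.
params = {
    "discipline": "AtmosphericChemistry",
    "observedProperty": "ParticularMatter10",
    "flag_exists": "false",
}
url = "/observations?" + urlencode(params)
print(url)


def is_public(observation):
    """Assumed server-side check: an observation is public if it carries no flags."""
    return not observation.get("hasResult", {}).get("flag")


observations = [
    {"hasResult": {"value": 100.0, "flag": ["uncorrected"]},
     "usedProcedure": ["PointSample"]},
    {"hasResult": {"value": 100.0},
     "usedProcedure": ["PointSample", "RelativeHumidityCorrection"]},
]
public_observations = [obs for obs in observations if is_public(obs)]
print(len(public_observations))
```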