urbanobservatory / standards

Standards and schema documentation for the observatories programme

Being more specific about max/min/average observations #31

Open SiBell opened 4 years ago

SiBell commented 4 years ago

Many of us will have observations that aren't a simple instantaneous point sample, whether that's because of processing we've done, or because this is the only format third-party systems give us the data in. These will be things like maximum and minimum values over a given time frame, or perhaps a mean average. It's important that the end user is aware of this, otherwise you run the risk of max, min and average values being plotted together as a single line on a line graph (for example), when really they should be separate lines.

This challenge has been raised before, but following today's technical call I wanted to outline another approach that was discussed. Here's a breakdown of the options available to us:

1. Solely use the usedProcedures array

With this solution the usedProcedures would include either "Maximum", "Minimum", "MeanAverage", etc, when such a procedure has been used. We'd want to keep these consistent across observatories. The end-user or front-end application could then check for the presence of these particular procedures. There could also be some more bespoke, observatory-specific procedures in this array too.

{
  "@id": "debcc9ab-3820-4e86-bf2b-d5fa6d5f8002",
  "@type": "Observation",
  "resultTime": "2020-04-28T11:39:00.000Z",
  "hasResult": {
     "value": 6.9,
     "unit": "DegreeCelsius"
  },
  "phenomenonTime": {
    "hasBeginning": "2020-04-28T11:30:00.000Z",
    "hasEnd": "2020-04-28T11:40:00.000Z"
  },
  "madeBySensor": "thermistor-a7h2",
  "observedProperty": {
    "@id": "AirTemperature",
    "@type": "sosa:ObservableProperty",
    "label": "air temperature"
  },
  "usedProcedures": [
    "Maximum"                 <------------------
  ]
}

N.B. For all these examples the time interval over which the maximum applies is specified using the phenomenonTime object.
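
A minimal sketch of how a front-end might use option 1, assuming observations arrive as dictionaries shaped like the example above. The function names and the set of agreed procedure names are illustrative assumptions, not part of any agreed vocabulary.

```python
# Hypothetical set of procedure names kept consistent across observatories.
AGGREGATE_PROCEDURES = {"Maximum", "Minimum", "MeanAverage"}


def aggregate_kind(observation):
    """Return the aggregate procedure an observation used, or None for a plain sample."""
    for procedure in observation.get("usedProcedures", []):
        if procedure in AGGREGATE_PROCEDURES:
            return procedure
    return None


def split_into_lines(observations):
    """Group observations so each aggregate kind becomes its own line on a graph."""
    lines = {}
    for obs in observations:
        lines.setdefault(aggregate_kind(obs), []).append(obs)
    return lines


# Made-up sample data for illustration only.
observations = [
    {"hasResult": {"value": 6.9}, "usedProcedures": ["Maximum"]},
    {"hasResult": {"value": 4.2}, "usedProcedures": ["Minimum"]},
    {"hasResult": {"value": 5.5}, "usedProcedures": []},
]
lines = split_into_lines(observations)
```

Bespoke, observatory-specific procedures in the array are simply ignored by this check, so they can sit alongside the shared names.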

2. Add an extra @type to the observedProperty object

This was the approach suggested by @lukessmith in today's technical call.

In this approach the @type of the observedProperty object becomes an array, because not only is this observed property of type sosa:ObservableProperty, but we also make it clear that this is a maximum using the uo:Maximum type. This frees us up to be a bit more specific with our usedProcedures.

The downside of this approach is we'll likely need to add another column/property to our database to capture this type.

{
  "@id": "debcc9ab-3820-4e86-bf2b-d5fa6d5f8002",
  "@type": "Observation",
  "resultTime": "2020-04-28T11:39:00.000Z",
  "hasResult": {
     "value": 6.9,
     "unit": "DegreeCelsius"
  },
  "phenomenonTime": {
    "hasBeginning": "2020-04-28T11:30:00.000Z",
    "hasEnd": "2020-04-28T11:40:00.000Z"
  },
  "madeBySensor": "thermistor-a7h2",
  "observedProperty": {
    "@id": "AirTemperature",
    "@type": ["sosa:ObservableProperty", "uo:Maximum"],    <------------------
    "label": "air temperature"
  },
  "usedProcedures": [
    "5-minute-max-from-1-minute-samples"
  ]
}

Other types could be uo:Minimum, uo:MeanAverage, any more?
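
One practical consequence of option 2 is that clients have to cope with @type being either a string or an array. A rough sketch of how that could be handled, assuming the uo: prefix marks the aggregation type; the helper names are hypothetical.

```python
def type_list(node):
    """Return the @type of a JSON-LD node as a list, whether it was a string or an array."""
    value = node.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def aggregation_type(observed_property, prefix="uo:"):
    """Return the first uo: type with the prefix removed, or None if there isn't one."""
    for type_name in type_list(observed_property):
        if type_name.startswith(prefix):
            return type_name[len(prefix):]
    return None
```

With the option 2 example above, aggregation_type(observation["observedProperty"]) would give "Maximum", while the option 1 example would give None.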

3. Create completely new observed properties

In this approach I've created a completely new observable property with an id of AirTemperatureMaximum specifically for maximum air temperature readings. I've actually kept the extra "uo:Maximum" type, as it provides a nice way to denote which of these new observed properties are maximums, and which are minimums, etc, and this could be defined in the common vocabulary.

{
  "@id": "debcc9ab-3820-4e86-bf2b-d5fa6d5f8002",
  "@type": "Observation",
  "resultTime": "2020-04-28T11:39:00.000Z",
  "hasResult": {
     "value": 6.9,
     "unit": "DegreeCelsius"
  },
  "phenomenonTime": {
    "hasBeginning": "2020-04-28T11:30:00.000Z",
    "hasEnd": "2020-04-28T11:40:00.000Z"
  },
  "madeBySensor": "thermistor-a7h2",
  "observedProperty": {
    "@id": "AirTemperatureMaximum",                          <------------------
    "@type": ["sosa:ObservableProperty", "uo:Maximum"],
    "label": "air temperature maximum"
  },
  "usedProcedures": [
    "5-minute-max-from-1-minute-samples"
  ]
}

The downside of this approach is we'd quickly end up tripling the length of the observed properties list, potentially duplicating much of what we defined for AirTemperature.

SiBell commented 4 years ago

Been giving this a bit more thought today. I'm wondering if the approach should actually look more like this:

{
  "@id": "debcc9ab-3820-4e86-bf2b-d5fa6d5f8002",
  "@type": "Observation",
  "timeseries": "8j3e92k",
  "resultTime": "2020-04-28T11:39:00.000Z",
  "hasResult": {
     "value": 6.9,
     "unit": "DegreeCelsius"
  },
  "phenomenonTime": {
    "hasBeginning": "2020-04-28T11:30:00.000Z",
    "hasEnd": "2020-04-28T11:40:00.000Z"
  },
  "madeBySensor": "thermistor-a7h2",
  "observedProperty": {
    "@id": "AirTemperature",
    "@type": "sosa:ObservableProperty",
    "label": "air temperature"
  },
  "usedProcedures": [
    {
      "@id": "simple-offset-bias-correction",
      "@type": "BiasCorrection",
      "label": "bias corrected",
      "comment": "A simple constant offset is applied to all values"
    },
    {
      "@id": "5-minute-max-from-1-minute-samples",
      "@type": "Maximum",
      "label": "5 minute maximum",
      "description": "The maximum value over a 5 minute window, selected from 1 minute samples"
    }
  ]
}

Key points:

EttoreHector commented 4 years ago

Thank you, Simon.

The last approach makes more sense to me. Just a question: what would a response to a time series request look like? Also, are you suggesting we query Observations only when we want to retrieve single values from any set of sensors (like, for example, the last value recorded), and Timeseries when we ask for 2 or more values from the same sensor?

On Wed, 29 Apr 2020, 16:08 Si Bell, notifications@github.com wrote:

Key points:

  • The generic types, e.g. Maximum, Average, BiasCorrection now apply to the procedures not the observed properties. We could add a query parameter that allows us to filter out any observations with a procedure type we don't want.
  • By populating the procedures, i.e. including their label and description, you have a nice human-friendly bit of text to show the user exactly what processing has occurred to produce this observation.
  • I've added a timeseries property to the observation. Typically any observations with the same timeseries ID should be shown as a single line on a line graph. You could have an observation that looks exactly the same as the one above, except that the 5 minute maximum was instead a daily maximum. This would mean it has a different timeseries ID, and this could be used to ensure that the value was plotted in a different colour on any graphs.
  • I'm fully aware that this observation has an awful lot of metadata included, and it may not be practical to include this much information when asking for hundreds of observations in one go. This could be where the timeseries approach comes in. You first ask for what timeseries are available, then ask for the stripped down observations for this timeseries (/timeseries/:timeseriesId/observations). All the observations in a timeseries will have exactly the same values for observedProperty, unit, disciplines, hasFeatureOfInterest, usedProcedures, madeBySensor so there's no point in repeating these in every single observation.
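
To illustrate the last bullet, here is a rough sketch of how a client could recombine the shared timeseries metadata with stripped-down observations. The shape of the stripped observations and the helper name are assumptions, since the thread doesn't define the response format, and the sample values are made up.

```python
# Properties that are identical for every observation in a timeseries (per the list above).
SHARED_KEYS = ("observedProperty", "unit", "disciplines", "hasFeatureOfInterest",
               "usedProcedures", "madeBySensor")


def expand(timeseries, stripped_observations):
    """Rebuild full observations by copying the shared timeseries metadata onto each one."""
    shared = {key: timeseries[key] for key in SHARED_KEYS if key in timeseries}
    return [{**shared, "timeseries": timeseries["@id"], **obs}
            for obs in stripped_observations]


# Hypothetical timeseries record and stripped observations.
timeseries = {
    "@id": "8j3e92k",
    "observedProperty": "AirTemperature",
    "unit": "DegreeCelsius",
    "madeBySensor": "thermistor-a7h2",
    "usedProcedures": ["5-minute-max-from-1-minute-samples"],
}
stripped = [
    {"resultTime": "2020-04-28T11:35:00.000Z", "value": 6.9},
    {"resultTime": "2020-04-28T11:40:00.000Z", "value": 7.1},
]
full = expand(timeseries, stripped)
```

The metadata travels once per timeseries rather than once per observation, which is where the saving comes from on large requests.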


SiBell commented 4 years ago

Hi Ettore,

I've mostly implemented the timeseries endpoints on my API so you can actually have a gander at some real data:

I still need to populate a few more of the fields, e.g. so observedProperty is an object rather than just an ID string. Hopefully I'll get this implemented later today.

SiBell commented 4 years ago

Also, are you suggesting we query Observations only when we want to retrieve single values from any set of sensors (like, for example, the last value recorded), and Timeseries when we ask for 2 or more values from the same sensor?

Yer pretty much. So if a user wants to get the latest air temperature observations to show on a map they could make a request like this:

https://api.birminghamurbanobservatory.com/observations?observedProperty=AirTemperature&onePer=timeseries

This gets one observation per timeseries, but you could also do onePer=sensor. That's not to say that you can't still get more than one observation per sensor/timeseries this way, you just have to omit the onePer parameter, but I'm struggling to see a good use case, other than just having a nosey at what's available.

Then if you want to see a history of observations from that timeseries you would then call:

https://api.birminghamurbanobservatory.com/timeseries/zBO/observations
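
A small sketch of how a client could build these two requests, using only the endpoints and parameters shown above. The function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.birminghamurbanobservatory.com"


def latest_observations_url(observed_property, one_per="timeseries"):
    """URL for one recent observation per timeseries (or per sensor) for a property."""
    query = urlencode({"observedProperty": observed_property, "onePer": one_per})
    return f"{BASE_URL}/observations?{query}"


def timeseries_history_url(timeseries_id):
    """URL for the history of observations belonging to a single timeseries."""
    return f"{BASE_URL}/timeseries/{timeseries_id}/observations"
```

A map would typically call the first URL once, then follow up with the second for whichever timeseries the user clicks on.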

SiBell commented 4 years ago

Unfortunately I've been giving the usedProcedures more thought (not good for my sanity), and had a good chat with my front-end developer about it. Here are our thoughts:

When it comes to making line graphs of the data there's no problem. We can essentially show the end-user a pretty list of the available timeseries and they can choose what they plot. E.g. they can choose if they want to show the timeseries that uses a 5 minute averaging procedure, or perhaps they choose a timeseries with the daily maximums. They might choose to show 3 timeseries on a graph together, e.g. the 5 minute maximums, the 5 minute averages, and the 5 minute minimums. You would plot each timeseries as a separate line, in a separate colour, and it'll be fairly clear (with the help of a legend) which is which. Great. Easy.

Maps, HOWEVER, are a whole different kettle of fish! A common thing many of us will want to do is show a nice map of recent air temperature observations, so the end-user can see which parts of the city are warmer than others. However, you might have deployed air temperature sensors from a dozen different manufacturers, all with slightly different procedures: some take instantaneous measurements, others take 1 minute averages, others 5 minute averages. Therefore when you make the following request:

https://api.birminghamurbanobservatory.com/observations?observedProperty=AirTemperature&onePer=timeseries

... you could get back observations with a wide variety of procedures. Here are some examples:

| | usedProcedures | suitable for use on the map |
|---|---|---|
| observation 1 | ["instantaneous-sample"] | yes |
| observation 2 | ["1-min-avg-of-1-sec-samples"] | yes |
| observation 3 | ["solar-insolation-bias-correction", "5-min-avg-of-1-min-samples"] | yes |
| observation 4 | ["daily-max-of-hourly-samples"] | no |
| observation 5 | ["5-min-max-of-1-min-samples"] | no |

Now every map will be different. In my map of air temperatures I'd probably be happy to show an hourly average air temperature to the public; however, some third-party app that uses our API might only feel comfortable showing average air temperatures if they're an average over 10 minutes or less. They may also prefer not to show observations which required a "solar-insolation-bias-correction" procedure to be applied.

The point I'm making is that we don't have the authority to decide what's map-worthy and what's not. Therefore we can't just add a query parameter ?suitableForMap=true to make life easier for the end users, because later down the line we might add a new sensor that takes 30 minute averages and decide it's suitable for the map; the third-party app would assume it's OK, when in fact it's not appropriate for their application.

So... as painful as it may be, our requests might have to look like this (at least for applications sensitive to the procedures being used):

https://api.birminghamurbanobservatory.com/observations?observedProperty=AirTemperature&onePer=timeseries&usedProcedures__include=instantaneous-sample,1-min-avg-of-1-sec-samples,5-min-avg-of-1-min-samples,solar-insolation-bias-correction

I.e. we specifically list the acceptable usedProcedures, so we're not vulnerable if any new sensors with unacceptable procedures come online at a later date.

The challenge then is ensuring users are aware of any new usedProcedures that come online so they can decide if they want to include them so they don't miss out on the data.
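
A sketch of the allow-list idea, applied client-side to the example observations listed above. It assumes an observation is acceptable only when every one of its procedures is on the list, which is one reading of usedProcedures__include rather than confirmed API behaviour. The second helper is a hypothetical way to spot newly introduced procedures that still need a decision.

```python
# The procedures this particular application has reviewed and accepted.
ACCEPTED = {
    "instantaneous-sample",
    "1-min-avg-of-1-sec-samples",
    "5-min-avg-of-1-min-samples",
    "solar-insolation-bias-correction",
}


def is_acceptable(observation, accepted=ACCEPTED):
    """True when every procedure the observation used is on the allow-list."""
    return set(observation["usedProcedures"]) <= set(accepted)


def unseen_procedures(observations, known):
    """Procedures present in the data that the application has not yet reviewed."""
    seen = {p for obs in observations for p in obs["usedProcedures"]}
    return seen - set(known)


observations = [
    {"usedProcedures": ["instantaneous-sample"]},
    {"usedProcedures": ["1-min-avg-of-1-sec-samples"]},
    {"usedProcedures": ["solar-insolation-bias-correction", "5-min-avg-of-1-min-samples"]},
    {"usedProcedures": ["daily-max-of-hourly-samples"]},
    {"usedProcedures": ["5-min-max-of-1-min-samples"]},
]
for_map = [obs for obs in observations if is_acceptable(obs)]
```

Running unseen_procedures periodically against an unfiltered request is one way an application could notice new procedures without them silently appearing on the map.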

Sorry for opening such a miserable can of worms. If you have a more elegant solution I'd be all ears.

SiBell commented 4 years ago

Some additions that might make life easier for us:

{
  "@id": "debcc9ab-3820-4e86-bf2b-d5fa6d5f8002",
  "@type": "Observation",
  "timeseries": "8j3e92k",
  "resultTime": "2020-04-28T11:39:00.000Z",
  "hasResult": {
     "value": 6.9,
     "unit": "DegreeCelsius"
  },
  "phenomenonTime": {
    "hasBeginning": "2020-04-28T11:30:00.000Z",
    "hasEnd": "2020-04-28T11:35:00.000Z",
    "interval": 300    <-------------------------------------------------
  },
  "madeBySensor": "thermistor-a7h2",
  "observedProperty": {
    "@id": "AirTemperature",
    "@type": ["sosa:ObservableProperty", "uo:Maximum"],     <------------
    "label": "air temperature"
  },
  "usedProcedures": [
    {
      "@id": "simple-offset-bias-correction",
      "label": "bias corrected",
      "comment": "A simple constant offset is applied to all values"
    },
    {
      "@id": "5-minute-max-from-1-minute-samples",
      "label": "5 minute maximum",
      "description": "The maximum value over a 5 minute window, selected from 1 minute samples"
    }
  ]
}

So I've reintroduced the option 2 approach and I've added an interval property describing the interval length (in seconds) over which the maximum was calculated.

I don't think there are many types we'd need to worry about:

| type | description |
|---|---|
| Instant | For an instantaneous sample at a single point in time. Such observations wouldn't have a phenomenonTime object. |
| Average | Covers any type of average. To specify whether it's a mean, mode or median average, use a usedProcedure. |
| Maximum | The maximum value recorded over the interval. |
| Minimum | The minimum value recorded over the interval. |
| Sum | The sum of values recorded over the interval, for example hourly rainfall accumulations could be summed up to give a daily sum. |
| Count | The total count over the interval. For example a sensor might create an instantaneous observation every time a single pedestrian walks by. If our servers did some post-processing and counted up all these occurrences, the generated observation would be of type Count. I think this differs enough from Sum to be worthy of its own type? |

Crucially we'd need some query parameters that allow us to filter by this sampling type and also by the length of the interval. Something like:

GET https://api.example.com/observations?observedProperty=AirTemperature&observedPropertyType__in=instant,average&interval__lte=600

For when we want Air Temperature observations and we're happy to accept both instant and average types, as long as the interval is no longer than 10 minutes.
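
A rough sketch of how a server might interpret those two filters. The record shape is a simplified assumption (a flat type and interval per observation) and the function names are hypothetical. Instant observations have no interval, so they are treated as 0 seconds here.

```python
from urllib.parse import parse_qs


def parse_filters(query_string):
    """Extract the accepted types and the maximum interval (seconds) from a query string."""
    params = parse_qs(query_string)
    types = set(params["observedPropertyType__in"][0].split(","))
    max_interval = int(params["interval__lte"][0])
    return types, max_interval


def matches(record, types, max_interval):
    """record is a simplified observation, e.g. {"type": "average", "interval": 300}."""
    return record["type"] in types and record.get("interval", 0) <= max_interval


types, max_interval = parse_filters(
    "observedProperty=AirTemperature&observedPropertyType__in=instant,average&interval__lte=600"
)
```

Note that __lte is inclusive, so an average over exactly 600 seconds would still be returned.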

SiBell commented 4 years ago

Use duration rather than interval.

SiBell commented 4 years ago

Another type: Range, i.e. the maximum value within the time frame minus the minimum.

SiBell commented 4 years ago

When duration is given as a number, it is in seconds. Alternatively it can be given as a string using the ISO 8601 duration format, e.g. PT1H for 1 hour. It was generally agreed that we should stick with just seconds, minutes and hours for simplicity.
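
A minimal sketch of a parser for this rule, assuming only whole hours, minutes and seconds are allowed. ISO 8601 places time components after a T designator, so one hour is written PT1H and five minutes PT5M. The function name is hypothetical.

```python
import re

# Accepts PT<n>H<n>M<n>S with at least one component; days, weeks etc. are rejected.
_DURATION = re.compile(r"^PT(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(duration):
    """Convert a duration given as a number (seconds) or an ISO 8601 string to seconds."""
    if isinstance(duration, (int, float)):
        return duration
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```

Normalising everything to seconds on the way in keeps comparisons between numeric and string durations straightforward.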

Some extra resources regarding the duration string format:

SiBell commented 4 years ago

I was a little inconsistent when it came to querying by the type before. If the type we want is uo:Maximum then the query would look like:

.../observations?observedPropertyType=Maximum

I.e. it has a capital M. We allow the prefix uo: to be omitted.

SiBell commented 4 years ago

Other possible aggregation types: Variance and StandardDeviation.

God knows if any of us will use them...
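
For reference, a sketch of how each aggregation type mentioned in this thread could be computed from a list of samples. The mapping is illustrative only: Average is taken as the mean here, and the choice of population (rather than sample) variance and standard deviation is an assumption that would need agreeing.

```python
import statistics


def aggregate(samples, kind):
    """Reduce a list of numeric samples to one value using the named aggregation type."""
    reducers = {
        "Average": statistics.mean,
        "Maximum": max,
        "Minimum": min,
        "Sum": sum,
        "Count": len,
        "Range": lambda xs: max(xs) - min(xs),
        "Variance": statistics.pvariance,
        "StandardDeviation": statistics.pstdev,
    }
    return reducers[kind](samples)
```

An unknown type raises a KeyError, which seems preferable to silently falling back to some default.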

SiBell commented 4 years ago

I've somewhat implemented the addition of this aggregation information and the phenomenonTime duration. Examples here:

I have found it easier to have the aggregation method as a separate property, rather than have it as a @type of the observedProperty. I'd be interested to hear how people feel about this.