urbanobservatory / standards

Standards and schema documentation for the observatories programme

Options for giving observations context - cast your votes! #22

Open SiBell opened 4 years ago

SiBell commented 4 years ago

Ok, so let's say you have the following observation:

{
  "madeBySensor": "sensor-123",
  "resultTime": "2020-02-18T17:24:16.094Z",
  "hasResult": {
    "value": 21.2
  },
  "location": {
    "type": "Point",
    "geometry": [-1.9, 52.5]
  }
}

Completely useless! Right? We have no idea what variable is being measured, what the units are, what "thing" the observation relates to, e.g. does it relate to a person, a vehicle, the atmosphere, a building, etc, etc. Is it useful to meteorologists, highways engineers, air pollution experts or facilities managers???

It's clear we need to add some more properties to this observation to give it some context. The properties we add will become a vital part of how we query the data. E.g. allowing us to ask questions such as:

N.b. queries should also allow filtering by time and space, but that is not the focus of this issue.

I'll now present a series of possible solutions to this problem. I.e. different combinations of extra properties that we can add to the observations served by our APIs. N.B. this isn't an exhaustive list, so feel free to post your own suggestions, or merge features from one option with some from another.

Option 1

Key features:

Example 1:

_(N.B. I've omitted properties such as resultTime and location for brevity)_

{
  "madeBySensor": "sensor-123",
  "hasResult": {
    "value": 21.2,
    "unit": "deg_c"
  },
  "observedProperty": "air-temperature",
  "featureOfInterest": "earth-atmosphere",
  "discipline": "Meteorology"
}

Example 2:

{
  "madeBySensor": "sensor-123",
  "hasResult": {
    "value": 21.2,
    "unit": "deg_c"
  },
  "observedProperty": "air-temperature",
  "featureOfInterest": "room-112",
  "discipline": "environment-control"
}

These examples show how much easier it now is to tell that one observation is an outdoor temperature and the other an indoor temperature.
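As a minimal sketch of what that distinction buys a client, the Python below filters the two Option 1 examples by observedProperty and featureOfInterest. The function name `filter_observations` and the in-memory list are hypothetical, used only for illustration; a real API would presumably apply such filters server-side.

```python
from typing import Any

# The two Option 1 examples, as served by a hypothetical API.
observations: list[dict[str, Any]] = [
    {
        "madeBySensor": "sensor-123",
        "hasResult": {"value": 21.2, "unit": "deg_c"},
        "observedProperty": "air-temperature",
        "featureOfInterest": "earth-atmosphere",
        "discipline": "Meteorology",
    },
    {
        "madeBySensor": "sensor-123",
        "hasResult": {"value": 21.2, "unit": "deg_c"},
        "observedProperty": "air-temperature",
        "featureOfInterest": "room-112",
        "discipline": "environment-control",
    },
]


def filter_observations(obs, **criteria):
    """Return the observations whose properties equal every given criterion."""
    return [o for o in obs if all(o.get(k) == v for k, v in criteria.items())]


# Only the outdoor air temperature matches both criteria.
outdoor = filter_observations(
    observations,
    observedProperty="air-temperature",
    featureOfInterest="earth-atmosphere",
)
```

Without featureOfInterest (or discipline) the two observations would be indistinguishable to this kind of query.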

In the JSON-LD document detailing our vocabulary we should reference any other vocabularies for which the observedProperty is equivalent, e.g. our air-temperature is equivalent to the CF Standard's definition for air_temperature. It may be we can't find any equivalents and therefore we need to provide our own description.

For this solution we'd only maintain a common vocabulary for observedProperty and discipline. The featureOfInterest is more for the individual observatories to add bespoke tags relevant to their particular deployment. I can see us being a bit limited by the SSN ontology only allowing one featureOfInterest per observation. It would be nice to have ["urban-sciences-building", "room-112"].

What case we use for the property values is also up for debate, e.g. camelCase, kebab-case, etc. Whatever we choose we should make sure it's URL-friendly.

Here the unit should reference the QUDT vocabulary (assuming we can find a match).

Should we allow just one discipline per observation?

We could swap the term discipline for theme instead. Either way it would serve the same function.

Option 2

Key features:

{
  "madeBySensor": "sensor-123",
  "hasResult": {
    "value": 21.2,
    "unit": "deg_c"
  },
  "observedProperty": "air-temperature",
  "featureOfInterest": "atmosphere",
  "qualifier": ["street-level"]
}

So the qualifier acts much like featureOfInterest did in option 1, except we'll allow it to be an array.

Option 3

Key features:

{
  "madeBySensor": "honeywell-device-6a4-thermistor",
  "hasResult": {
    "value": 21.2,
    "unit": "deg_c"
  },
  "observedProperty": "air-temperature",
  "featureOfInterest": "infrastructure",
  "platform": ["urban-sciences-building", "second-floor", "room-112", "honeywell-device-6a4"]
}

So, many of us may choose to use the concept of a Platform anyway. E.g. according to the SSN ontology a weather station would be a Platform which has multiple Sensors hosted on it. This solution extends that. This hierarchical approach can be incredibly powerful, but does add a bit of complexity to the database (see ltrees and NoSQL tree structures).
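To illustrate why the hierarchy is useful for querying, here is a rough Python sketch of an "is this observation hosted somewhere under X" test, which is the same kind of ancestor check a tree-aware database column would provide. The helper name `hosted_under` is hypothetical and the comparison is done in memory purely for illustration.

```python
def hosted_under(platform_path, ancestor_path):
    """True if ancestor_path is a leading segment of platform_path."""
    return platform_path[: len(ancestor_path)] == ancestor_path


observation = {
    "madeBySensor": "honeywell-device-6a4-thermistor",
    "observedProperty": "air-temperature",
    "platform": [
        "urban-sciences-building",
        "second-floor",
        "room-112",
        "honeywell-device-6a4",
    ],
}

# Everything on the second floor of the building, whichever room it is in.
on_second_floor = hosted_under(
    observation["platform"], ["urban-sciences-building", "second-floor"]
)
```

The same function answers building-level, floor-level and room-level questions, depending only on how long the ancestor path is.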

Option 4

Key features:

This could be used in combination with any of the previous options, with the goal of being able to do without one of the other properties.

Example 1 (modifies the Option 1 example):

{
  "madeBySensor": "sensor-123",
  "hasResult": {
    "value": 21.2,
    "unit": "deg_c"
  },
  "observedProperty": "outdoor-air-temperature",
  "discipline": "Meteorology"
}

Because the observedProperty is far more specific i.e. "outdoor-air-temperature" not "air-temperature" we could perhaps get away without the featureOfInterest or the discipline.

SiBell commented 4 years ago

At Birmingham I've implemented Option 3 and it seems to work pretty well, so this would be my preference. However, when the featureOfInterest values are this broad I feel like discipline is a better term.

SiBell commented 4 years ago

I completely forgot to mention Deployments, which can also help add context to an observation.

inDeployment is a property of a Platform. Using Simon J's water works example, you might have a platform called aeration-tank-sensor-rig which is part of a deployment called cranfield-water-works-research.

When querying our observations we could have a querystring parameter set to inDeployment=cranfield-water-works-research; this would filter the returned observations to just those from the water works.
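A small sketch of how a client might build such a request in Python. The base URL is a placeholder (the thread does not specify a real endpoint) and `observations_url` is a hypothetical helper; only the inDeployment parameter comes from the discussion above.

```python
from urllib.parse import urlencode

# Placeholder endpoint, assumed for illustration only.
BASE_URL = "https://api.example.org/observations"


def observations_url(**filters):
    """Build an observations URL with the given querystring filters."""
    return f"{BASE_URL}?{urlencode(filters)}"


url = observations_url(inDeployment="cranfield-water-works-research")
```

Further filters, such as observedProperty, could be passed as extra keyword arguments in the same way.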

For full disclosure, I've picked a nice example here. Your deployment could be birmingham-weather-stations and therefore you'd probably want some extra properties for better context, e.g. {featureOfInterest: 'main-library-rooftop'} or {platform: ['main-library', 'rooftop', 'climavue-6']} or {qualifier: ['roof-level']}.

EttoreHector commented 4 years ago

My preference would be for something like this:

{
  "madeBySensor": "sensor-123",
  "resultTime": "2020-02-18T17:24:16.094Z",
  "hasResult": {
    "value": 21.2
  },
  "location": {
    "type": "Point",
    "geometry": [-1.9, 52.5]
  },
  "observedProperty": "air-temperature",
  "featureOfInterest": "earth-atmosphere",
  "discipline": ["environment-control", "Meteorology", ...], // This is an array
  "qualifier": ["street-level", ...] // This is an array
}

In particular I would:

SiBell commented 4 years ago

With this approach I assume we'd keep a common dictionary of observedProperty and discipline values, but what about featureOfInterest and qualifier? I suspect it would be a nightmare to maintain and therefore we shouldn't, but we'd need to accept that one observation might use atmosphere and another earth-atmosphere for the same featureOfInterest. Likewise indoor and inside for the same qualifier.

lukeshope commented 4 years ago

Firstly, many thanks Simon for putting together some options and Ettore for your thoughts.

These are my initial thoughts...

Option 1

This looks like a pretty good option to me.

It

On your comment about it being nice to specify the featureOfInterest as both the building and the room, I see the graph as being the solution to this. The featureOfInterest is the room, and the room then describes its own relationship to the building. The only downside is clients have to traverse the graph to find out all of the detail, but in the era of HTTP/2 requests are cheap.
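A rough sketch of that traversal in Python, using an in-memory dictionary in place of real HTTP requests. The property name `isPartOf` and the `Floor` and `Building` types are assumptions made for this example; they are not terms agreed anywhere in this thread.

```python
# Hypothetical graph keyed by node identifier. In a real API each lookup
# would be a request to the node's IRI rather than a dictionary access.
graph = {
    "room-112": {"@type": "Room", "isPartOf": "second-floor"},
    "second-floor": {"@type": "Floor", "isPartOf": "urban-sciences-building"},
    "urban-sciences-building": {"@type": "Building"},
}


def ancestors(node_id, graph):
    """Follow isPartOf links upwards and return the chain of containing nodes."""
    chain = []
    seen = {node_id}  # guards against accidental cycles in the data
    current = graph.get(node_id, {}).get("isPartOf")
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = graph.get(current, {}).get("isPartOf")
    return chain


observation = {
    "observedProperty": "air-temperature",
    "featureOfInterest": "room-112",
}
containers = ancestors(observation["featureOfInterest"], graph)
```

The observation itself only names the room; the building is recovered from the room's own description, at the cost of one lookup per level.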

Option 2

This breaks the design pattern of JSON-LD in my mind, and loses the advantages of using vocabulary-referenced keys and values.

It uses qualifier as an array of string expressions that are really unstructured metadata and descriptions. I think we can avoid this because JSON-LD objects aren't sealed: you can add whatever additional properties you want. For example:

  "relativeElevation": "street"

or better yet

  "heightAboveSurface": 2.0

Option 3

I see the rationale for this, but I worry that it

By this I mean a temperature sensor in a room may be mounted on a specific wall or may be part of an instrument panel or whatever, but the reason it's there is to represent the room as a whole. The fact it represents the room as a whole (or a zone, whatever the rationale was) is important when looking at the data.

All that said, no problem with platforms that are an entire weather station etc. There is also the option of describing hierarchy as a nested graph rather than arrays, which I think would be more JSONy.

Option 4

Yeah, let's not do this if we can avoid it, because I don't know how we could nail down exactly how granular they should be, and we risk losing the ability to compare across sensors if there are so many observedProperty values.

Summary

My preference is option 1. This has the advantage for me of being clear that air temperature is just air temperature, but if you wanted to make sure you weren't plotting indoor and outdoor temperatures, you would compose a query that only looks at air temperatures with a featureOfInterest of earth-atmosphere (or whatever we end up using).

With regard to Ettore's comments, obviously I'm no fan of qualifier as above, but I don't object to discipline being an array if that proves useful. Array or non-array would both be fine to me. They should dereference to a full IRI for the discipline, so they will be strings against a base IRI presumably, if we all use a common set of disciplines.

lukeshope commented 4 years ago

I thought it might be useful to give a proper example of where having a single FeatureOfInterest can be really useful.

If I have an API that's structured around a smart building, then from the entry point there would be a few logical ways to reach the relevant sensor data:

For the latter, the relationship between the rooms and the observations would be done through hasFeatureOfInterest.

I've thrown up an actual demo of how this could look here, which is really just the start of me trying to give some better examples for the JSON Schema/Hyper-Schema stuff. I haven't added any schemas yet, it's all pure JSON-LD at this point. Code is here.

SiBell commented 4 years ago

So useful seeing an example API and some code. Thanks for that @lukessmith.

Some thoughts below. Some of which I'm sure you're aware of, but just haven't had the time to implement.

EttoreHector commented 4 years ago

Thank you, Luke, for the example.

Also, you totally convinced me that the use of "qualifiers" is a bad idea, as it disrupts the JSON-LD.

Just a couple of points:

  1. Sometimes we need to use weird descriptions for the ObservedProperty. For example, in some of our traffic cameras, pedestrian traffic is qualified as "towards the city centre" or "from the city centre". I'm not too sure how one would make use of such descriptions. This is actually why I initially thought of introducing "qualifiers". Thinking about it, couldn't we add "qualifiers" to our muo vocabulary?

  2. Shall we allow (or require) "discipline" to be an array (and maybe use the plural form "disciplines")?

lukeshope commented 4 years ago

Thanks Simon and Ettore. You're right in that it's a work in progress so there's more to be done and considered, but building something definitely helps to surface some of the issues.

You've used /room, would using /rooms make more sense? Does it even matter if one observatory uses a singular, and another the plural? I'd personally say we should agree on either singular or plural and stick to it throughout. This gets plenty of debate on StackOverflow.

Good point. It does appear as though the internet is settling on plural as being the convention. I'm keen that we don't end up attributing any semantic value to the paths we use, as it shouldn't make any difference, but I'm happy to go with plurals for the sake of consistency.

There's probably a more fundamental question here about how we do collections: should we return a collection and a view on that collection as a single object? I obviously have in my example, but the argument against doing this would be by having an 'outer' collection, you could describe the collection, within which you have a sub-object that is the view (the ten items on the page you've requested etc.). I think useful descriptions for a collection might include the total number of items in it, or the total number of filtered items (because a filter would still be paginated), and potentially also a list of the types within the collection (which would be a more semantically useful way of saying, this collection only contains rooms, whereas I might have another that only contains features of interest, which in my case would be both rooms and zones).
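To make the "outer collection plus view" alternative concrete, here is a hedged Python sketch that wraps one page of rooms in a description of the whole collection. The key names `totalItems`, `memberTypes`, `view` and `CollectionView` are invented for this example and are not taken from the playground code or from any agreed vocabulary.

```python
def collection_page(items, page, page_size):
    """Wrap one page of items in an outer object describing the whole collection."""
    start = (page - 1) * page_size
    members = items[start : start + page_size]
    return {
        "@type": "Collection",
        # These describe the full (possibly filtered) collection, not the page.
        "totalItems": len(items),
        "memberTypes": sorted({t for item in items for t in item["@type"]}),
        # The view is the slice the client actually asked for.
        "view": {
            "@type": "CollectionView",
            "page": page,
            "pageSize": page_size,
            "member": members,
        },
    }


# 25 made-up rooms, identified 1.001 to 1.025.
rooms = [
    {"@type": ["FeatureOfInterest", "Room"], "identifier": f"1.{n:03d}"}
    for n in range(1, 26)
]
page = collection_page(rooms, page=2, page_size=10)
```

The trade-off is one extra level of nesting for clients in exchange for a clear place to put collection-level facts.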

Seems slightly strange to me that the collection members are an object rather than an array. Is this the JSON-LD way? Likewise isFeatureOfInterestOf is an object.

This is known as node identifier indexing in JSON-LD. In short, these two approaches are identical if they're expanded:

  "member": {
      "https://playground.dev.urbanobservatory.ac.uk/api/room/1.002": {
          "@type": [
              "FeatureOfInterest",
              "Room"
          ],
          "identifier": "1.002",
          "title": "Room 1.002"
      }
  }

  "member": [
      {
          "@id": "https://playground.dev.urbanobservatory.ac.uk/api/room/1.002",
          "@type": [
              "FeatureOfInterest",
              "Room"
          ],
          "identifier": "1.002",
          "title": "Room 1.002"
      }
  ]

My personal preference is that we should use the @container form with objects rather than arrays, purely because it makes writing the JavaScript to process it a bit more logical if you're looking for a specific ID.
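The equivalence of the two forms, and the convenience of the object form for lookups, can be sketched in a few lines of Python. The helper names `index_by_id` and `to_array` are hypothetical, and the sketch assumes the context declares the member term with an `@id` container, which is how JSON-LD 1.1 expresses node identifier indexing.

```python
def index_by_id(members):
    """Array form to map form: key each node by its @id and drop @id from the value."""
    return {m["@id"]: {k: v for k, v in m.items() if k != "@id"} for m in members}


def to_array(member_map):
    """Map form to array form: restore @id from the map key."""
    return [{"@id": node_id, **node} for node_id, node in member_map.items()]


room_id = "https://playground.dev.urbanobservatory.ac.uk/api/room/1.002"
as_array = [
    {
        "@id": room_id,
        "@type": ["FeatureOfInterest", "Room"],
        "identifier": "1.002",
        "title": "Room 1.002",
    }
]

as_map = index_by_id(as_array)

# With the map form, finding a specific room is a single key lookup.
title = as_map[room_id]["title"]
```

With the array form the same lookup would need a scan over the members comparing each `@id`.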

Should the observation IDs also include the result time? I.e. so it's a unique ID for that particular observation.

I'm not sure, to be honest. The problem we have is that SSN/SOSA doesn't say anything about having timeseries or historic observations, it's simply not in scope. The ssn-ext ontology does have ObservationCollection types, but doesn't give any examples.

I think you're right, and we should probably include either the timestamp in the IRI or some clear indication that it's the latest observation. The reason I think the latter is an important option is that we might have some APIs that don't provide access to historic data at all, as in my current example for a USB API. We're likely at Newcastle to separate out the archival of observations from the access to observations, as part of a move towards being more SOA.
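One way the two kinds of identifier could look, as a sketch only. The base URL, the path layout and the `latest` segment are all assumptions made for illustration; nothing here reflects an agreed IRI scheme.

```python
from urllib.parse import quote

# Placeholder base URL, assumed for illustration only.
BASE = "https://api.example.org/observations"


def observation_iri(sensor_id, result_time=None):
    """Build an observation IRI that ends in the result time, or in 'latest'."""
    suffix = "latest" if result_time is None else quote(result_time, safe="")
    return f"{BASE}/{quote(sensor_id, safe='')}/{suffix}"


# A unique, stable identifier for one historic observation.
historic = observation_iri("sensor-123", "2020-02-18T17:24:16.094Z")

# A moving identifier for APIs that only expose the most recent value.
latest = observation_iri("sensor-123")
```

Percent-encoding the timestamp keeps the colons out of the path segment, at some cost to readability.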

For each room, are we able to include a link to all the observations collected in that room? What's the observation that is shown? The latest? Do we need to make this clear?

If you're content with that proposal above, then should we introduce a new type in our vocabulary for ObservationLatest? There's probably some other ways we could express it, but I'd rather avoid tagging them in a "latest": true style.

Regarding the CollectionMeta, can it show a link to the next page? Assuming there are more rooms.

I'm planning on extending it to use JSON Hyper-Schema for the pagination in collections. That said, there is always the option of using both a JSON-LD prev/next link and a JSON Hyper-Schema. They wouldn't conflict with each other, and obviously not all clients are going to be able to interpret a JSON Hyper-Schema document (very few, I suspect). The schema approach does have advantages for the filter options though, it provides a machine-readable way of saying "how do I filter this collection to only give me the rooms with a temperature above 21 degrees" in a way that would be quite difficult to do in JSON-LD (unless someone finishes off this bit of the Hydra standard...).

Sometimes we need to use weird descriptions for the ObservedProperty. For example, in some of our traffic cameras, pedestrian traffic is qualified as "towards the city centre" or "from the city centre". I'm not too sure how one would make use of such descriptions. This is actually why I initially thought of introducing "qualifiers". Thinking about it, couldn't we add "qualifiers" to our muo vocabulary?

I think there's a few options for how to go about this:

Shall we allow (or require) "discipline" to be an array (and maybe use the plural form "disciplines")?

I would support the use of discipline as a property we associate with observations/sensors/platforms/features of interest. I would keep the key name singular, consistent with the other SSN keys, but as above we can make use of plurals in the addresses/IRIs, for example:

{
  "@context": {
    "uo": "https://urbanobservatory.github.io/standards/vocabulary#",
    "discipline": "uo:discipline",
    "uo-discipline": "https://urbanobservatory.github.io/standards/vocabulary/disciplines#"
  },
  "discipline": [
    "uo-discipline:Transport"
  ]
}
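For anyone unfamiliar with how those prefixes resolve, the following Python is a deliberately simplified sketch of prefix expansion against the context above. It is not a JSON-LD processor (a real one handles many more cases), and the function name `expand` is made up for this example.

```python
context = {
    "uo": "https://urbanobservatory.github.io/standards/vocabulary#",
    "discipline": "uo:discipline",
    "uo-discipline": "https://urbanobservatory.github.io/standards/vocabulary/disciplines#",
}


def expand(term, context):
    """Expand a term or compact IRI (prefix:suffix) against a flat context."""
    if term in context:
        # A plain term: look it up, then expand whatever it maps to.
        return expand(context[term], context)
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context and not suffix.startswith("//"):
        # A compact IRI whose prefix is defined in the context.
        return context[prefix] + suffix
    # Already an absolute IRI, or something we do not recognise.
    return term


key_iri = expand("discipline", context)
value_iri = expand("uo-discipline:Transport", context)
```

Here `key_iri` resolves into the main vocabulary and `value_iri` into the disciplines vocabulary. As far as I understand JSON-LD, a real processor would only treat the array values as IRIs if the discipline term were typed accordingly (e.g. with `"@type": "@id"` in its definition), so that may be worth adding to the context.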

I'm not going to have time to extend the playground code to use JSON Schema in time for tomorrow's call, but I'm sure there's plenty we can discuss based on the above. Pull requests very much welcome if you want to tweak the code based on the above.

Thanks again for your comments.