urbanobservatory / standards

Standards and schema documentation for the observatories programme

How will we handle non-numeric observations? #5

Open SiBell opened 5 years ago

SiBell commented 5 years ago

The SSN examples of an Observation result typically show a property called numericValue. Is this too restrictive if we need to store qualitative data too? E.g. what if we need to store a string describing the colour of a vehicle. Should we favour value instead? Or is there an equivalent we should use for strings, e.g. textValue?

I suspect most databases won't like us mixing strings with numbers...

lukeshope commented 5 years ago

Good question, this has made me think about a whole bunch of other things...

On the specific question, numericValue is part of the QUDT ontology, so if the Result's @type is different to qudt-1-1:QuantityValue then you wouldn't have a numericValue key. The Result @type could be xsd:string or schema:image for example instead.

Numeric

{
  "@type": "sosa:Observation",
  [...],
  "hasResult": {
    "@type": ["sosa:Result", "qudt-1-1:QuantityValue"],
    "unit": "qudt-unit-1-1:DegreeCelsius",
    "numericValue": "22.4"
  },
  "resultTime": "2019-05-29T18:30:00Z"
}

String

{
  "@type": "sosa:Observation",
  [...],
  "hasResult": {
    "@type": ["sosa:Result", "xsd:string"],
    "@value": "Fishcake"
  },
  "resultTime": "2019-05-29T18:30:00Z"
}

Image

{
  "@type": "sosa:Observation",
  [...],
  "hasResult": {
    "@type": ["sosa:Result", "schema:image"],
    "@id": "https://file.newcastle.urbanobservatory.ac.uk/camera-feeds/GH_A692B1/20190529/183429.jpg"
  },
  "resultTime": "2019-05-29T18:30:00Z"
}

Disclaimer: I haven't run any of the above through a validator...

Let me know your thoughts on the above examples, but I think it might help to alias numericValue and @value to just value for the sake of simplicity in clients? Multiple contexts should be allowed, the latest one taking precedence. I think this would look like...

{
  "@type": "sosa:Observation",
  [...],
  "hasResult": {
    "@context": {
      "value": "@value"
    },
    "@type": ["sosa:Result", "xsd:string"],
    "value": "Fishcake"
  },
  "resultTime": "2019-05-29T18:30:00Z"
}
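
To illustrate why a single alias would simplify clients, here is a minimal Python sketch of the lookup a client needs while the key still varies by result type. The helper name `result_value` and the order in which keys are checked are assumptions for illustration, not part of any agreed schema.

```python
from typing import Any

# Keys a Result might carry its payload under, in the order this sketch checks them.
# "value" is the proposed alias; the other two are the un-aliased forms shown earlier.
VALUE_KEYS = ("value", "@value", "numericValue")


def result_value(observation: dict[str, Any]) -> Any:
    """Return the payload of an observation's hasResult, whichever key holds it."""
    result = observation["hasResult"]
    for key in VALUE_KEYS:
        if key in result:
            return result[key]
    # Image results in the earlier example carry an @id rather than a value.
    return result.get("@id")


numeric = {"hasResult": {"@type": ["sosa:Result", "qudt-1-1:QuantityValue"], "numericValue": "22.4"}}
aliased = {"hasResult": {"@type": ["sosa:Result", "xsd:string"], "value": "Fishcake"}}
print(result_value(numeric), result_value(aliased))
```

With the alias in place everywhere, the loop would collapse to a single `result["value"]` lookup.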

I think it's also worth mentioning that it's possible to have enumerated types, such as schema.org's GenderType, which is used in this JSON-LD best practice document. Enumerated types would be preferable over free-strings, but I can think of a few instances we have in Newcastle that would be free text, like the roadside variable message signs.

A more complicated problem... we have some instruments that generate matrices or histograms, like laser precipitation monitors with one dimension representing size, the second velocity, and particles binned into these. Need to think if there's a way to represent that data without a @type on every single value. Our current approach isn't really machine readable, and frankly is pretty confusing to humans too.


SiBell commented 5 years ago

Some great suggestions.

I like the idea of using an alias so it's always value.

Enumerated types make sense.

In my mind there's nothing wrong with:

{
  "@type": "sosa:Observation",
  "hasResult": {
    "@context": {
      "value": "@value"
    },
    "@type": ["sosa:Result", "uo:laser-precip"],
    "value": [[0,1,0],[1,1,1],[0,1,0]]
  },
  "resultTime": "2019-05-29T18:30:00Z"
}

i.e. the value can just as easily be an array of arrays as it can a string or number, just so long as the @type indicates as such. But not sure whether that's valid in terms of SSN ontology and JSON-LD?

On a side note, I assume the @type is an array because it is both a sosa:Result and a uo:laser-precip?

I suspect the reality is that observations of different types will have to be stored in different database collections/tables. i.e. your laser precip meter arrays won't be stored alongside variable message sign strings, but would be merged together in response to an api call for all sensor observations within a given bounding box for example.

lukeshope commented 5 years ago

I think it's valid to have an array of arrays as the value, as long as we add a custom type that describes it in more detail. As far as I can tell, SSN doesn't place any restrictions on what you assign as a Result.

Another troublesome example we've come across is METAR strings and SYNOP codes from present weather sensors. The 'current weather' is generated by considering all of the bits of the string against a dictionary, so "+RA" would be heavy rain. I suppose in these cases we could just have a value type for METAR.
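
As a rough sketch of the dictionary lookup described here, the Python below decodes a single present-weather token. The function name `decode_present_weather` is hypothetical and the two dictionaries are a deliberately tiny subset; real METAR present-weather groups include many more descriptors and phenomena.

```python
# Illustrative subset only; real METAR present-weather groups have many more codes.
INTENSITY = {"+": "heavy", "-": "light"}
PHENOMENA = {"RA": "rain", "SN": "snow", "DZ": "drizzle"}


def decode_present_weather(token: str) -> str:
    """Translate a token such as "+RA" into text such as "heavy rain"."""
    intensity = ""
    if token[:1] in INTENSITY:
        intensity = INTENSITY[token[0]] + " "
        token = token[1:]
    if token not in PHENOMENA:
        raise ValueError(f"unknown present-weather code: {token!r}")
    return intensity + PHENOMENA[token]


print(decode_present_weather("+RA"))  # heavy rain
```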

Your understanding of @type is correct. You can have as many as you need, and the order is insignificant. It makes sense that we should probably have our own vocabulary and use these as additional types for sensors and observations to remove any ambiguity.
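
Because the order is insignificant, and JSON-LD also allows @type to be a single string rather than an array, a client is probably safest treating it as a set. A minimal sketch, with `types_of` as a hypothetical helper name:

```python
def types_of(node: dict) -> set[str]:
    """Return a node's @type values as a set, so that order never matters."""
    raw = node.get("@type", [])
    # A lone string is valid JSON-LD shorthand for a one-element array.
    return {raw} if isinstance(raw, str) else set(raw)


result = {"@type": ["uo:laser-precip", "sosa:Result"], "value": [[0, 1, 0], [1, 1, 1], [0, 1, 0]]}
print("sosa:Result" in types_of(result))  # True
```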

I think it's up to you how you store the data in the database, but as an example, with the USB API we have different tables for bool, int, string, json, real etc. that are all joined to a timeseries, so a UNION SELECT does the trick if you had no prior knowledge of what type the timeseries was (in reality we always look this up and just hit the right table). We store arrays as json with Postgres, even though Postgres has its own array types. It all works fine for us, because only our own code interacts with the database, and that knows how to handle these tables. The other option is to use a variant type column, or a string, but then you run into problems with arithmetic, averages, roll-ups, precision.
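
The typed-tables approach can be sketched roughly as follows. This is not the USB API schema: the table and column names are invented for illustration, and an in-memory SQLite database stands in for Postgres so the example is self-contained. SQLite is loosely typed, so mixing REAL and TEXT in one result column works here; Postgres would need the columns cast to a common type before a UNION.

```python
import sqlite3

# Hypothetical schema: one table per value type, each keyed to a timeseries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeseries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE value_real (timeseries_id INTEGER, time TEXT, value REAL);
    CREATE TABLE value_string (timeseries_id INTEGER, time TEXT, value TEXT);
    INSERT INTO timeseries VALUES (1, 'air-temperature'), (2, 'message-sign');
    INSERT INTO value_real VALUES (1, '2019-05-29T18:30:00Z', 22.4);
    INSERT INTO value_string VALUES (2, '2019-05-29T18:30:00Z', 'Fishcake');
""")


def read_series(timeseries_id: int) -> list[tuple]:
    """Read a series without knowing in advance which typed table holds it."""
    query = """
        SELECT time, value, 'real' AS kind FROM value_real WHERE timeseries_id = ?
        UNION ALL
        SELECT time, value, 'string' AS kind FROM value_string WHERE timeseries_id = ?
        ORDER BY time
    """
    return conn.execute(query, (timeseries_id, timeseries_id)).fetchall()


print(read_series(1))
print(read_series(2))
```

Looking up the series type first and querying only the matching table, as described above, avoids the union altogether.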

SiBell commented 5 years ago

All makes sense. Thanks for the clarification.

The METAR one is an interesting one. I agree that we should have a value type for METAR. e.g:

{ 
  "@type": "sosa:Observation", 
  "hasResult": { 
    "@context": { "value": "@value" }, 
    "@type": ["sosa:Result", "uo:metar"],
    "value": "METAR LBBG 041600Z 12012MPS 090V150 1400 R04/P1500N R22/P1500U +SN BKN022 OVC050 M04/M07 Q1020 NOSIG 8849//91="
  }, 
  "resultTime": "2019-05-29T18:30:00Z" 
}

But we shouldn't force the end user to perform the conversion to "current weather" themselves. I think we'll have a fair few of these equivalent categorical values. We'll need to standardise a way of including them in the response. In a previous project I was storing battery voltages, but these were fairly meaningless unless you knew the capacity of the battery. It was therefore the job of the microservice that ingested data from that particular set of devices to assign a categorical equivalent, e.g.

{
  time: '2019-05-29T18:30:00Z',
  sensor: 'road-sensor-01-voltage',
  value: '2.9',
  category: 'low'
}

The front end then knew to use an icon for a low battery. We'll need to decide if both the value and categorical equivalent can exist in the same observation object, or if they need to be kept separate.
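
For the option where both live in the same object, the ingest step might look something like the Python sketch below. The function names, the 3.0 V threshold, and the "ok" label are all made up for illustration; only the "low" category and the field names come from the battery example.

```python
def battery_category(voltage: float, low_below: float = 3.0) -> str:
    """Map a voltage to a coarse category; the 3.0 V threshold is invented."""
    return "low" if voltage < low_below else "ok"


def with_category(observation: dict) -> dict:
    """Return a copy of the observation with a categorical equivalent attached."""
    category = battery_category(float(observation["value"]))
    return {**observation, "category": category}


obs = {"time": "2019-05-29T18:30:00Z", "sensor": "road-sensor-01-voltage", "value": "2.9"}
print(with_category(obs)["category"])  # low
```

Keeping the raw value alongside the category means a client that does know the battery capacity can still apply its own thresholds.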

Cheers for the heads up regarding UNION SELECT. I'm planning to put an internal REST or AMQP interface in front of our TimescaleDB database. Thus, as with your system, only my own code will be making SQL queries.