opengeospatial / sensorthings

The official web site of the OGC SensorThings API standard specification.

[FeatureRequest] Support for OData Extension for Data Aggregation Version 4.0 #71

Open riedel opened 5 years ago

riedel commented 5 years ago

See the related specification (OData Extension for Data Aggregation Version 4.0).

SensorThings is at the moment hardly usable in dashboard applications if the sampling frequency is high. Currently we have to fetch 2.5 MB to display a chart for one day (15 s sampling). Aggregation should be doable on the server side.

Support for a fitting grouping function would also be needed in this case: a function such as time_bucket would be highly welcome.
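
A rough sketch of what such a server-side aggregation request could look like, assuming a server implemented the OData Data Aggregation extension on the Observations collection. Only the filter/groupby/aggregate transformations come from the OData extension; the time_bucket() grouping function and the service URL are hypothetical and used here purely for illustration.

```python
# Hypothetical request: average Observation results per 15-minute bucket for one day,
# assuming the server supported OData $apply transformations. filter/groupby/aggregate
# are defined by the OData aggregation extension; time_bucket() and the service root
# are made up for this sketch.
import requests

base = "https://example.org/SensorThingsService/v1.0"  # placeholder service root

apply_expr = (
    "filter(phenomenonTime ge 2019-06-01T00:00:00Z and "
    "phenomenonTime lt 2019-06-02T00:00:00Z)"
    "/groupby((time_bucket(phenomenonTime, duration'PT15M')),"
    " aggregate(result with average as avgResult))"
)

resp = requests.get(f"{base}/Datastreams(1)/Observations", params={"$apply": apply_expr})
print(resp.json())
```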

(For most map applications, grouping based on spatial clusters would also make a lot of sense, but this is highly complex.)

KathiSchleidt commented 5 years ago

Hi Till, I'm trying to understand how this could be placed within the context of the STA standard. For the moment, the standard is designed to provide data; it is not a processing service such as you describe. However, the processing functionality would make a great deal of sense in the context of the upcoming SensorThings Part 3 - Rules. While at its core that work is focused on linking inputs from sensors with actions to be performed via actuators, aggregation and other processing functionality is exactly what I've found to be missing in the initial proposals. Maybe you could provide support on integrating the OData Extension for Data Aggregation with the work on Rules?

riedel commented 5 years ago

Sorry to completely disagree on this one.

The OData extension also provides no relevant input to Rules, as it does not define any event-condition-action logic, etc.

Aggregation is fundamentally different from rules, so IMHO it makes no sense to integrate it into Rules. Grouping and aggregation are as much a part of a database as $filter (there is pretty much no SQL database that does not provide GROUP BY).

To my understanding, ST is a geospatial time-series database and should thus support the query operations relevant to that domain.

As OData is part of Part 1, it would be easy to integrate aggregation into that part of the standard in a next revision, maybe in the sense of an extended profile.

Maybe it does not even need to be part of the standard, but then it will be hard to integrate into any implementation (like FROST).

hylkevds commented 5 years ago

Personally, I think data aggregation is best done by an external tool, with the result pushed back as new observations on a separate (Multi)Datastream. The two main reasons are:

  1. The correct way to aggregate sensor data depends very much on the sampling regime of the sensor. For a sensor that only transmits a value when there is an actual change (quite common), taking a straight numeric average will just produce nonsense (see the sketch after this list).
  2. Aggregation functions are computationally expensive. If certain aggregate results are important for a use case, it's better to pre-calculate those specific results once than to try to calculate them on the fly. It's not a good idea to allow every user to run arbitrary aggregation functions on an Observation collection with a 9+ digit row count.
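
To make point 1 concrete, here is a toy sketch (made-up numbers, plain Python) of how a naive mean over the samples of a change-only sensor diverges from a time-weighted mean that accounts for how long each reported value was held:

```python
# Change-only sensor: the value 20.0 holds for 50 minutes, then a burst of changes.
# A naive mean over the *samples* is dominated by the burst; a time-weighted mean
# (each value held until the next report) reflects what the signal actually did.
samples = [(0, 20.0), (50, 25.0), (51, 30.0), (52, 35.0), (53, 40.0)]  # (minute, value)
window_end = 60

naive_mean = sum(v for _, v in samples) / len(samples)  # 30.0

# Weight each value by the time until the next report (or the window end).
next_times = [t for t, _ in samples[1:]] + [window_end]
weighted_sum = sum(v * (t_next - t) for (t, v), t_next in zip(samples, next_times))
time_weighted_mean = weighted_sum / (window_end - samples[0][0])  # ~22.8

print(f"naive mean: {naive_mean:.1f}, time-weighted mean: {time_weighted_mean:.1f}")
```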

I think it would be valuable to have a best practice for setting up aggregate (Multi)Datastreams, to get consistency between use cases.

riedel commented 5 years ago

I mostly disagree with both statements:

  1. Window-based down-/re-sampling of data works for at least 90% of all use cases I have seen in the last 10 years of working with sensor data, and it actually works equally well for irregular or regular samples. If you can plot something, you can calculate the average (area under the curve divided by time); a sketch of such window-based down-sampling follows this list. I agree, however, that intervals pose a bit of a challenge here (most time-series databases don't support them), because you need a weighted sum. But how many sensors actually do aggregate sampling over irregular timeframes? Supporting it naively would leave it to the user to apply it reasonably, and I think there is huge demand from anyone who wants to build a simple solution without requiring tons of layers of software. (A true problem is aggregating spatial features, but I think an 80% solution is better than none in this case.)

  2. Time-series databases like OpenTSDB and InfluxDB demonstrate that aggregates can be computed at high speed. I would even argue that the computation time might be lower than the I/O time of passing everything through the system. A user querying the whole Datastream causes the same load and additionally costs bandwidth, so I cannot follow that argument. I have been using OpenTSDB on large sensor data sets without ever running into huge problems. Aggregate generation based on GROUP BY statements can actually be parallelized quite well by the underlying database. On the other hand, I bet I can already write filter expressions against the current implementation that consume much more time. I think a standard should not dictate to people what is a good idea and what is not.
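
As a rough illustration of the window-based down-sampling described in point 1 (made-up data, plain Python, mirroring the 15 s / one-day numbers from the original post): group raw observations into fixed time buckets and average each bucket, which is the server-side equivalent of a GROUP BY over a time bucket.

```python
# Down-sample one synthetic day of 15 s observations into 15-minute window averages.
from collections import defaultdict

# Synthetic data: one (seconds-since-midnight, value) pair every 15 s for a day.
observations = [(i * 15, 20.0 + (i % 40) * 0.1) for i in range(5760)]
bucket_size = 15 * 60  # 15-minute windows

buckets = defaultdict(list)
for t, v in observations:
    buckets[t // bucket_size].append(v)

# One averaged point per window instead of 60 raw points per window.
downsampled = [(b * bucket_size, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]
print(len(observations), "raw points ->", len(downsampled), "window averages")  # 5760 -> 96
```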

If I wanted a solution based on a real big-data architecture with precomputed views, I would probably not use SensorThings and a transactional database in the first place (I can modify and delete observations, which makes consistency of any aggregate very difficult!). I also typically do not need all the querying capabilities if I know in advance what I want. (On a side note: we are implementing something like this, but SensorThings actually gives us huge headaches because, due to its statefulness, it cannot easily be used as a view.)

All I would be asking is that the standard state that a server "may" implement OData Aggregation, and that there is a defined semantic for it if one chooses to.

taniakhalafbeigi commented 5 years ago

Aggregation comes up in several of our use cases too, and I think it is worth discussing in a SWG meeting. The spec only defines how users interact with the API, and I agree that if users use it in a way that does not fit the nature of their data, that is not a problem of the specification. I think the implementation of aggregation could be challenging, but the implementation does not need to be as coherent as it looks to the end user: users simply interact with the API as specified, while the implementation can have several modules handling different parts, hidden from the user. I really think this is very valuable. Aggregation is similar to the GeoJSON format discussed in #70; we see both coming up in a lot of IoT use cases. It is always possible to handle it outside the standard itself, but if a requirement comes up in a lot of IoT use cases we should try our best to accommodate it in the API, considering that SensorThings is an IoT standard.