orchestracities / ngsi-timeseries-api

QuantumLeap: a FIWARE Generic Enabler to support the usage of NGSIv2 (and NGSI-LD experimentally) data in time-series databases
https://quantumleap.rtfd.io/
MIT License

Historical data representation #164

Open · c0c0n3 opened this issue 5 years ago

c0c0n3 commented 5 years ago

Summary

We tend to think of historical data as time series and we represent them accordingly in QL. But is this really a good fit for all kinds of historical data we collect? It surely works well for measurements taken at regular intervals, but not so well for tracking state changes. We need to do some brainstorming to figure out how to store data in a way that's then easy to query. So let's get the ball rolling...

Detail

Ideally each time series we collect should be a time-indexed sequence of device data observed at regular intervals, e.g. every 5 seconds. This is akin to sampling in signal processing and makes it easy to write queries for feeding data into visualisation tools like Grafana.
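When sampling holds, queries stay simple. For instance, a typical downsampling query to feed a Grafana panel might look like the sketch below (table and column names are hypothetical, $-prefixed values are placeholder parameters, and Crate's SQL dialect may differ in details):

-- Average temperature per minute over the requested window.
SELECT date_trunc('minute', time_index) AS bucket,
       avg(temperature) AS avg_temperature
FROM temperature_sensor_entity
WHERE time_index BETWEEN $from AND $to
GROUP BY bucket
ORDER BY bucket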

While this may be true of data collected from devices that send measurements at regular intervals (think temperature sensor), it's definitely not the case for other kinds of devices that only send out data when the device state changes. For example, a parking sensor may only send a notification message when a car enters or leaves a parking bay, which makes it tricky to write queries to sample device state---e.g. how many bays were free between 2 and 3pm in the past week? It could be that some bays freed up before 2pm, so we have no corresponding data points between 2 and 3pm, which will cause the query to return bogus results.

We have to figure out how to represent this kind of data with a time series or come up with an alternative arrangement that makes it easy to query device state.

How to cope with missing data?

The problem here boils down to how we interpret the data. The conceptual model we have in mind when we think of a time series is that of a sequence of data points d[t[n]] indexed by a sequence t[n] where:

  1. t[n] is a point in time.
  2. t[n] < t[n+1]
  3. t[n+1] - t[n] = c for some constant c, i.e. the time points are equally spaced in time.

We then write queries against data collected for a time period of interest P by making a couple more implicit assumptions:

  4. c is "reasonably" small
  5. we can partition P into "sufficiently" large intervals [p[n], p[n+1]] so that each interval will contain some data points---e.g. c = 5 secs and p[n+1] - p[n] = 1 min.

The trouble is that in practice we can't be sure (3) holds true, and as a result (4) and (5) go out of the window. So there will be situations where some or even most of the intervals [p[n], p[n+1]] contain no data points. How are we supposed to interpret the absence of data? Well, it depends on the kind of device that's sending us data and on possible software errors. If the device is measuring a continuous, time-varying quantity (e.g. temperature) at periodic intervals m[k], then it's tempting to say that lack of data for an interval [p[n], p[n+1]] implies a software error---e.g. Orion didn't notify QL. But that assumes for each n there exists some k such that m[k] ∈ [p[n], p[n+1]]. Perhaps a fair assumption if (3), (4), and (5) hold true. On the other hand, if the device is only notifying us of state changes (e.g. parking sensor), lack of data typically means that the state in the most recent notification is still the current state.
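For state-change devices, the natural way to "sample" the state at an arbitrary time point is therefore to carry the most recent notification forward. A minimal sketch in standard SQL, assuming a hypothetical parking_bay_entity table with entity_id, time_index and state columns:

-- State of bay 'bay:42' at 2pm: the latest notification at or before that time.
SELECT state
FROM parking_bay_entity
WHERE entity_id = 'bay:42'
  AND time_index <= '2019-05-20 14:00:00'
ORDER BY time_index DESC
LIMIT 1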

Example

Here's a visual representation of the parking lot example where it's easier to see how trying to represent discrete dynamical systems with time series may lead to bogus results when querying the data.

(figure: parking-lot-example)

For example, how would we write a query to figure out the free-to-occupied ratio for the given period of interest? Or to find out how many bays were free/occupied on average? In the diagram above there happens to be just one data point of the same kind (free/occupied) in each interval [p[n], p[n+1]], but if we partitioned the period of interest into intervals double the size, then the interval [p2, p4] would contain two "occupied" data points as well as two "free" data points. How would that affect the previous queries?
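To make the pitfall concrete, here's a naive sketch (hypothetical table and column names) that counts "free" data points per hourly bucket; any bay that last reported "free" before the period of interest simply never shows up in the result:

-- Naive per-bucket count: silently ignores bays with no notifications in the bucket.
SELECT date_trunc('hour', time_index) AS bucket,
       count(*) AS free_data_points
FROM parking_bay_entity
WHERE time_index BETWEEN '2019-05-20 14:00:00' AND '2019-05-20 15:00:00'
  AND state = 'free'
GROUP BY bucket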

c0c0n3 commented 5 years ago

Here's an algebraic "interpretation" of what it means to keep a historical record of device data. It'll come in handy later when discussing this issue further.

(figure: automaton-history)

c0c0n3 commented 5 years ago

Here's another example where interpreting the data as a "plain" time series could result in misleading analytics. Say we have a waste container with a sensor that sends us periodic waste level measurements in kilograms:

(figure: waste-container-example)

How would we go about answering basic questions like: how much waste did the container collect from time point p0 to p1? The first thing that springs to mind is writing a query to sum waste level data in the QL time series, something along the lines of:

SELECT sum(waste_level)
FROM waste_container_entity
WHERE p0 <= time_index AND time_index <= p1
  AND entity_id = 'device:25'

which would bring back 440kg. But a closer look at the diagram above reveals the right answer is actually 130kg---only considering values collected in the time series. The problem with the query is that just looking at the data points in the time series without considering waste container state (i.e. when it gets emptied) isn't enough to get the right answer. In fact, we need to consider the sequence t[i] of time points at which the container got emptied and the time points q[j] defined by

(q[0] = p0) ≤ (q[1] = t[n[1]]) < (q[2] = t[n[2]]) < ... < (q[k] = t[n[k]]) ≤ (q[k+1] = p1)

where the t[n[j]] are the t[i] that fall in the interval [p0, p1]---if any. Then we should take the max waste level observed in each interval

m[j] = max { waste_level(t) | t ∈ [q[j-1], q[j]] }

and sum them up to get the right answer:

waste collected from p0 to p1 = ∑ m[j], summing over j = 1, ..., k+1
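In SQL, this per-cycle max-then-sum becomes straightforward once each data point carries a label identifying its emptying cycle---essentially the step attribute proposed further down in this thread. A hedged sketch, assuming a hypothetical step column and $-prefixed placeholders for the query parameters:

-- Sum the maximum waste level observed in each emptying cycle.
SELECT sum(cycle_max) AS waste_collected
FROM (
    SELECT max(waste_level) AS cycle_max
    FROM waste_container_entity
    WHERE entity_id = 'device:25'
      AND $p0 <= time_index AND time_index <= $p1
    GROUP BY step
) AS per_cycle

which, on the data in the diagram, would bring back 130kg rather than 440kg.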
c0c0n3 commented 5 years ago

So we'll have to be able to query device data not only in relation to time but also to device state and location. While location shouldn't be too much of a problem (well, at least if we can assume devices are immovable), state is trickier. So let's start with a conceptual model of devices, observations, time and state. Then we'll build on this to come up with a possible solution for the kind of dashboard queries we'd like to support initially.

Conceptual model

Devices collect data about physical resources or systems and may reflect their state---e.g. an occupied parking bay, an empty waste container, current temperature at some place. So devices give us information about how some system evolves over time. Things of interest typically include how the observed data relate to resource state and time.

For example:

  1. how many parking bays were free between 2 and 3pm?
  2. how much waste did a container collect between two time points?
  3. what was the availability ratio of an EV charging station over the past week?
To answer such questions we have to be able to relate data points to the resource state at the time when they were collected. If we model resources as automatons, then a device collects data about an automaton, so each data point has an associated automaton state and the order in which data points are collected has to agree with the automaton's evolution over time. Since the queries we'll have to support only involve automaton state (events and transitions aren't relevant), we can "squeeze" the automaton into a simpler system where there are no self-transitions and all transitions from one state to another get collapsed into one. Here's a diagram to sum up the conceptual model.

(figure: automaton-history-with-prosets)

The state evolution map tracks how the system evolves in response to events that trigger transitions to new states: starting from an initial state at 0, when an event results in a transition out of this state, the evolution step count becomes 1 and gets associated with the new state, and so on. So the state evolution map records a path in the automaton graph. On the other hand, the step map ties observations to state evolution: it tells you which data points were collected in which state. Finally, the time index records when an observation was made, so that if we take the three maps together we're able to tell that e.g. data point d[5] was collected at 10:55 while the system was in the x state.
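In storage terms, the three maps boil down to a couple of extra fields per record. A hedged DDL sketch with hypothetical names (actual Crate types and the QL schema may differ):

-- One row per observation: the time_index, step and state columns materialise
-- the time index, step map and state evolution map respectively.
CREATE TABLE entity_history (
    entity_id  TEXT,       -- which resource/automaton the data point belongs to
    time_index TIMESTAMP,  -- when the observation was made
    step       INTEGER,    -- evolution step count at observation time
    state      TEXT,       -- automaton state at observation time
    payload    TEXT        -- the observed data point d[n]
)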

Note: Algebraic model

@taliaga @chicco785 A bit of a handwavy explanation, let me know if I should add more detail and/or make it clearer. For the record, there's a precise and more concise way of saying all this, namely that those maps are functors among prosets, but I don't think we need to get bogged down with the mathematical details? Also note this model came about as a simplification of the initial one I sketched out in a previous comment since I felt we don't need that level of generality at this stage?

Worked example

Here's an example of how to put all that abstract nonsense to good use to write relatively simple SQL queries for seemingly hairy scenarios. Suppose we want to know the availability ratio (available vs unavailable) of an EV charging station over a given period of time (e.g. a week), where each data point comes with time and state (free, occupied, charging) attributes. Visually:

(figure: availability-ratio worked example, part 1)

Then it's relatively easy to work out how to compute the wanted ratio:

  1. take all records between time points p0 and p1 whose state is occupied (O) or charging (C);
  2. partition (group) the resulting set by declaring two records equivalent if there's a path between them (we can check there's a path since we have recorded the evolution steps);
  3. for each equivalence class (group), compute the time difference between the earliest and latest record;
  4. finally, sum these values and divide the result by p1 - p0.

Schematically:

(figure: availability-ratio worked example, part 2)

The left hand side exemplifies the process of getting the wanted ratio. But there's a snag: checking there's an evolution path between two records in SQL isn't the easiest thing you can think of. It can be done with a little effort, e.g. using recursive queries, but Crate doesn't support those. So we can use an alternative computation (see the right hand side) where we partition by grouping on the evolution step. As you can see from the diagram this isn't the right answer in general, but it could be close enough for practical purposes. In SQL, with $-prefixed placeholders for the query parameters:

SELECT sum(delta) / ($p1 - $p0) AS availability_ratio
FROM (
    -- assuming t is stored as a numeric (epoch) timestamp so that
    -- timestamp differences and the final division are plain arithmetic
    SELECT max(t) - min(t) AS delta
    FROM entity
    WHERE id = $entity_id
      AND $p0 <= t AND t <= $p1
      AND state IN ('O', 'C')
    GROUP BY step
) AS stints
Also we can easily extend this to work across entities to get the availability ratio for the whole parking lot: all we need to do is sum the availability ratio of each entity and divide by the number of entities. The trick actually works in general to compute the overall state ratio for a number of resources given their individual ratios, and here's why:

(figure: state-ratio-aggregation)
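A quick sanity check on that claim: take two charging stations observed over the same 10-hour window, with station A occupied-or-charging for 5 hours (ratio 0.5) and station B for 2.5 hours (ratio 0.25). The total busy time is 7.5 hours out of 2 × 10 = 20 station-hours, i.e. an overall ratio of 0.375---which is exactly (0.5 + 0.25) / 2, the mean of the individual ratios.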

Note: Probabilistic model

Yep. Yet another handwavy explanation. Anyhoo, it's probably best to recast the above as a probabilistic model (excuse the pun!) whereby e.g. the overall ratio is the probability of a resource being in a given state at any time point within the chosen period of interest...

Note: State averages over time?!

While it often makes sense to ask how many resources are in a given state at a point in time (e.g. how many bays are free at the moment? How many bikes were available for rent yesterday at 11am?), it's not at all obvious (at least not to me!) what that's supposed to mean over a period of time. If I say that 7 bays were free yesterday, what would you make of it? That 7 were free throughout the day? Or that I probably meant "free on average"? Or what? It seems a popular interpretation is "on average". But taking averages in such cases is a slippery downhill road at best and might make no sense at all at worst. For example, consider the scenario sketched out in the diagram above where a period of interest is divided up into 8 sub-intervals and say, for the sake of argument, Quantum Leap has 24 data points with a state of "available" in that period of interest. Should we take the overall average availability to be 24/8 = 3?! That says in that period of time on average 3 resources were available, but hang on, we only have 2...

Proposed solution

We could extend Quantum Leap to store an additional two attributes for each entity:

  1. state: a label identifying the state of the underlying resource at observation time;
  2. step: the evolution step counter tying each data point to the state evolution.

While some devices provide an indication of the current state (e.g. parking sensor), others apparently don't (e.g. waste sensor), in which case we'll have to fill this in before inserting a data point into the DB. @taliaga made a very good point that this shouldn't happen in QL if we want to keep generality! On the other hand, the step counter can be computed in a totally application-agnostic way, simply by adding 1 to the last stored counter value if the state label is different---but be mindful of concurrency here!
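For illustration, here's a hedged sketch of that step computation in plain SQL (hypothetical table and column names; as written it's not concurrency-safe, so in practice inserts for the same entity would need to be serialised):

-- Insert a new data point, bumping the step counter only when the incoming
-- state differs from the most recently stored one.
-- (The very first data point for an entity needs a separate code path,
-- since the subquery below returns no rows.)
INSERT INTO entity_history (entity_id, time_index, state, step)
SELECT 'device:25', '2019-05-20 10:55:00', 'occupied',
       CASE WHEN prev.state = 'occupied' THEN prev.step
            ELSE prev.step + 1
       END
FROM (
    SELECT state, step
    FROM entity_history
    WHERE entity_id = 'device:25'
    ORDER BY time_index DESC
    LIMIT 1
) AS prev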

How to query the time series though? One option would be to implement a simple plugin architecture whereby QL makes provisions for registering modules implementing arbitrary queries and exposing them as REST endpoints. For example,

GET /queries/parking/availability?from=xxx&to=yyy&...

would be routed to the Python script availability.py in the parking directory within a configured plugin directory. This should be quite a straightforward thing to do and would save us from having to think about generic queries.

In fact, another option would be to come up with a (semi-)generic way of querying time series by partitioning them by state and time and then applying aggregate functions to those partitions---e.g. think of generalising the query in the above worked example. Then we'd need a way to encode that in a URL. While it can be done, I'm not entirely sure yet how to get a good level of generality here since we only have a handful of examples to look at. At the other end of the spectrum, we could go full-on generic: concoct a REST API that lets you encode almost arbitrary SQL queries (possibly an extension of the NGSI Simple Query Language?), use ASTs to represent, analyse, and translate them, etc. This approach also has the welcome side effect of killing two birds with one stone: refactoring our Crate translator and avoiding SQL injection...

chicco785 commented 5 years ago

@c0c0n3 there must be a simpler way than directly representing the state in QL for each entity. Consider that the same entity may have a different concept of state for different parameters. As a matter of fact, you can define "a function" to compute the state and steps dynamically. Or maybe you want to change (for whatever reason) the way state is computed (and hence how steps are defined).

c0c0n3 commented 5 years ago

@chicco785 yah, perhaps there's an even simpler way of doing this. But keep in mind that not storing the state info and computing it dynamically could have a huge negative impact on performance---e.g. how to group data points by step efficiently?

chicco785 commented 5 years ago

I am thinking of using views, but at the moment I suspect there is no way given the limitations of Crate-supported SQL. But I may be 100% wrong :)

github-actions[bot] commented 3 years ago

Stale issue message