snowplow / enrich

Snowplow Enrichment jobs and library
https://snowplowanalytics.com

Common: optimize weather data cache #124

Open chuwy opened 4 years ago

chuwy commented 4 years ago

Simplified example using a single spatial dimension called "displacement" rather than 2 spatial dimensions + time.

Assume that the user has specified a tolerance of 1km, meaning that our reported weather must be the actual weather somewhere within 1km of the event.

As I understand it, the current strategy is:

For each event, round its displacement to the nearest 2km. If the rounded displacement is in the cache, use the cached value; otherwise, look up the weather information, add it to the cache with the rounded displacement as key, and return it.
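
To make the comparison concrete, here is a minimal Python sketch of the current strategy as I understand it. The names `make_rounding_cache` and `get_weather` are hypothetical stand-ins, not the enrichment's actual API, and the weather is assumed to be a plain string.

```python
import math
from typing import Callable, Dict, Tuple


def make_rounding_cache(
    get_weather: Callable[[float], str],  # hypothetical weather API call
    step_km: float = 2.0,
) -> Tuple[Callable[[float], str], Dict[float, str]]:
    """Build a lookup that caches weather keyed by the rounded displacement."""
    cache: Dict[float, str] = {}

    def lookup(d: float) -> str:
        # Round the displacement to the nearest multiple of step_km.
        key = math.floor(d / step_km + 0.5) * step_km
        if key not in cache:
            # The API is queried at the rounded displacement, not at d itself.
            cache[key] = get_weather(key)
        return cache[key]

    return lookup, cache
```

With a 2km step, events at 3.7km and 4.4km both map to the key 4.0, so only one API call is made and both receive the weather at 4km.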

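The disadvantage of this strategy is that the returned results are unnecessarily inaccurate: even the event that populates a cache entry gets the weather at the rounded displacement, which can be up to 1km away, rather than the weather at its own position.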

An alternative strategy:

Instead of the cache being a Map[Displacement, Weather] we make it a Map[RoundedDownDisplacement, (ActualDisplacement, Weather)]. Each key represents an interval of 1km. For example, the key 7 represents the interval of all displacements between 7km and 8km.

For each event, let d be the actual displacement of the event in km. Check the cache for the keys floor(d-1), floor(d), and ceiling(d). If any of those keys is in the cache and the actual displacement stored under it is within 1km of the event's non-rounded displacement, use that entry's weather. Otherwise, call getWeather(d), add floor(d): (d, weather) to the cache, and return the weather.

For example, suppose an event has displacement 3.7km, and the cache looks like this:

{
  2: (2.3, sunny),
  4: (4.4, rainy)
}

First we see that floor(3.7-1)=2 is in the cache, but the difference between 2.3 and 3.7 is more than 1 (our specified tolerance), so we cannot use this cached value. Next we see that floor(3.7)=3 is not in the cache at all. Finally we see that ceiling(3.7)=4 is in the cache and 4.4 - 3.7 < 1, so we use "rainy" as the weather value for the event.

If 4 had not been in the cache, we would not be able to use any cached value and would have to look up 3.7 in the API and add it to the cache:

{
  2: (2.3, sunny),
  3: (3.7, slightly rainy)
}

Note that we look up the value from the API for 3.7km exactly, rather than 4km (as we would under the original strategy).
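
The alternative lookup and the 3.7km example above can be sketched roughly as follows. This is only an illustration under assumptions: `lookup_weather` and `get_weather` are hypothetical names, the weather is a plain string, and "within 1km" is taken to include a distance of exactly 1km.

```python
import math
from typing import Callable, Dict, Tuple

TOLERANCE_KM = 1.0  # assumed user-specified tolerance


def lookup_weather(
    cache: Dict[int, Tuple[float, str]],  # floor(d) -> (actual d, weather)
    d: float,
    get_weather: Callable[[float], str],  # hypothetical weather API call
) -> str:
    # Check the three candidate intervals: floor(d-1), floor(d), ceiling(d).
    for key in (math.floor(d - 1), math.floor(d), math.ceil(d)):
        entry = cache.get(key)
        if entry is None:
            continue
        actual, weather = entry
        # Only reuse the entry if its actual displacement is close enough.
        if abs(actual - d) <= TOLERANCE_KM:
            return weather
    # No usable entry: query the API at the exact displacement and cache it.
    weather = get_weather(d)
    cache[math.floor(d)] = (d, weather)
    return weather


cache = {2: (2.3, "sunny"), 4: (4.4, "rainy")}
result = lookup_weather(cache, 3.7, lambda d: "slightly rainy")
```

Here `result` is "rainy" and the API stand-in is never called, because the entry under key 4 lies 0.7km from the event. Without key 4, the same call would store `3: (3.7, "slightly rainy")` in the cache.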

It should be possible to extend this idea from a single dimension to 3 dimensions (latitude, longitude, and time). Instead of intervals of 1km, the keys in the cache would represent cubes of 1km x 1km x 1hr, and we would check up to 27 keys rather than up to 3.
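
A rough sketch of what the 3-dimensional version might look like, under several assumptions that are not settled in this issue: positions are already projected to kilometres (converting latitude and longitude to km is left out), time is in hours, and the tolerance is applied independently on each axis. The names `lookup_weather_3d` and `get_weather` are hypothetical.

```python
import itertools
import math
from typing import Callable, Dict, Tuple

Point = Tuple[float, float, float]  # (x_km, y_km, t_hr), assumed pre-projected
Cell = Tuple[int, int, int]         # floored coordinates of a 1km x 1km x 1hr cube


def lookup_weather_3d(
    cache: Dict[Cell, Tuple[Point, str]],
    p: Point,
    get_weather: Callable[[Point], str],  # hypothetical weather API call
    tol: Point = (1.0, 1.0, 1.0),         # assumed per-axis tolerance
) -> str:
    home = tuple(math.floor(c) for c in p)
    # Visit the home cube and its neighbours: 3 * 3 * 3 = 27 keys at most.
    for offset in itertools.product((-1, 0, 1), repeat=3):
        cell = tuple(h + o for h, o in zip(home, offset))
        entry = cache.get(cell)
        if entry is None:
            continue
        actual, weather = entry
        # Reuse the entry only if it is within tolerance on every axis.
        if all(abs(a - c) <= t for a, c, t in zip(actual, p, tol)):
            return weather
    # Nothing usable nearby: query at the exact point and cache under its cube.
    weather = get_weather(p)
    cache[home] = (p, weather)
    return weather
```

A per-axis check keeps the neighbour search aligned with the cubes. A Euclidean distance check would also work with the same 27 keys, but it needs a decision on how to weigh an hour against a kilometre.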

Any thoughts? Is there an easier way to achieve this? Does this make any sense at all?

chuwy commented 4 years ago

Migrated from https://github.com/snowplow/snowplow/issues/2158 (comments are auto-generated)