In Snowplow stream-processing apps, we often pass around types like List[Event] or Vector[Event] or Chunk[Event]. But which collection is best? Having now built several apps on top of the common-streams libraries, it is becoming clear what features we need from the data structure:
We often need to batch up small batches into larger ones, i.e. it is common to see a.events ++ b.events. This is a point in favour of Vector, because combining two vectors is fast, whereas for Lists you end up traversing and copying one of the lists.
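As a rough illustration, combining two micro-batches might look something like this (Event and Batch here are simplified stand-ins, not the real common-streams types):

```scala
// Simplified stand-in types, only for illustration.
case class Event(id: String)
case class Batch(events: Vector[Event])

// Combining two batches: with Vector the concatenation re-uses the
// underlying structure, whereas a List on the left-hand side would
// have to be traversed and copied.
def combine(a: Batch, b: Batch): Batch =
  Batch(a.events ++ b.events)
```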
Order is never important. Snowplow event-processing apps make no guarantees about what order events get processed/emitted/loaded (as long as we pay attention to checkpointing to Kinesis in strict order). This means we can take real advantage of things like List's fast prepend (head :: tail), even though it tends to build up a collection in reverse order.
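For example, a sketch of accumulating a batch with prepends (parse is a hypothetical helper, and Event is the stand-in type from above):

```scala
// Accumulate with List's O(1) prepend. The result comes out in reverse
// input order, which is fine because ordering within a batch does not
// matter to us.
def parseAll(inputs: List[String], parse: String => Event): List[Event] =
  inputs.foldLeft(List.empty[Event]) { (acc, raw) =>
    parse(raw) :: acc
  }
```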
We often need to separate events into good and bad, e.g. val (failedEvents, goodEvents) = parseAndSeparate(inputs). We've already made a fast implementation of this using features of List.
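A hypothetical sketch of what that split could look like, not the actual common-streams implementation, accumulating both results with cheap prepends in a single pass:

```scala
// Single pass over the input, building both lists with O(1) prepends.
// parse and the String error type are illustrative placeholders.
def parseAndSeparate(
  inputs: List[String],
  parse: String => Either[String, Event]
): (List[String], List[Event]) =
  inputs.foldLeft((List.empty[String], List.empty[Event])) {
    case ((failed, good), raw) =>
      parse(raw) match {
        case Right(event) => (failed, event :: good)
        case Left(error)  => (error :: failed, good)
      }
  }
```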
We occasionally need fast lookup by index (e.g. val failedEvent = events(42)), but only under unusual circumstances, e.g. when an event couldn't be inserted into a table because the table needs altering. We should not optimise for fast lookup by index, because it is so rare.
It needs to be thread-safe, so it is easier to stick with immutable data structures rather than anything like a mutable buffer.
We occasionally need to know the length of a batch, mainly when emitting statsd metrics, e.g. metrics.incrementGoodCount(events.length). Calculating the length of a List is fairly slow compared to other data structures. However, we can work around this by doing a fast check of the input Chunk (val rawCount = chunk.length) and then, if needed, subtracting the length of failed events, because that is hopefully a short list: val goodCount = rawCount - failedEvents.length. In any case, this traversal is only needed once per batch.
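In code, the workaround might look roughly like this (assuming the incoming batch is an fs2 Chunk):

```scala
import fs2.Chunk

// Count once per batch: take the chunk's cheap size, then subtract the
// (hopefully short) list of failures instead of traversing the good list.
def goodCount[A](chunk: Chunk[A], failedEvents: List[A]): Int =
  chunk.size - failedEvents.length
```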
I am starting to believe the perfect data structure for Snowplow streaming apps is a List[List[A]]. It is extremely fast to batch up small batches into larger batches. It is fast to iterate, traverse, or fold, as long as we don't care about the ordering of events within a batch.
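To make that concrete, here is a minimal sketch of the shape such a wrapper could take; the real common-streams ListOfList has a richer API than this:

```scala
// A thin wrapper around List[List[A]]: a batch of batches.
case class ListOfList[A](nested: List[List[A]]) {

  // Combining batches only concatenates the small outer lists;
  // the inner lists are never copied.
  def combine(other: ListOfList[A]): ListOfList[A] =
    ListOfList(nested ::: other.nested)

  // Fold over every element without flattening into a new collection.
  def foldLeft[B](zero: B)(f: (B, A) => B): B =
    nested.foldLeft(zero)((acc, inner) => inner.foldLeft(acc)(f))

  def size: Int = nested.foldLeft(0)(_ + _.length)
}

object ListOfList {
  def ofSingleBatch[A](batch: List[A]): ListOfList[A] =
    ListOfList(List(batch))
}
```

Combining two such values only walks the outer lists, which stay short (roughly one entry per micro-batch), so batching up remains cheap no matter how many events the inner lists contain.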
A big part of my motivation is to stop copying data so many times, i.e. to avoid calling vector.toList or list.toVector or listOfLists.flatten. By using a ListOfList everywhere, I can minimize copying data in a way that naturally fits the flow of the application.