snowplow-incubator / common-streams

Use ListOfList as primary data structure for streaming apps #31

Closed istreeter closed 9 months ago

istreeter commented 9 months ago

In Snowplow stream-processing apps, we often pass around types like List[Event], Vector[Event], or Chunk[Event]. But which collection is best? Having now built several apps from the common-streams libraries, it is becoming clear what features we need from the data structure:

I am starting to believe the ideal data structure for Snowplow streaming apps is a List[List[A]]. Combining small batches into larger batches is extremely fast, because each small batch is added to the outer list by reference rather than copied element by element. Iterating, traversing, and folding are also fast, as long as we don't care about the ordering of events within a batch.
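To make the idea concrete, here is a minimal sketch of such a structure. The wrapper and its method names (`prepend`, `foldLeft`, `map`, `size`) are illustrative assumptions, not the actual common-streams API; it only shows why batching is cheap and why folding does not need a flatten step.

```scala
// Hypothetical minimal wrapper around List[List[A]]; an illustration only,
// not the implementation used in common-streams.
final case class ListOfList[A](value: List[List[A]]) {

  // Constant time: the outer list gains one cell that points at the
  // existing batch. No elements of the batch are copied.
  def prepend(batch: List[A]): ListOfList[A] =
    ListOfList(batch :: value)

  // Visits every element without building a flattened collection first.
  // Batches are visited newest-first, because prepend adds to the front.
  def foldLeft[B](init: B)(f: (B, A) => B): B =
    value.foldLeft(init)((acc, batch) => batch.foldLeft(acc)(f))

  // Transforms each element while keeping the nested shape.
  def map[B](f: A => B): ListOfList[B] =
    ListOfList(value.map(_.map(f)))

  // Total number of elements across all batches.
  def size: Int = foldLeft(0)((n, _) => n + 1)
}

// Usage: combine two small batches, then fold over all elements.
val batches = ListOfList[Int](Nil).prepend(List(1, 2)).prepend(List(3, 4, 5))
val total = batches.foldLeft(0)(_ + _)
```

Note that `prepend` puts the newest batch at the front, so this sketch only suits code that does not depend on the order in which batches are visited.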

A big part of my motivation is to stop copying data so many times, i.e. to avoid calling vector.toList, list.toVector, or listOfLists.flatten. Using a ListOfList everywhere seems to minimize copying in a way that fits naturally with the flow of the application.
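As a rough illustration of the copying being avoided, the sketch below uses only the Scala standard library. The names `small1`, `small2`, and `countMatching` are made up for the example. Consing batches onto an outer list reuses the original batch objects, and a lazy iterator can walk every element without materializing a flattened list.

```scala
// Two small batches, as they might arrive from upstream (example data).
val small1 = List("a", "b")
val small2 = List("c", "d", "e")

// Combining by consing: the outer list holds references to the same
// batch objects, so no elements are copied.
val combined: List[List[String]] = small1 :: small2 :: Nil
val sharesFirstBatch = combined.head eq small1

// By contrast, flatten allocates a new list containing every element.
val flattened: List[String] = combined.flatten

// Hypothetical helper: counts matching elements by iterating the nested
// structure lazily, without building an intermediate flattened list.
def countMatching[A](batches: List[List[A]])(p: A => Boolean): Int =
  batches.iterator.flatMap(_.iterator).count(p)

val vowels = countMatching(combined)(s => s == "a" || s == "e")
```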