snowplow / stream-collector

Collector for cloud-native web, mobile and event analytics, running on AWS and GCP
http://snowplowanalytics.com

Scala Stream Collector: implement write ahead log #13

Open BenFradet opened 6 years ago

BenFradet commented 6 years ago

If the streaming technology in use (e.g. PubSub or Kinesis) is unavailable, the collector keeps accumulating raw events in memory.

Those raw events should instead be flushed to a write-ahead log on disk so they can be recovered later.
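
To make the idea concrete, here is a minimal sketch of such a log, written in Python for brevity rather than in the collector's Scala. Everything in it is an assumption for illustration: the `WriteAheadLog` class, the one-record-per-line base64 file layout, and the `publish` callback (which returns `True` on success) are hypothetical and not part of the collector.

```python
import base64
import os


class WriteAheadLog:
    """Hypothetical append-only log of raw events, one base64 record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, event: bytes) -> None:
        # Base64 keeps each record on one line even if the payload contains newlines.
        with open(self.path, "ab") as f:
            f.write(base64.b64encode(event) + b"\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to persist the record before returning

    def replay(self, publish) -> int:
        """Re-publish logged events in order; return how many were published."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, "rb") as f:
            events = [base64.b64decode(line.strip()) for line in f if line.strip()]
        # Events whose publish call returns False are kept for the next replay.
        remaining = [event for event in events if not publish(event)]
        # Rewrite the log through a temp file so a crash cannot leave it half-written.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            for event in remaining:
                f.write(base64.b64encode(event) + b"\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return len(events) - len(remaining)
```

The collector would call `append` whenever the sink is unreachable and `replay` once it is back. This sketch ignores concurrency, log rotation and zero-length events, all of which a real implementation would have to handle.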

alexanderdean commented 6 years ago

Nice idea. Need to consider what "to disk" means in a container-world...

benjben commented 5 years ago

To add a bit of context: on the rare occasions when the streams cannot be published to (a Kinesis or PubSub outage), data can be lost. We can increase collection reliability during stream failures by adding a mechanism that stores failed events once the maximum number of retries is reached (e.g. in S3, GCS or RocksDB) and retries publishing them later.
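
A rough sketch of that flow, in Python for brevity, might look like the following. The names `publish_with_fallback` and `drain_fallback` are made up for this example, `publish` stands in for the Kinesis or PubSub client call and is assumed to raise on failure, and a plain list stands in for the fallback store (S3, GCS, RocksDB, etc.).

```python
import time


def publish_with_fallback(event, publish, fallback, max_retries=3, backoff_s=0.0):
    """Try to publish; after max_retries failed attempts, keep the event in fallback."""
    for attempt in range(max_retries):
        try:
            publish(event)
            return True
        except Exception:
            # Simplification: every error is treated as retryable.
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    fallback.append(event)  # retries exhausted: store the event instead of dropping it
    return False


def drain_fallback(publish, fallback):
    """Retry stored events once each; return how many were published."""
    pending = list(fallback)
    fallback.clear()
    for event in pending:
        # Events that fail again are appended back to the fallback store.
        publish_with_fallback(event, publish, fallback, max_retries=1)
    return len(pending) - len(fallback)
```

`drain_fallback` would be called periodically, or once the stream is known to be healthy again. Note that events recovered this way can reach the stream late and out of order.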

Collection outages are very uncommon, but we want to do everything we can to mitigate the impact.