snowplow / snowplow-rdb-loader

Stores Snowplow enriched events in Redshift, Snowflake and Databricks
Other
31 stars 17 forks source link

Transformer pubsub subscriber needs an extra thread for acking and ack extensions #1322

Closed istreeter closed 11 months ago

istreeter commented 11 months ago

We've seen the pubsub streaming transformer can occasionally produce duplicates, especially around startup/shutdown.

Previously, we were using the MoreExecutors.directExecutor to execute Subscriber callbacks. It was a trick we invented so that we could safely turn off FlowControl without PubSub sending us too many events and causing us to run out of memory.

But, because we block the thread on those callbacks, it meant the Subscriber was not able to do admin tasks like acking and managing ack lease time.

We can continue to block the subscriber, but we need to make sure the subscriber still has a thread available for the admin tasks. I'm hoping that if the subscriber has this dedicated thread for acking and ack extensions, then we should get fewer duplicates in the warehouse.

istreeter commented 11 months ago

Closing this.... It is not needed. My description above is wrong and was based on a misunderstanding.