snowplow / snowbridge

For replicating streams across clouds, accounts and regions

Update kinsumer fork to 1.3.0 #73

Closed: colmsnowplow closed this issue 2 years ago

colmsnowplow commented 3 years ago

In load tests for 0.5.0, an edge case occurred whereby some instances of the app (when deployed on ECS with fanned-out scaling) would hang after a prolonged period of sustained target failures (for example, with an EventHub target, this happens when EH throttles us).

The symptoms of the problem are that the observer reports all 0s and logs debug messages about the observer timing out. The affected instance processes no data and appears to have stopped reading. Eventually we get a shard error from kinsumer, which looks like this:

2021-08-07 02:31:00.855 time="2021-08-07T02:31:00Z" level=error msg="Failed to pull next Kinesis record from Kinsumer client: shard error (shardId-000000000007) in checkpointer.release: error releasing checkpoint: ConditionalCheckFailedException: The conditional request failed" error="Failed to pull next Kinesis record from Kinsumer client: shard error (shardId-000000000007) in checkpointer.release: error releasing checkpoint: ConditionalCheckFailedException: The conditional request failed"

We suspect that this error is not necessarily the cause of the issue; rather, it occurs when another kinsumer consumer claims ownership of the shard via the DynamoDB table.
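For illustration, here is a minimal sketch of how such an error arises (this is not kinsumer's actual code, and the table and attribute names are assumptions): a checkpoint release guarded by an ownership condition in DynamoDB fails with ConditionalCheckFailedException as soon as another consumer has claimed the shard row.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// releaseCheckpoint releases ownership of a shard row, but only if this
// consumer still owns it. Table and attribute names are illustrative;
// kinsumer's real schema may differ.
func releaseCheckpoint(db *dynamodb.DynamoDB, table, shardID, ownerID, seq string) error {
	_, err := db.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]*dynamodb.AttributeValue{
			"Shard": {S: aws.String(shardID)},
		},
		// The conditional part: only touch the row if we still own it.
		ConditionExpression: aws.String("OwnerID = :owner"),
		UpdateExpression:    aws.String("SET SequenceNumber = :seq REMOVE OwnerID"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":owner": {S: aws.String(ownerID)},
			":seq":   {S: aws.String(seq)},
		},
	})
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
		// Another consumer claimed the shard in the meantime; this surfaces
		// as "error releasing checkpoint: ConditionalCheckFailedException".
		return fmt.Errorf("error releasing checkpoint: %w", err)
	}
	return err
}

func main() {
	// Requires AWS credentials and a real table to actually run.
	db := dynamodb.New(session.Must(session.NewSession()))
	if err := releaseCheckpoint(db, "checkpoints", "shardId-000000000007", "consumer-a", "example-sequence-number"); err != nil {
		fmt.Println(err)
	}
}
```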

Since the issue seems to occur only when the target is consistently failing, there is good reason to suspect that checkpointing has something to do with it.

We ran a test with logging added at many different points in the app, and discovered that it appears to hang on this line: https://github.com/snowplow-devops/stream-replicator/blob/dfcb5c438506d2c7d11075c6678fb441d6855346/pkg/source/kinesis.go#L106

However, putting a timeout on this action and then rebooting the app fails to resolve the issue. On reboot, the app goes straight back to hanging on the same line and enters a never-ending cycle of reboots.
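For context, the timeout we tried was roughly along these lines (a minimal sketch, not the actual patch). It detects the hang but cannot recover from it, because on restart the app blocks on the same call again.

```go
package main

import (
	"fmt"
	"time"
)

// nextWithTimeout runs a blocking read in a goroutine and gives up after a
// deadline. Note the blocked goroutine leaks; the timeout only detects the
// hang, it does not unblock the underlying call.
func nextWithTimeout(read func() ([]byte, error), timeout time.Duration) ([]byte, error) {
	type result struct {
		record []byte
		err    error
	}
	done := make(chan result, 1)
	go func() {
		rec, err := read() // e.g. the kinsumer read the app hangs on
		done <- result{rec, err}
	}()
	select {
	case res := <-done:
		return res.record, res.err
	case <-time.After(timeout):
		return nil, fmt.Errorf("timed out after %s waiting for the next Kinesis record", timeout)
	}
}

func main() {
	// Simulate a read that never returns, like the hang described above.
	hung := func() ([]byte, error) { select {} }
	if _, err := nextWithTimeout(hung, 2*time.Second); err != nil {
		fmt.Println(err)
	}
}
```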

This PR in the kinsumer library, https://github.com/twitchscience/kinsumer/pull/63, is not included in our fork, so it may have something to do with the issue. Additionally, kinsumer's own logging is quite sparse. I'm hopeful of learning more about the issue by updating our fork of kinsumer with the latest upstream commit, and by adding better logging to the relevant parts of the fork.

colmsnowplow commented 2 years ago

Update - extensive testing and debugging revealed many related stability issues, which are now fixed in v1.3.0 of our fork of kinsumer.

The result is somewhat degraded performance, but in exchange we get no more unexpected errors, the eradication of the race conditions, and a reduced likelihood of duplicates.

Our issues here should be resolved by updating to v1.3.0.
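For reference, one common way for a consuming project to pick up the updated fork is a replace directive in go.mod (a sketch only; the module paths below are assumptions, so check the repository's actual go.mod):

```
// Hypothetical go.mod excerpt: point the upstream module path at the fork,
// pinned to the v1.3.0 tag.
replace github.com/twitchscience/kinsumer => github.com/snowplow-devops/kinsumer v1.3.0
```

If the fork is imported by its own module path instead, a plain require of that path at v1.3.0 achieves the same thing.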