streambed / streambed-rs

Event driven services toolkit
Apache License 2.0

Tolerate more topic corruption #42

Closed huntc closed 1 year ago

huntc commented 1 year ago

There are three changes here. First, when we initially produce to a topic, we now handle an error returned by the find_offsets function instead of ignoring it. Second, we stay subscribed to a topic on the assumption that it will be recovered. Third, we now exhaust decoding of what is already buffered before reading more into our buffers; otherwise memory can grow.
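To illustrate the third change, here is a minimal sketch of draining a buffer before reading more. It is not the code from this PR: the names `try_decode` and `read_records` are hypothetical, and the record layout (a 4-byte big-endian length followed by the payload) is an assumption made only for the example.

```rust
use std::io::{self, Read};

/// Hypothetical decoder. Assumed layout (not streambed's real format):
/// a 4-byte big-endian payload length followed by the payload.
/// Returns the next payload and removes its bytes from `buf`,
/// or `None` if `buf` does not yet hold a complete record.
fn try_decode(buf: &mut Vec<u8>) -> Option<Vec<u8>> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if buf.len() - 4 < len {
        return None;
    }
    let payload = buf[4..4 + len].to_vec();
    buf.drain(..4 + len);
    Some(payload)
}

/// Reads `source` in chunks of `chunk_size` bytes and decodes every record.
pub fn read_records<R: Read>(mut source: R, chunk_size: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut buf = Vec::new();
    let mut chunk = vec![0u8; chunk_size];
    let mut records = Vec::new();
    loop {
        // Decode everything that is already buffered before reading again,
        // so `buf` never holds more than one partial record plus one chunk.
        while let Some(record) = try_decode(&mut buf) {
            records.push(record);
        }
        let n = source.read(&mut chunk)?;
        if n == 0 {
            break; // end of input; any trailing partial record is left undecoded
        }
        buf.extend_from_slice(&chunk[..n]);
    }
    Ok(records)
}
```

If the inner `while let` were a single `if let`, each read could add more bytes than one decode removes, and the buffer would keep growing on a busy topic.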

The find_offsets function returns an error if it can't decode a record (amongst other things). The new logic detects this error and truncates the active file to the length up to which records can be decoded. It then attempts another pass with find_offsets. If this second pass fails, any producer will receive a CannotProduce error. I also discovered that previously only the first producer to a topic would receive this error, so this is an improvement, as there can be multiple producers to the same topic.
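The truncate-and-retry flow described above might look roughly like the following sketch. Only the names find_offsets and CannotProduce come from this description; the signatures, the `DecodeError` type, the `recover_active_file` helper and the length-prefixed record layout are all assumptions for illustration and do not reflect streambed's real API or file format.

```rust
use std::fs::{self, OpenOptions};
use std::path::Path;

/// Hypothetical error type; only the variant name comes from the PR text.
#[derive(Debug, PartialEq)]
pub enum ProducerError {
    CannotProduce,
}

/// Hypothetical scan error: the number of leading bytes that decoded cleanly.
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub valid_len: u64,
}

/// Stand-in for find_offsets. Assumed layout (not the real format):
/// a 4-byte big-endian payload length followed by the payload.
/// Returns the starting offset of each record.
pub fn find_offsets(bytes: &[u8]) -> Result<Vec<u64>, DecodeError> {
    let mut offsets = Vec::new();
    let mut pos = 0usize;
    while pos < bytes.len() {
        let remaining = bytes.len() - pos;
        if remaining < 4 {
            return Err(DecodeError { valid_len: pos as u64 });
        }
        let len = u32::from_be_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]])
            as usize;
        if remaining - 4 < len {
            // The payload is cut short, so this record cannot be decoded.
            return Err(DecodeError { valid_len: pos as u64 });
        }
        offsets.push(pos as u64);
        pos += 4 + len;
    }
    Ok(offsets)
}

/// First pass; on a decode error, truncate to the decodable length and try once more.
pub fn recover_active_file(path: &Path) -> Result<Vec<u64>, ProducerError> {
    let bytes = fs::read(path).map_err(|_| ProducerError::CannotProduce)?;
    match find_offsets(&bytes) {
        Ok(offsets) => Ok(offsets),
        Err(DecodeError { valid_len }) => {
            let file = OpenOptions::new()
                .write(true)
                .open(path)
                .map_err(|_| ProducerError::CannotProduce)?;
            file.set_len(valid_len)
                .map_err(|_| ProducerError::CannotProduce)?;
            drop(file);
            // Second pass: any failure here is surfaced to every producer.
            let bytes = fs::read(path).map_err(|_| ProducerError::CannotProduce)?;
            find_offsets(&bytes).map_err(|_| ProducerError::CannotProduce)
        }
    }
}
```

With this toy format the second pass can only fail on an I/O error; the real find_offsets can fail for other reasons, which is why the second pass still needs an error path.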

Subscriptions now behave in a similar fashion to when our Kafka flavour of the commit log can't connect: they just keep trying, on the assumption that something will recover the topic.
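The keep-trying behaviour can be sketched as a simple retry loop. This is a blocking, std-only simplification with hypothetical names (`subscribe_with_retry`, `open`); the actual subscription code is not shown in this description and will differ.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: call `open` until it succeeds, sleeping for `delay`
/// between attempts. Errors are treated as transient on the assumption that
/// the topic will be recovered by something else.
pub fn subscribe_with_retry<T, E, F>(mut open: F, delay: Duration) -> T
where
    F: FnMut() -> Result<T, E>,
{
    loop {
        match open() {
            Ok(subscription) => return subscription,
            // Stay subscribed: wait and try again rather than giving up.
            Err(_) => thread::sleep(delay),
        }
    }
}
```

The loop never gives up by design, so a caller that needs cancellation would have to add its own stop condition.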

One case not addressed directly by the changes here is when anything other than the active file becomes corrupt. I'm not sure that can happen though, and we do have logic from a previous commit that allows recovery from incomplete compaction.

TODO: