Open fulmicoton opened 2 years ago
@fulmicoton Can I have a look into it with @sravan-s ?
I used Google pubsub quite a lot before, and also Nats. So we could try to implements pubsub then Nats if you are good with it. Maybe I can make a plan and share with you (especially on the way to not have data loss / ack message right away or wait before ack etc)
@magnalite opened the PR #3605 on this. We did not have the time to review it though. In this PR, at-least-once delivery is not yet there, but it seems achievable. Exactly-once delivery seems way harder.
@fmassot oh I see, so do you think it is better we have a look at an other provider like NATS? or you want us to try to help on the Google pubsub effort?
@AyWa, we are unfamiliar with Google Pub Sub, and having your opinion on the subject can help move forward with the PR. Can you first have a look at the PR? We also discussed a lot with @magnalite on Discord, don't hesitate to jump in there, as you will find valuable information about how we handle distributed indexing.
Google PubSub is a little bit different from Kinesis, Kafka, Pulsar.
First there are no way to manuallly handle partitions. The only way to control "which consumer gets what" is by having several topics. Second, there are no real notion of offsets.
Instead, one can create and resume from snaphosts, or by publish timestamps. Snapshots might be very difficult for us to use, as it would require an interaction between the indexer (downstream) and the source (upstream). Using the timestamp as a checkpoint should be ok.
TLDR: We can ack message right away, store timestamps as checkpoints, and seek to timestamp when we restart the source.