quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Supporting Google PubSub as a source #1107

Open fulmicoton opened 2 years ago

fulmicoton commented 2 years ago

Google PubSub is a little bit different from Kinesis, Kafka, and Pulsar.

First, there is no way to manually handle partitions. The only way to control "which consumer gets what" is by having several topics. Second, there is no real notion of offsets.

Instead, one can create snapshots and resume from them, or seek to a publish timestamp. Snapshots might be very difficult for us to use, as they would require an interaction between the indexer (downstream) and the source (upstream). Using the timestamp as a checkpoint should be ok.

TLDR: We can ack messages right away, store publish timestamps as checkpoints, and seek to the checkpointed timestamp when we restart the source.
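
To make the TLDR concrete, here is a rough sketch of the flow using an in-memory stand-in rather than the real Pub/Sub client. `Message`, `FakeSubscription`, and `Source` are hypothetical names made up for illustration. The fake `seek` assumes seek-to-time semantics (messages published before the timestamp count as acked, the rest as unacked), and that acked messages are retained so they can be replayed, which as far as I understand has to be enabled on a real subscription.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: int
    publish_time: float  # seconds since epoch, assigned by the (fake) server
    data: str


@dataclass
class FakeSubscription:
    """Hypothetical in-memory stand-in for a subscription; not the real client API."""
    retained: list = field(default_factory=list)  # every message, acked or not
    acked: set = field(default_factory=set)       # ids of acked messages

    def publish(self, msg):
        self.retained.append(msg)

    def pull(self, max_messages=10):
        pending = [m for m in self.retained if m.msg_id not in self.acked]
        return pending[:max_messages]

    def ack(self, msg_ids):
        self.acked.update(msg_ids)

    def seek(self, timestamp):
        # Assumed seek-to-time behaviour: messages published before `timestamp`
        # become acked, everything published at or after it becomes unacked.
        self.acked = {m.msg_id for m in self.retained if m.publish_time < timestamp}


class Source:
    """Acks right away and tracks the latest publish timestamp as its checkpoint."""

    def __init__(self, subscription, checkpoint=None):
        self.subscription = subscription
        self.checkpoint = checkpoint
        if checkpoint is not None:
            # On restart, rewind the subscription to the stored checkpoint.
            subscription.seek(checkpoint)

    def poll(self, max_messages=10):
        batch = self.subscription.pull(max_messages)
        if not batch:
            return []
        self.subscription.ack([m.msg_id for m in batch])  # ack right away
        latest = max(m.publish_time for m in batch)
        self.checkpoint = latest if self.checkpoint is None else max(self.checkpoint, latest)
        return [m.data for m in batch]


sub = FakeSubscription()
for i, t in enumerate([100.0, 101.0, 102.0, 103.0]):
    sub.publish(Message(i, t, f"doc-{i}"))

source = Source(sub)
first = source.poll(max_messages=2)  # doc-0 and doc-1 are acked and indexed
saved = source.checkpoint            # pretend this is persisted with the index
source.poll(max_messages=2)          # doc-2 and doc-3 are acked, then we crash before indexing

restarted = Source(sub, checkpoint=saved)
replayed = restarted.poll()
print(first, saved, replayed)
```

In this sketch the documents acked before the crash are not lost, because the restart seeks back to the saved timestamp. The message published exactly at the checkpoint (`doc-1`) is delivered a second time, so this gives at-least-once behaviour rather than exactly-once.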

AyWa commented 1 year ago

@fulmicoton Can I look into it with @sravan-s?

I have used Google PubSub quite a lot, and also NATS, so we could try to implement PubSub first and then NATS if you are fine with that. I can write up a plan and share it with you, especially on how to avoid data loss (ack messages right away, or wait before acking, etc.).

fmassot commented 1 year ago

@magnalite opened PR #3605 for this, though we have not had time to review it yet. At-least-once delivery is not there yet in this PR, but it seems achievable. Exactly-once delivery seems much harder.

AyWa commented 1 year ago

@fmassot Oh I see. Do you think it would be better for us to look at another provider like NATS, or would you like us to help with the Google PubSub effort?

fmassot commented 1 year ago

@AyWa, we are unfamiliar with Google PubSub, and having your opinion on the subject can help move the PR forward. Can you first have a look at the PR? We also discussed this at length with @magnalite on Discord, so don't hesitate to jump in there. You will find valuable information about how we handle distributed indexing.