Open rdettai opened 2 weeks ago
Average search latency is 1.01x that of the reference (lower is better).
Ref run id: 2337, ref commit: 4ade7b5ec59685fc508c1834f2e23b5ca7b5afbe
Average search latency is 0.981x that of the reference (lower is better).
Ref run id: 2339, ref commit: 4ade7b5ec59685fc508c1834f2e23b5ca7b5afbe
We need different handling for transient vs non-transient errors. E.g. an error in the message parsing is non-transient, a disconnection while streaming the file is transient, and gzip corruption is non-transient.
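One way such a classification could be sketched (the `Transience` enum and the error variants here are hypothetical illustrations, not the actual Quickwit types):

```rust
// Hypothetical sketch: classify processing errors so the source can decide
// whether to retry the message (transient) or give up on it (non-transient).
#[derive(Debug, PartialEq)]
enum Transience {
    Transient,
    NonTransient,
}

#[derive(Debug)]
enum ProcessingError {
    // The queue message itself could not be parsed: retrying won't help.
    MessageParsing(String),
    // The connection dropped while streaming the file: worth retrying.
    Disconnection(String),
    // The payload is corrupt (e.g. broken gzip stream): retrying won't help.
    GzipCorruption(String),
}

impl ProcessingError {
    fn transience(&self) -> Transience {
        match self {
            ProcessingError::MessageParsing(_) => Transience::NonTransient,
            ProcessingError::Disconnection(_) => Transience::Transient,
            ProcessingError::GzipCorruption(_) => Transience::NonTransient,
        }
    }
}

fn main() {
    let err = ProcessingError::Disconnection("reset by peer".to_string());
    assert_eq!(err.transience(), Transience::Transient);
    let err = ProcessingError::GzipCorruption("bad gzip header".to_string());
    assert_eq!(err.transience(), Transience::NonTransient);
}
```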
Note: I just stumbled upon https://github.com/quickwit-oss/quickwit/issues/1065, which is addressed as part of the reorganization of the FileSource that is happening here.
Description
Description
This PR proposes a generic implementation of a "queue" source. For now, only an implementation for AWS SQS with its data backed by AWS S3 is exposed to users. Google Pub/Sub as the queue implementation, or inlined data (i.e. messages containing the data itself rather than a link to the object store), will come next.
We use the shard API to provide deduplication of messages. For the current implementation, where the source data is stored on S3, deduplication is performed on the object URI.
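As a simplified illustration of URI-based deduplication (the `SharedState` type and its methods here are placeholders, not the actual shard-API types): the object URI serves as the partition id, so a message referencing an already-acquired URI can be skipped.

```rust
use std::collections::HashSet;

// Illustrative sketch: deduplicate queue messages by the S3 object URI they
// point to. In the actual implementation the shared state lives behind the
// shard API; a HashSet stands in for it here.
struct SharedState {
    acquired_uris: HashSet<String>,
}

impl SharedState {
    fn new() -> Self {
        SharedState { acquired_uris: HashSet::new() }
    }

    // Returns true if the URI was newly acquired, false if it is a duplicate.
    fn try_acquire(&mut self, object_uri: &str) -> bool {
        self.acquired_uris.insert(object_uri.to_string())
    }
}

fn main() {
    let mut state = SharedState::new();
    assert!(state.try_acquire("s3://bucket/logs/2024-07-01.json.gz"));
    // The same object delivered twice (SQS is at-least-once) is skipped.
    assert!(!state.try_acquire("s3://bucket/logs/2024-07-01.json.gz"));
}
```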
High level summary of the abstractions that are part of the generic implementation:
- `Processor`: exposes the exact same methods as the `Source` trait but does not implement it directly. Instead, the concrete queue sources (e.g. `SqsSource`) wrap the `Processor`.
- `RawMessage`: the message as received from the queue
- `PreProcessedPayload`: the message went through the minimal transformation to discover its partition id
- `CheckpointedMessage`: the message was checked against the shared state (shard API); it is now ready to be processed
- `InProgressMessage`: the message that is actively being read
- `QueueSharedState`: an abstraction over the shard API. By calling `open_shard` upon reception of the messages, we avoid costly redundant processing when receiving a duplicate message.
- `QueueLocalState`: represents the state machine of the messages as they are processed by the indexing pipeline
- `VisibilityTaskHandle`: a task that extends the message visibility when required (needs to be reworked)
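The message lifecycle above can be sketched in a typestate style, where each stage is a distinct type and each transition consumes the previous one. The fields and transition logic below are simplified placeholders, not the actual implementation:

```rust
// Illustrative typestate sketch of the message lifecycle: each processing
// stage is its own type, so a message cannot skip a stage at compile time.
struct RawMessage {
    body: String,
}

struct PreProcessedPayload {
    object_uri: String, // doubles as the partition id
}

struct CheckpointedMessage {
    object_uri: String,
}

impl RawMessage {
    // Minimal transformation: discover the partition id (here, the object URI).
    fn pre_process(self) -> PreProcessedPayload {
        PreProcessedPayload { object_uri: self.body.trim().to_string() }
    }
}

impl PreProcessedPayload {
    // Check against the shared state (shard API); None means duplicate.
    fn checkpoint(self, already_acquired: bool) -> Option<CheckpointedMessage> {
        if already_acquired {
            None
        } else {
            Some(CheckpointedMessage { object_uri: self.object_uri })
        }
    }
}

fn main() {
    let raw = RawMessage { body: " s3://bucket/key.json ".to_string() };
    let checkpointed = raw.pre_process().checkpoint(false);
    assert_eq!(checkpointed.unwrap().object_uri, "s3://bucket/key.json");
}
```

The design choice this illustrates: by making each stage a separate type, "process a message that was never checkpointed" becomes unrepresentable rather than a runtime bug.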
TODO:
- `Processor` abstraction
- `Queue` trait
- `open_shard` API to accept `publish_token` as a field. This gives upsert semantics to the API, which makes it possible to acquire the shard upon creation (`SourceConfig.use_shard_api()`)
- `Processor` abstraction
- `ShardState` abstraction
- `SqsSource` (with some small refactoring to reuse the `setup_index` helper from the `KafkaSource`)
- `Publisher` actor

TODO in subsequent PRs:
How was this PR tested?
This PR contains unit tests and higher-level tests that use LocalStack.