quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Allow disabling retries in indexing loop or stopping indexing via Lambda handler #5179

Open alexkreidler opened 1 week ago

alexkreidler commented 1 week ago

Is your feature request related to a problem? Please describe.

I mistakenly passed an invalid value to the quickwit-lambda INDEX_CONFIG_URI environment variable because I thought it accepted HTTP URIs, not just filesystem URIs. The Lambda function then kept retrying in an exponential backoff loop for the full 15 minutes (even after I deleted the function a few minutes in).

Logs

```
INIT_START Runtime Version: provided:al2.v37 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:bc2882fd0e085da713a4e150009e80c93e37aef25d53897e472ddda5ffbd589d
START RequestId: 27d083f2-95f0-4bca-ae05-db5639e8c6d9 Version: $LATEST
2024-06-28T04:59:25.108Z INFO Lambda runtime invoke:indexer_handler: quickwit_telemetry::sender: telemetry to https://telemetry.quickwit.io/ is enabled requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.124Z INFO Lambda runtime invoke:indexer_handler: quickwit_config::node_config::serialize: using listen address `127.0.0.1` as advertise address advertise_address=127.0.0.1 requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.124Z WARN Lambda runtime invoke:indexer_handler: quickwit_config::node_config::serialize: peer seeds are empty requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.124Z INFO Lambda runtime invoke:indexer_handler: quickwit_lambda::utils: loaded node config config=NodeConfig { cluster_id: "lambda-ephemeral", node_id: "lambda-indexer", enabled_services: {Metastore, Janitor, Searcher, ControlPlane, Indexer}, gossip_listen_addr: 127.0.0.1:7280, grpc_listen_addr: 127.0.0.1:7281, gossip_advertise_addr: 127.0.0.1:7280, grpc_advertise_addr: 127.0.0.1:7281, gossip_interval: 1s, peer_seeds: [], data_dir_path: "/tmp", metastore_uri: Uri { uri: "s3://my-quickwit-index/index" }, default_index_root_uri: Uri { uri: "s3://my-quickwit-index/index" }, rest_config: RestConfig { listen_addr: 127.0.0.1:7280, cors_allow_origins: [], extra_headers: {} }, grpc_config: GrpcConfig { max_message_size: 21.0 MB }, storage_configs: StorageConfigs([]), metastore_configs: MetastoreConfigs([]), indexer_config: IndexerConfig { split_store_max_num_bytes: 107.4 GB, split_store_max_num_splits: 1000, max_concurrent_split_uploads: 12, max_merge_write_throughput: None, merge_concurrency: 1, enable_otlp_endpoint: true, enable_cooperative_indexing: false, cpu_capacity: CpuCapacity(2000) }, searcher_config: SearcherConfig { aggregation_memory_limit: 500.0 MB, aggregation_bucket_limit: 65000, fast_field_cache_capacity: 1000.0 MB, split_footer_cache_capacity: 500.0 MB, partial_request_cache_capacity: 64.0 MB, max_num_concurrent_split_searches: 100, max_num_concurrent_split_streams: 100, split_cache: None }, ingest_api_config: IngestApiConfig { max_queue_memory_usage: 2.1 GB, max_queue_disk_usage: 4.3 GB, replication_factor: 1, content_length_limit: 10.5 MB }, jaeger_config: JaegerConfig { enable_endpoint: true, lookback_period_hours: 72, max_trace_duration_secs: 3600, max_fetch_spans: 10000 } } requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.179Z INFO Lambda runtime invoke:indexer_handler:lazy_load_credentials: aws_credential_types::cache::lazy_caching: credentials cache miss occurred; added new AWS credentials (took 20.557µs) requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.249Z INFO Lambda runtime invoke:indexer_handler: quickwit_lambda::indexer::ingest::helpers: Index not found, creating it index_id="test-index" index_config_uri="s3://my-quickwit-index-config/index-config.yaml" requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.253Z INFO Lambda runtime invoke:indexer_handler:lazy_load_credentials: aws_credential_types::cache::lazy_caching: credentials cache miss occurred; added new AWS credentials (took 11.862µs) requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.283Z INFO Lambda runtime invoke:indexer_handler: quickwit_config::index_config::serialize: index config does not specify `index_uri`, falling back to default value index_id=test-index index_uri=s3://my-quickwit-index/index/test-index requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.361Z INFO Lambda runtime invoke:indexer_handler: quickwit_lambda::indexer::ingest::helpers: index created requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.361Z INFO Lambda runtime invoke:indexer_handler: quickwit_cluster::cluster: joining cluster cluster_id=lambda-ephemeral node_id=lambda-indexer generation_id=1719550765361866605 enabled_services={Indexer, Janitor} gossip_listen_addr=127.0.0.1:7280 gossip_advertise_addr=127.0.0.1:7280 grpc_advertise_addr=127.0.0.1:7281 peer_seed_addrs= requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.361Z INFO Lambda runtime invoke:indexer_handler: chitchat::server: initial_seed_addrs={} requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.363Z INFO Lambda runtime invoke:indexer_handler: quickwit_janitor: starting janitor service requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.363Z WARN Lambda runtime invoke:indexer_handler: quickwit_janitor: delete task service is disabled: delete queries will not be processed requestId="27d083f2-95f0-4bca-ae05-db5639e8c6d9" xrayTraceId="Root=1-667e432c-144d1373649ff6be11253a6c;Parent=04db4f312ff1b20e;Sampled=0;Lineage=8e5c6e72:0" request_id="27d083f2-95f0-4bca-ae05-db5639e8c6d9"
2024-06-28T04:59:25.370Z INFO quickwit_cluster::change: node `lambda-indexer` has joined the cluster node_id=lambda-indexer generation_id=1719550765361866605
2024-06-28T04:59:25.389Z INFO quickwit_janitor::actors::garbage_collector: loaded 1 indexes from the metastore
2024-06-28T04:59:25.396Z INFO spawn_pipeline: quickwit_indexing::actors::indexing_pipeline: spawning indexing pipeline index_id="test-index" source_id="_ingest-lambda-source" pipeline_uid=00000000000000000000000000 root_dir=/tmp/indexing/test-index%01J1EKCT7AZP9QBSQCMJ78Q50R%_ingest-lambda-source%00000000000000000000000000%QqVof6 index=test-index gen=0
2024-06-28T04:59:25.397Z ERROR quickwit_indexing::actors::indexing_pipeline: error while spawning indexing pipeline, retrying after some time error=failed to create source `_ingest-lambda-source` of type `file`. Cause: unknown URI protocol `https` Caused by: unknown URI protocol `https` retry_count=0 retry_delay=2s
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::doc_processor::DocProcessor-fragrant-M0AB"
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-fragrant-M0AB exit_status=success
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Indexer-morning-FaT1 exit_status=success
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::index_serializer::IndexSerializer-fragrant-wrd5"
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::index_serializer::IndexSerializer-fragrant-wrd5 exit_status=success
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: no more messages actor="Packager-summer-vkFg"
2024-06-28T04:59:25.397Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Packager-summer-vkFg exit_status=success
2024-06-28T04:59:25.397Z INFO spawn_merge_pipeline: quickwit_indexing::actors::merge_pipeline: spawning merge pipeline index_id=test-index source_id=_ingest-lambda-source pipeline_uid=00000000000000000000000000 root_dir=/tmp/indexing/test-index%01J1EKCT7AZP9QBSQCMJ78Q50R%_ingest-lambda-source%00000000000000000000000000%QqVof6 merge_policy=StableLogMergePolicy { config: StableLogMergePolicyConfig { min_level_num_docs: 100000, merge_factor: 10, max_merge_factor: 12, maturation_period: 172800s }, split_num_docs_target: 10000000 } index="test-index" gen=0
2024-06-28T04:59:25.397Z INFO spawn_merge_pipeline: quickwit_indexing::actors::merge_pipeline: loaded list of published splits num_splits=0 index="test-index" gen=0
2024-06-28T04:59:25.398Z INFO quickwit_janitor::actors::retention_policy_executor: loaded 1 indexes from the metastore
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: no more messages actor="IndexUploader-floral-JC4b"
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=IndexUploader-floral-JC4b exit_status=success
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::sequencer::Sequencer-white-kOmg"
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::sequencer::Sequencer-white-kOmg exit_status=success
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: no more messages actor="Publisher-icy-5uDS"
2024-06-28T04:59:25.400Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Publisher-icy-5uDS exit_status=success
2024-06-28T04:59:27.399Z INFO spawn_pipeline: quickwit_indexing::actors::indexing_pipeline: spawning indexing pipeline index_id="test-index" source_id="_ingest-lambda-source" pipeline_uid=00000000000000000000000000 root_dir=/tmp/indexing/test-index%01J1EKCT7AZP9QBSQCMJ78Q50R%_ingest-lambda-source%00000000000000000000000000%QqVof6 index=test-index gen=0
2024-06-28T04:59:27.399Z ERROR quickwit_indexing::actors::indexing_pipeline: error while spawning indexing pipeline, retrying after some time error=failed to create source `_ingest-lambda-source` of type `file`. Cause: unknown URI protocol `https` Caused by: unknown URI protocol `https` retry_count=1 retry_delay=4s
2024-06-28T04:59:27.399Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::doc_processor::DocProcessor-hidden-TDQv"
2024-06-28T04:59:27.399Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-hidden-TDQv exit_status=success
2024-06-28T05:07:55.412Z INFO spawn_pipeline: quickwit_indexing::actors::indexing_pipeline: spawning indexing pipeline index_id="test-index" source_id="_ingest-lambda-source" pipeline_uid=00000000000000000000000000 root_dir=/tmp/indexing/test-index%01J1EKCT7AZP9QBSQCMJ78Q50R%_ingest-lambda-source%00000000000000000000000000%QqVof6 index=test-index gen=0
2024-06-28T05:07:55.413Z ERROR quickwit_indexing::actors::indexing_pipeline: error while spawning indexing pipeline, retrying after some time error=failed to create source `_ingest-lambda-source` of type `file`. Cause: unknown URI protocol `https` Caused by: unknown URI protocol `https` retry_count=8 retry_delay=512s
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::doc_processor::DocProcessor-young-hC2f"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-young-hC2f exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Indexer-damp-crmm exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::index_serializer::IndexSerializer-cold-8pzg"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::index_serializer::IndexSerializer-cold-8pzg exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="Packager-sparkling-RjAR"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Packager-sparkling-RjAR exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="IndexUploader-lingering-ASZJ"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=IndexUploader-lingering-ASZJ exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="quickwit_indexing::actors::sequencer::Sequencer-restless-FbUq"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::sequencer::Sequencer-restless-FbUq exit_status=success
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: no more messages actor="Publisher-hidden-GbyG"
2024-06-28T05:07:55.413Z INFO quickwit_actors::spawn_builder: actor-exit actor_id=Publisher-hidden-GbyG exit_status=success
2024-06-28T05:09:25.406Z INFO quickwit_janitor::actors::garbage_collector: loaded 1 indexes from the metastore
2024-06-28T05:14:25.147Z 27d083f2-95f0-4bca-ae05-db5639e8c6d9 Task timed out after 900.06 seconds
END RequestId: 27d083f2-95f0-4bca-ae05-db5639e8c6d9
REPORT RequestId: 27d083f2-95f0-4bca-ae05-db5639e8c6d9 Duration: 900055.97 ms Billed Duration: 900089 ms Memory Size: 3008 MB Max Memory Used: 51 MB Init Duration: 88.76 ms
INIT_START Runtime Version: provided:al2.v37 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:bc2882fd0e085da713a4e150009e80c93e37aef25d53897e472ddda5ffbd589d
```
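The three ERROR lines in the log report `retry_delay` values of 2s, 4s, and 512s for `retry_count` 0, 1, and 8, which is consistent with a delay that doubles on every retry. The sketch below reproduces that schedule to show why the invocation ran into the 900-second timeout. The formula in `retry_delay_secs` is inferred from those three log lines only and is an assumption, not Quickwit's actual code; both function names are made up for this example.

```rust
/// Delay in seconds before the next attempt, inferred from the log
/// (retry_count=0 -> 2s, 1 -> 4s, 8 -> 512s). Assumption: 2^(retry_count + 1).
fn retry_delay_secs(retry_count: u32) -> u64 {
    2u64 << retry_count
}

/// Number of spawn attempts that start within `budget_secs`, assuming every
/// attempt fails immediately (as it does in the log).
fn attempts_within(budget_secs: u64) -> u32 {
    let mut elapsed = 0u64;
    let mut attempts = 0u32;
    while elapsed <= budget_secs {
        // An attempt starts at `elapsed`, fails, and waits for its delay.
        elapsed += retry_delay_secs(attempts);
        attempts += 1;
    }
    attempts
}

fn main() {
    let timeout_secs = 900; // Lambda timeout reported in the log
    let mut elapsed = 0u64;
    let mut retry_count = 0u32;
    while elapsed <= timeout_secs {
        let delay = retry_delay_secs(retry_count);
        println!("t={elapsed}s retry_count={retry_count} retry_delay={delay}s");
        elapsed += delay;
        retry_count += 1;
    }
    println!("next attempt would start at t={elapsed}s, after the timeout");
}
```

Under this assumption the attempt with `retry_count=8` starts 510 seconds after the first one, which matches the gap between the 04:59:25 and 05:07:55 log lines, and the following attempt would only start after the function has already timed out.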

Describe the solution you'd like

I'd like to be able to disable the retry functionality in the indexing pipeline with a configuration option and/or an environment variable.

If a user passes in an invalid configuration that causes an error while spawning the indexing pipeline (like the load_source error in my case), the pipeline retries, potentially forever. That isn't a problem for CLI users, who can kill the process, but it is problematic for Lambda and other environments.

A TODO message indicates this might have already been considered.
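A minimal sketch of what such an option could look like, assuming a hypothetical environment variable (called `MAX_SPAWN_RETRIES` here) whose value caps the number of retries. `RetryPolicy`, `spawn_with_retries`, the variable name, and the cap on the delay are all invented for illustration and are not part of Quickwit; the doubling delay mirrors the values seen in the log.

```rust
use std::time::Duration;

/// Hypothetical retry policy. `max_retries == None` keeps retrying forever.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RetryPolicy {
    max_retries: Option<u32>,
}

impl RetryPolicy {
    /// Builds a policy from the value of a hypothetical env var such as
    /// `MAX_SPAWN_RETRIES`. Unset or unparsable values mean "unbounded".
    fn from_env_value(value: Option<&str>) -> Self {
        RetryPolicy {
            max_retries: value.and_then(|v| v.trim().parse().ok()),
        }
    }

    /// Delay before the next attempt, or `None` when the policy gives up.
    fn next_delay(&self, retry_count: u32) -> Option<Duration> {
        match self.max_retries {
            Some(max) if retry_count >= max => None,
            // Doubling delay; the exponent is capped (arbitrarily) at 9.
            _ => Some(Duration::from_secs(2u64 << retry_count.min(9))),
        }
    }
}

/// Calls `spawn` until it succeeds or the policy gives up, then returns the
/// last error. `sleep` is injected so the example runs without waiting.
fn spawn_with_retries<E>(
    policy: RetryPolicy,
    mut spawn: impl FnMut() -> Result<(), E>,
    mut sleep: impl FnMut(Duration),
) -> Result<(), E> {
    let mut retry_count = 0;
    loop {
        match spawn() {
            Ok(()) => return Ok(()),
            Err(err) => match policy.next_delay(retry_count) {
                Some(delay) => {
                    sleep(delay);
                    retry_count += 1;
                }
                None => return Err(err),
            },
        }
    }
}

fn main() {
    // "0" disables retries: the first failure is returned to the caller.
    let policy = RetryPolicy::from_env_value(Some("0"));
    let result = spawn_with_retries(
        policy,
        || Err::<(), _>("unknown URI protocol `https`"),
        |_delay| {}, // a real caller would sleep here
    );
    println!("{result:?}");
}
```

With retries disabled, a configuration error like the one above would surface as a failed invocation right away instead of holding the function open until the timeout.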

Describe alternatives you've considered

Another solution, specific to the Lambda environment, would be to modify the request handler to accept a separate kill endpoint/event type that calls indexing_service_handle.kill(), so users could manually stop indexing if they want. This might also be useful in cases other than an error loop, such as an unexpectedly large input dataset.
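A rough sketch of the dispatch such a handler could perform, under heavy assumptions: `IndexerEvent`, the plain-text payload format, and `FakeIndexingServiceHandle` are hypothetical stand-ins written for this example, and only the `kill()` call corresponds to something named above. How a kill event would reach the instance that is actually running the pipeline is left out.

```rust
/// Hypothetical event types for the indexer handler.
#[derive(Debug, PartialEq, Eq)]
enum IndexerEvent {
    /// Index the file at the given URI (the existing behavior).
    Index { source_uri: String },
    /// Stop the running indexing service (the proposed addition).
    Kill,
}

impl IndexerEvent {
    /// Parses a simplified payload: `index <uri>` or `kill`.
    fn parse(payload: &str) -> Option<Self> {
        let mut parts = payload.split_whitespace();
        match (parts.next()?, parts.next()) {
            ("index", Some(uri)) => Some(IndexerEvent::Index {
                source_uri: uri.to_string(),
            }),
            ("kill", None) => Some(IndexerEvent::Kill),
            _ => None,
        }
    }
}

/// Stand-in for the real indexing service handle; it only records calls.
#[derive(Default)]
struct FakeIndexingServiceHandle {
    started: Vec<String>,
    killed: bool,
}

impl FakeIndexingServiceHandle {
    fn start(&mut self, source_uri: &str) {
        self.started.push(source_uri.to_string());
    }

    fn kill(&mut self) {
        self.killed = true;
    }
}

/// Routes an event to the matching action on the handle.
fn handle_event(event: IndexerEvent, handle: &mut FakeIndexingServiceHandle) -> &'static str {
    match event {
        IndexerEvent::Index { source_uri } => {
            handle.start(&source_uri);
            "indexing started"
        }
        IndexerEvent::Kill => {
            handle.kill();
            "indexing stopped"
        }
    }
}

fn main() {
    let mut handle = FakeIndexingServiceHandle::default();
    for payload in ["index s3://my-bucket/data.json", "kill"] {
        match IndexerEvent::parse(payload) {
            Some(event) => println!("{payload} -> {}", handle_event(event, &mut handle)),
            None => println!("{payload} -> unrecognized event"),
        }
    }
    println!("started={:?} killed={}", handle.started, handle.killed);
}
```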