quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Running a second local-ingest instance breaks the first one #4826

Open nh2 opened 6 months ago

nh2 commented 6 months ago

While one local-ingest is running, if I start another one, the first one breaks.

Steps to reproduce the behavior:

  1. quickwit-0.8.0
  2. quickwit tool local-ingest --index my_logs --input-path /logs/node-4 (this runs at about 80 MB/s)
  3. While the above is running, in another terminal: quickwit tool local-ingest --index my_logs --input-path /logs/node-5

The second local-ingest starts to run, and the first one starts outputting errors:

 [first running at 80 MB/s; as soon as I start the second one, the throughput drops down to 28 MB/s for a few seconds]
 Num docs 279120325 Parse errs     2 PublSplits 161 Input size 110445MB Thrghput 28.14MB/s Time 00:27:08
 Num docs 279181921 Parse errs     2 PublSplits 161 Input size 110468MB Thrghput 26.59MB/s Time 00:27:09
 Num docs 279264368 Parse errs     2 PublSplits 161 Input size 110499MB Thrghput 27.64MB/s Time 00:27:10
 Num docs 279334404 Parse errs     2 PublSplits 161 Input size 110526MB Thrghput 25.39MB/s Time 00:27:11
 Num docs 279411928 Parse errs     2 PublSplits 161 Input size 110555MB Thrghput 27.59MB/s Time 00:27:12
2024-04-01T11:20:08.098Z ERROR quickwit_actors::spawn_builder: actor-failure cause=Failed to open file for read: 'FileDoesNotExist("/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm")'

Caused by:
    Files does not exist: "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm" exit_status=Failure(Failed to open file for read: 'FileDoesNotExist("/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm")'

Caused by:
    Files does not exist: "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm")
2024-04-01T11:20:08.098Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=quickwit_indexing::actors::index_serializer::IndexSerializer-snowy-haD5 exit_status=Failure(Failed to open file for read: 'FileDoesNotExist("/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm")'

Caused by:
    Files does not exist: "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6BWFDF6MX98EY0QXV35Y-fe7Kdy/8410de7d3491410594e1ca2bb4a7b810.fieldnorm")
2024-04-01T11:20:08.105Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-floral-a2Tz exit_status=DownstreamClosed
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="SourceActor-floral-a2Tz"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::doc_processor::DocProcessor-polished-sxQZ"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Indexer-frosty-l45l"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::index_serializer::IndexSerializer-snowy-haD5"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Packager-floral-xkMG"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="IndexUploader-dark-h3aq"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-red-qRI6"
2024-04-01T11:20:08.635Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Publisher-summer-RoNq"
2024-04-01T11:20:08.635Z ERROR quickwit_indexing::actors::indexing_pipeline: Indexing pipeline failure. pipeline_id=IndexingPipelineId { node_id: "corp1", index_uid: IndexUid { index_id: "server_logs", incarnation_id: Ulid(2069641637541167640949149121706776649) }, source_id: "_ingest-cli-source", pipeline_uid: Pipeline(01HTCMMVKZDHYR2JCX17C9TDXM) } generation=1 healthy_actors=[] failed_or_unhealthy_actors=["SourceActor-floral-a2Tz", "quickwit_indexing::actors::doc_processor::DocProcessor-polished-sxQZ", "Indexer-frosty-l45l", "quickwit_indexing::actors::index_serializer::IndexSerializer-snowy-haD5", "Packager-floral-xkMG", "IndexUploader-dark-h3aq", "quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-red-qRI6", "Publisher-summer-RoNq"] success_actors=[]
 Num docs 279452045 Parse errs     2 PublSplits 161 Input size 110570MB Thrghput 25.56MB/s Time 00:27:13
2024-04-01T11:20:09.721Z ERROR quickwit_actors::spawn_builder: actor-failure cause=No such file or directory (os error 2) at path "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6Q7SDBKYJSZ7GWDF1BQ2-fvNUDh" exit_status=Failure(No such file or directory (os error 2) at path "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6Q7SDBKYJSZ7GWDF1BQ2-fvNUDh")
2024-04-01T11:20:09.721Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=Indexer-lively-X4Fw exit_status=Failure(No such file or directory (os error 2) at path "/root/tmp/quickwit-test/qwdata/indexing/server_logs%01HTCKC0P2HQM61NYMCSKS9P29%_ingest-cli-source%01HTCMMVKZDHYR2JCX17C9TDXM%DRFb2x/split-01HTCP6Q7SDBKYJSZ7GWDF1BQ2-fvNUDh")
2024-04-01T11:20:09.729Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-green-KTDh exit_status=DownstreamClosed
 Num docs 279452045 Parse errs     2 PublSplits 161 Input size 110570MB Thrghput 17.76MB/s Time 00:27:14
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="SourceActor-green-KTDh"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::doc_processor::DocProcessor-frosty-TOeC"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Indexer-lively-X4Fw"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::index_serializer::IndexSerializer-lingering-fYsB"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Packager-frosty-WMfB"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="IndexUploader-dry-IGTC"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-long-KtXl"
2024-04-01T11:20:10.663Z ERROR quickwit_actors::actor_handle: actor-exit-without-success actor="Publisher-lingering-Gsjd"
2024-04-01T11:20:10.663Z ERROR quickwit_indexing::actors::indexing_pipeline: Indexing pipeline failure. pipeline_id=IndexingPipelineId { node_id: "corp1", index_uid: IndexUid { index_id: "server_logs", incarnation_id: Ulid(2069641637541167640949149121706776649) }, source_id: "_ingest-cli-source", pipeline_uid: Pipeline(01HTCMMVKZDHYR2JCX17C9TDXM) } generation=2 healthy_actors=[] failed_or_unhealthy_actors=["SourceActor-green-KTDh", "quickwit_indexing::actors::doc_processor::DocProcessor-frosty-TOeC", "Indexer-lively-X4Fw", "quickwit_indexing::actors::index_serializer::IndexSerializer-lingering-fYsB", "Packager-frosty-WMfB", "IndexUploader-dry-IGTC", "quickwit_indexing::actors::sequencer::Sequencer<quickwit_indexing::actors::publisher::Publisher>-long-KtXl", "Publisher-lingering-Gsjd"] success_actors=[]
 Num docs 279453367 Parse errs     2 PublSplits 161 Input size 110571MB Thrghput 11.26MB/s Time 00:27:15

Cancelling the second one does not recover the first one; it continues to output such errors in a loop.

Expected behavior

Both ingests work.

Configuration:

index_config.yaml:

version: 0.7
index_id: server_logs
doc_mapping:
  mode: dynamic
  field_mappings:
    - name: datetime
      type: datetime
      fast: true
      input_formats:
        - iso8601
      output_format: iso8601
      fast_precision: seconds
    - name: git
      type: text
      tokenizer: raw
    - name: hostname
      type: text
      tokenizer: raw
    - name: level
      type: text
      tokenizer: raw
    - name: message
      type: text
  timestamp_field: datetime

search_settings:
  default_search_fields: [message]

indexing_settings:
  commit_timeout_secs: 10

nh2 commented 6 months ago

I guess that's this:

https://quickwit.io/docs/configuration/metastore-config#examples

The file-backed metastore does not support multiple instances running at the same time because it does not implement any locking mechanism to prevent concurrent writes from overwriting each other. Ensure that only one file-backed metastore instance is running at all times.

That sounds like a bad idea: the file-backed metastore is the default, yet the user gets no notice that it does not support concurrent use.

I recommend using at least some basic advisory locking (man 2 flock) so that the second local-ingest can output something like:

Waiting while another process has the file-backed metastore lock...

This can usually be implemented with just a few lines of code.
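
To illustrate the idea, here is a minimal sketch of flock-based advisory locking. It is written in Python rather than Quickwit's Rust purely for brevity, it is Unix-only, and it is not Quickwit's actual code. The function names and the lock file path are hypothetical; a real implementation would have to decide where the lock file lives relative to the metastore and how long the lock is held.

```python
import fcntl
import os
import time

WAIT_MESSAGE = "Waiting while another process has the file-backed metastore lock..."


def acquire_metastore_lock(lock_path, poll_interval=1.0, on_wait=print):
    """Block until an exclusive advisory lock on lock_path is held.

    Returns the open file descriptor; the lock lasts until it is unlocked
    or closed. lock_path is a hypothetical lock file, not a real Quickwit path.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    announced = False
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking,
            # so the caller can tell the user why it is waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if not announced:
                on_wait(WAIT_MESSAGE)
                announced = True
            time.sleep(poll_interval)


def release_metastore_lock(fd):
    """Release the advisory lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

With something like this around metastore writes, the second local-ingest would print the waiting message once and then proceed when the first one releases the lock, instead of silently overwriting its state. Because flock is advisory, it only helps if every process that touches the metastore takes the same lock.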