sergeyignatov opened this issue 1 month ago
@trinity-1686a any idea what happened here?
The error message is a bit misleading. If I understand correctly, this error happens when a split was expected to be Published and actually is not.
The only situation where I can imagine a split being Staged according to the metastore, yet a merge candidate for an indexer, is if the metastore was rolled back. That can happen easily with the S3/GCS metastore (if multiple metastores are running and fighting each other), but not so much on Postgres.
The more likely scenario is that the splits are already MarkedForDeletion, which could happen due to the same kind of issue, but also for other reasons. The only persistent issue that comes to mind would be two indexers having the same node_id, so that they try to merge each other's splits.
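If duplicate node_ids turn out to be the cause, one way to rule them out is to pin an explicit, unique node_id on every indexer instead of relying on the hostname default. A minimal node-config sketch (assuming the usual quickwit.yaml node config layout; the cluster and node names are illustrative):

```yaml
# quickwit.yaml on each indexer (illustrative values)
version: 0.8
cluster_id: quickwit-prod
# node_id defaults to the short hostname; setting it explicitly ensures no two
# indexers share the same identity and try to merge each other's splits
node_id: quickwit-indexer-27
```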
Do you know if some of your indexers have the same node_id (either explicitly configured, or the same short hostname)? If you search for 01J7WT2D80WANG4V0EN4FVZHKF in all your logs, do you get logs from more than one indexer?
The split 01J7WT2D80WANG4V0EN4FVZHKF is present only in the quickwit-indexer-27 and metastore logs.
2024-09-16T07:13:19.533Z INFO stage_splits{split_ids="[\"01J7WTV5YEJ4KCD5B27SZ2XMXS\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12
2024-09-16T07:13:19.534Z INFO stage_splits{split_ids="[\"01J7WTV5JQ2HW1XD7XEWB5CBFG\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12
2024-09-16T07:13:19.634Z WARN publish_splits{request=PublishSplitsRequest { index_uid: Some(IndexUid { index_id: "fluentbit-logs-2024-09-12", incarnation_id: Ulid(2086843892575080035060981602672116818) }), staged_split_ids: ["01J7WTQKDHE1DSZJCC8HZZBM6V"], replaced_split_ids: ["01J7WSVJXY8G25A2Y99V580GNC", "01J7WT2D80WANG4V0EN4FVZHKF", "01J7WT5NG5PNZ571V6PGBMH5TZ", "01J7WTQ9BFBHCHRXT9MXZ265DA", "01J7WTKJSX9DZ5GACD8GGK3E5T", "01J7WTCEM5PXZ6BCCN8CP9YG2B", "01J7WSQS4WYMB8V98SJSA89BP4", "01J7WT8X7M3YNXWTYB5SVQM6X9", "01J7WTFVCS7S8NBTFRCN0ZNWAB", "01J7WSZ07FDD1DB8SE1M8ZX2ZT"], index_checkpoint_delta_json_opt: None, publish_token_opt: None }}: quickwit_metastore::metastore::postgres::metastore: rollback
2024-09-16T07:13:19.636Z INFO publish_splits{request=PublishSplitsRequest { index_uid: Some(IndexUid { index_id: "fluentbit-logs-connect-2024-02-05", incarnation_id: Ulid(2068240375732922833404597870727557083) }), staged_split_ids: ["01J7WTV646WZ63XE6NXYJVYVD1"], replaced_split_ids: [], index_checkpoint_delta_json_opt: Some("{\"source_id\":\"_ingest-api-source\",\"source_delta\":{\"per_partition\":{\"ingest_partition_01J7WNDBAVYDR8BBZB14NBGMA4\":{\"from\":\"00000000000000449363\",\"to\":\"00000000000000458417\"}}}}"), publish_token_opt: None }}: quickwit_metastore::metastore::postgres::metastore: published 1 splits and marked 0 for deletion successfully index_id=fluentbit-logs-connect-2024-02-05
2024-09-16T07:13:19.755Z INFO stage_splits{split_ids="[\"01J7WTV66RWB63VWTKNGH999D4\"]"}: quickwit_metastore::metastore::postgres::metastore: staged `1` splits successfully index_id=fluentbit-logs-2024-09-12
Also, in the affected index the "Timestamp range start" is not equal to "Timestamp range end" minus the retention period:
General Information
--------------------------------------------+-------------------------------------------------------------------------------------
Index ID | fluentbit-logs-2024-09-12
Number of published documents | 357.045612 M (357,045,612)
Size of published documents (uncompressed) | 499.8 GB
Number of published splits | 822
Size of published splits | 149.7 GB
Timestamp field | "timestamp"
Timestamp range start | 2024-09-06 06:11:43 (Timestamp: 1725603103)
Timestamp range end | 2024-09-17 18:15:21 (Timestamp: 1726596921)
How likely is it that logs with a timestamp older than now() - "retention days" could break the retention logic?
The retention logic looks at the timestamp range end of individual splits. If Quickwit receives a batch of documents that are retention-days old and makes a split out of only those (with no recent documents), the retention policy will mark that split for deletion on the next round. If the indexer tries to merge it after that, it will probably hit the error you're getting.
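For context, this behavior is driven by the index's timestamp field and retention policy. A minimal sketch of the relevant index-config sections (field names follow the usual Quickwit 0.8 index config layout; the 7-day period and daily schedule are placeholders, not this index's actual settings):

```yaml
# index config sketch (illustrative values, not the actual settings of this index)
version: 0.8
index_id: fluentbit-logs-2024-09-12
doc_mapping:
  field_mappings:
    - name: timestamp
      type: datetime
      fast: true
  timestamp_field: timestamp   # retention is evaluated against this field
retention:
  period: 7 days    # a split is marked for deletion once its time range end is older than this
  schedule: daily
```

With such a policy, a split built entirely from old documents starts out with a time range end that is already past the retention cutoff, so the next retention round can mark it for deletion while a pending merge still references it.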
We started to drop logs with timestamps in the far past on the fluent-bit side, and the issue has been resolved.
It would be good to have such protection on the Quickwit ingestion side.
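Until such a safeguard exists, the drop can be done on the shipper side. A sketch of one way to do it in fluent-bit (this assumes a fluent-bit version that supports the YAML config format and inline Lua via the lua filter's code option; the drop_old function name and the 7-day cutoff are illustrative, not the reporter's actual setup):

```yaml
# fluent-bit pipeline sketch: drop records whose event time is in the far past
pipeline:
  filters:
    - name: lua
      match: '*'
      call: drop_old
      code: |
        function drop_old(tag, timestamp, record)
          -- timestamp is the record's event time in epoch seconds;
          -- return -1 to drop the record, 0 to keep it unchanged
          if os.time() - timestamp > 7 * 24 * 3600 then
            return -1, timestamp, record
          end
          return 0, timestamp, record
        end
```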
I got the same error log. It occurs when a split that has reached the retention period is deleted. Does this situation affect the incoming ingested data? I'd like to know whether the incoming data will be lost or not. @trinity-1686a
When a precondition fails on a merge, the splits used in the merge stay in the same state: they don't get marked for deletion until the merge succeeds. This may result in sub-optimal search (it's faster to search a single large split than a dozen smaller ones), but it doesn't cause loss or duplication of data.
Hi, we have 4 indexes in Quickwit, all stored in GCS, running Quickwit version 0.8.2. An issue recently started with the biggest index, fluentbit-logs-2024-09-12, where splits older than 1h are automatically removed. The other indexes are fine, and the issue did not go away after recreating the index.
The affected index has far more Staged splits than the other indexes.
In the indexer logs, the only errors are:
index config