laurids-reichardt opened 2 years ago
Just compiled the latest main branch, including this PR: https://github.com/quickwit-oss/quickwit/pull/1717
Doesn't solve the issue in this case, but the error message changes to the following:
```
2022-07-16T01:44:26.987Z ERROR fetch_docs: quickwit_search::fetch_docs: Error when fetching docs in splits. split_ids=["01G7YST0ZMXB9VKBQCR3B6JAEN", "01G7YY6GH3N259P137D7TF6PA7", "01G7YXYNFFVCQBY8Y2B5078EWW"] error=searcher-doc-async

Caused by:
    An IO error occurred: 'Error obtaining chunk: error reading a body from connection: unexpected end of file Ctx:Failed to fetch slice 1420380118..1420730462 for object: s3://test-bucket/indexes/reddit_comments_v05/01G7YST0ZMXB9VKBQCR3B6JAEN.split'
```
Tried to reproduce this on Cloudflare with the hdfs dataset, without success so far:

- `/search?query=severity_text:ERROR&max_hits=3000` with 260 num_hits runs fine.
- `/search?query=severity_text:INFO&max_hits=3000` with 2716636 num_hits runs fine.

```
leaf_search index="hdfs_large" splits=[
    SplitIdAndFooterOffsets { split_id: "01G92CG810QEQ8FN7G18ZMQF4S", split_footer_start: 28996247, split_footer_end: 29004531 },
    SplitIdAndFooterOffsets { split_id: "01G92CJ2NKBBNVQAF48AWY3ZED", split_footer_start: 28097070, split_footer_end: 28105322 },
    SplitIdAndFooterOffsets { split_id: "01G92CKXC20W7625X3493EAYC3", split_footer_start: 28951191, split_footer_end: 28959463 },
    SplitIdAndFooterOffsets { split_id: "01G92CNR3BZ62FYJJX0N3WVNHC", split_footer_start: 30100613, split_footer_end: 30109064 },
    SplitIdAndFooterOffsets { split_id: "01G92D7AG96SBF0SRJTX6HB89W", split_footer_start: 30082968, split_footer_end: 30091340 },
    SplitIdAndFooterOffsets { split_id: "01G92D95820XT2JKRFNYHP66CQ", split_footer_start: 31186028, split_footer_end: 31194332 },
    SplitIdAndFooterOffsets { split_id: "01G92DAZXR9XPQRJYV1MVFADFP", split_footer_start: 29952318, split_footer_end: 29960094 }
]
```
@laurids-reichardt it seems you are using an open dataset; could you please share the link? Your index config would also help, so we can run with the exact field options.
@evanxg852000 Yes, here are the steps to reproduce a similar setup:
```shell
# download the dataset
curl -O https://files.pushshift.io/reddit/comments/RC_2022-04.zst

# decompress and ingest the dataset via the CLI
zstd -d --stdout RC_2022-04.zst --long=31 | ./quickwit index ingest --index reddit_comments_v05
```
Index config:

```yaml
version: 0
index_id: reddit_comments_v05
index_uri: "s3://test-bucket/indexes/reddit_comments_v05"
doc_mapping:
  field_mappings:
    - name: created_utc
      type: i64
      fast: true
    - name: body
      type: text
      tokenizer: en_stem
      record: position
indexing_settings:
  timestamp_field: created_utc
  commit_timeout_secs: 1200
  resources:
    num_threads: 4
search_settings:
  default_search_fields: [body]
```
My original setup also included the comment id of the above-mentioned dataset (the base36 `name` field converted to i64) as another fast field:

```yaml
- name: comment_id
  type: i64
  fast: true
```

It's a bit more involved to get this value into the index, since it depends on clickhouse-local for preprocessing. If you believe this to be relevant, I'm happy to provide more instructions on how to do this.
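For reference, the base36-to-i64 conversion itself is straightforward; a minimal Python sketch (the function name is mine, and it assumes the bare base36 id without any `t1_` type prefix):

```python
def base36_to_i64(s: str) -> int:
    """Convert a lowercase base36 string (digits 0-9, letters a-z) to an integer."""
    # Python's built-in int() accepts an explicit base up to 36.
    return int(s, 36)

# Sanity checks on the positional encoding:
assert base36_to_i64("z") == 35
assert base36_to_i64("10") == 36
```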
Describe the bug

Search requests with a high `max_hits` (> ~100) cause fetch errors with indexes stored on Cloudflare R2. This is likely because Quickwit exceeds the Cloudflare R2 limit of 1000 GetObject requests per second: https://developers.cloudflare.com/r2/platform/limits/#account-plan-limits

As discussed on Discord, it would be helpful if Quickwit propagated the Cloudflare R2 error message to confirm this theory.
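A rough back-of-envelope check of this theory (a sketch only; the one-request-per-document assumption is mine, since Quickwit may batch range reads per split):

```python
# Assumption: roughly one ranged GetObject request per fetched document.
max_hits = 3000
r2_getobject_limit_per_sec = 1000  # Cloudflare R2 documented per-second limit

estimated_requests = max_hits  # upper-bound estimate for a single search
print(estimated_requests > r2_getobject_limit_per_sec)  # prints True
```

Even with heavy batching, a burst of this size would only need to exceed 1000 requests within one second to trip the limit.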
Steps to reproduce (if applicable)

Steps to reproduce the behavior:

Configuration: Please provide:

`quickwit --version`