Search requests with high `max_hits` (> ~100) cause fetch errors with indexes stored on Cloudflare R2. #1777


laurids-reichardt commented 2 years ago

**Describe the bug**
Search requests with high `max_hits` (> ~100) cause fetch errors with indexes stored on Cloudflare R2.

This is likely because Quickwit goes over the Cloudflare R2 limit of 1000 GetObject requests per second: https://developers.cloudflare.com/r2/platform/limits/#account-plan-limits
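One quick way to test the rate-limit theory (a sketch for illustration, reusing the endpoint from the reproduction below) is to sweep `max_hits` and watch where the failures start:

# Probe the max_hits threshold; endpoint and index name are taken from the repro below.
for hits in 10 50 100 200 500 1000; do
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://127.0.0.1:7280/api/v1/reddit_comments_v05/search?query=frankfurt&max_hits=${hits}")
  echo "max_hits=${hits} -> HTTP ${status}"
done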

As discussed on Discord, it would help to confirm this theory if Quickwit propagated the Cloudflare R2 error message.

**Steps to reproduce (if applicable)**
Steps to reproduce the behavior:

  1. Search request:
    ❯ curl "http://127.0.0.1:7280/api/v1/reddit_comments_v05/search?query=frankfurt&max_hits=1000&sort_by_field=-created_utc"
    {
    "InternalError": "Internal error: `Error when fetching docs for splits [\"01G7YXYNFFVCQBY8Y2B5078EWW\", \"01G7YXPQQM8DER0F8112WK5C2P\", \"01G7YY6GH3N259P137D7TF6PA7\", \"01G7YWQHPA2Q1PHK8B0BXS0ZJP\"]: searcher-doc-async\n\nCaused by:\n    An IO error occurred: 'Failed to fetch slice 1448721986..1449072261 for object: s3://test-bucket/indexes/reddit_comments_v05/01G7YXPQQM8DER0F8112WK5C2P.split'.`."
    }
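Since R2 speaks the S3 API, the failing byte range from the error above can also be replayed directly with the AWS CLI to surface R2's raw response (a sketch; `<ACCOUNT_ID>` is a placeholder, and the range end is shifted by one because HTTP ranges are inclusive):

# Replay the exact slice from the error message against R2 directly.
aws s3api get-object \
  --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --bucket test-bucket \
  --key indexes/reddit_comments_v05/01G7YXPQQM8DER0F8112WK5C2P.split \
  --range "bytes=1448721986-1449072260" \
  /tmp/slice.bin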

**Configuration:** Please provide:

  1. Output of quickwit --version
    ❯ RUST_LOG=quickwit=debug bin/quickwit --version
    Quickwit 0.3.1-nightly
laurids-reichardt commented 2 years ago

Just compiled the latest main branch, including this PR: https://github.com/quickwit-oss/quickwit/pull/1717

Doesn't solve the issue in this case, but the error message changes to the following:

2022-07-16T01:44:26.987Z ERROR fetch_docs: quickwit_search::fetch_docs: Error when fetching docs in splits. split_ids=["01G7YST0ZMXB9VKBQCR3B6JAEN", "01G7YY6GH3N259P137D7TF6PA7", "01G7YXYNFFVCQBY8Y2B5078EWW"] error=searcher-doc-async

Caused by:
    An IO error occurred: 'Error obtaining chunk: error reading a body from connection: unexpected end of file Ctx:Failed to fetch slice 1420380118..1420730462 for object: s3://test-bucket/indexes/reddit_comments_v05/01G7YST0ZMXB9VKBQCR3B6JAEN.split'
evanxg852000 commented 2 years ago

I tried to reproduce this on Cloudflare with the hdfs dataset, without success.

leaf_search index="hdfs_large" splits=[
    SplitIdAndFooterOffsets { split_id: "01G92CG810QEQ8FN7G18ZMQF4S", split_footer_start: 28996247, split_footer_end: 29004531 }, 
    SplitIdAndFooterOffsets { split_id: "01G92CJ2NKBBNVQAF48AWY3ZED", split_footer_start: 28097070, split_footer_end: 28105322 }, 
    SplitIdAndFooterOffsets { split_id: "01G92CKXC20W7625X3493EAYC3", split_footer_start: 28951191, split_footer_end: 28959463 }, 
    SplitIdAndFooterOffsets { split_id: "01G92CNR3BZ62FYJJX0N3WVNHC", split_footer_start: 30100613, split_footer_end: 30109064 }, 
    SplitIdAndFooterOffsets { split_id: "01G92D7AG96SBF0SRJTX6HB89W", split_footer_start: 30082968, split_footer_end: 30091340 }, 
    SplitIdAndFooterOffsets { split_id: "01G92D95820XT2JKRFNYHP66CQ", split_footer_start: 31186028, split_footer_end: 31194332 }, 
    SplitIdAndFooterOffsets { split_id: "01G92DAZXR9XPQRJYV1MVFADFP", split_footer_start: 29952318, split_footer_end: 29960094 }
]

@laurids-reichardt it seems you are using an open dataset; could you please drop the link? Your index config would also help for running with the exact field options.

laurids-reichardt commented 2 years ago

@evanxg852000 Yes, here are the steps to reproduce a similar setup:

# download the dataset
curl -O https://files.pushshift.io/reddit/comments/RC_2022-04.zst

# decompress and ingest dataset via CLI
zstd -d --stdout RC_2022-04.zst --long=31 | ./quickwit index ingest --index reddit_comments_v05
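
# Optional sanity check after ingestion (a sketch; flag names are from the
# 0.3-era CLI and may differ by version): a local search should return hits.
./quickwit index search --index reddit_comments_v05 --query frankfurt --max-hits 10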

Index config:

version: 0
index_id: reddit_comments_v05
index_uri: "s3://test-bucket/indexes/reddit_comments_v05"

doc_mapping:
  field_mappings:
    - name: created_utc
      type: i64
      fast: true

    - name: body
      type: text
      tokenizer: en_stem
      record: position

indexing_settings:
  timestamp_field: created_utc
  commit_timeout_secs: 1200
  resources:
    num_threads: 4

search_settings:
  default_search_fields: [body]

My original setup also included the comment ID (the `name` field of the above-mentioned dataset, converted from base36 to i64) as another fast field:

    - name: comment_id
      type: i64
      fast: true

Getting this value into the index is a bit more involved, as it depends on clickhouse-local for preprocessing. If you believe this to be relevant, I'm happy to provide more detailed instructions.
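
For reference, the conversion itself is simple; here is a rough sketch in plain bash rather than the clickhouse-local pipeline (the comment `name` below is made up):

# Base36 -> i64: strip the "t1_" comment prefix, then let bash arithmetic
# interpret the remainder as base-36 (digits 0-9 plus a-z).
name="t1_i2a3b4c"   # hypothetical Reddit comment "name"
echo $(( 36#${name#t1_} ))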