lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Scanning dataset with scalar indices results in `Generic S3 Error: error decoding response body` #2839

Open tonyf opened 2 months ago

tonyf commented 2 months ago

I have a remote dataset stored on S3. Without scalar indices, using the scanner API with a filter works fine. However, once a scalar index is added, I get an OSError: Io error: Execution error: Wrapped error: LanceError(IO): Generic S3 error: error decoding response body

import lance

dataset = lance.dataset("s3://...")

# fragment_idxs holds the indices of the fragments to read (defined elsewhere)
fragments = [dataset.get_fragment(idx) for idx in fragment_idxs]
scanner = dataset.scanner(
    filter="source in ('imagenet')",
    columns=["image", "split", "source"],
    batch_size=4,
    fragments=fragments,
    batch_readahead=16,
    fragment_readahead=4,
)
for batch in scanner.to_batches():
    pass

Exception:

  File "/home/tony/workspace/models/data/utils/datasets/lance/sampler.py", line 348, in iter_batches
    for idx, batch in enumerate(scanner.to_batches()):
  File "/home/tony/workspace/models/.venv/lib/python3.10/site-packages/lance/dataset.py", line 2519, in to_batches
    yield from self.to_reader()
  File "pyarrow/ipc.pxi", line 671, in pyarrow.lib.RecordBatchReader.__next__
  File "pyarrow/ipc.pxi", line 705, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Io error: Execution error: Wrapped error: LanceError(IO): Generic S3 error: error decoding response body, /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5, /home/runner/work/lance/lance/rust/lance-io/src/scheduler.rs:258:27

If I check out the dataset at a version without the scalar index, no exception is raised.
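
For reference, a scalar index like the one involved here is typically created with create_scalar_index, and an earlier version can be opened by passing version to lance.dataset. This is only a sketch; the exact index parameters and version number used for this dataset may differ.

import lance

dataset = lance.dataset("s3://...")

# Sketch: add a BTREE scalar index on the filter column so that predicates
# like `source in (...)` can be answered from the index.
# (Not necessarily the exact call used for this dataset.)
dataset.create_scalar_index("source", index_type="BTREE")

# Opening the dataset at an earlier version (before the index was added)
# avoids the error. The version number below is a placeholder.
dataset_without_index = lance.dataset("s3://...", version=1)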

westonpace commented 1 month ago

Are you able to reproduce this reliably? The "error decoding response body" message is a fairly generic error returned by our HTTP client library.

The plan generated by that scan will be a "late materialized scan". First it reads the source column (the filter column) to determine which row ids satisfy the filter, then it issues a take operation to fetch the remaining columns for those row ids. If there is a scalar index, we skip reading the filter column entirely and query the index instead to figure out which row ids to grab.
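
If it helps, you can print the plan for that scan and confirm whether the scalar index is being used. This assumes your pylance version exposes LanceScanner.explain_plan; a sketch:

import lance

dataset = lance.dataset("s3://...")
scanner = dataset.scanner(
    filter="source in ('imagenet')",
    columns=["image", "split", "source"],
)
# Prints the physical plan. With a scalar index on the source column, the plan
# should query the index for matching row ids instead of scanning the column.
print(scanner.explain_plan(verbose=True))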

The only thing that jumps to mind is that maybe we are triggering more concurrent reads to S3 when a scalar index is used? I think the last time I encountered that error message it was related to reading very large values in a single request from S3.
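
One quick way to test that theory is to rerun the scan with the readahead settings dialed down and see whether the error still occurs. A sketch reusing the parameters from your snippet:

import lance

dataset = lance.dataset("s3://...")
# Sketch: lower batch/fragment readahead to reduce the number of concurrent
# S3 requests in flight; if the error goes away, concurrency is a likely culprit.
scanner = dataset.scanner(
    filter="source in ('imagenet')",
    columns=["image", "split", "source"],
    batch_size=4,
    batch_readahead=1,
    fragment_readahead=1,
)
for batch in scanner.to_batches():
    pass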

Can you generate a trace and attach it? Just put this at the beginning of the script:

from lance.tracing import trace_to_chrome
trace_to_chrome(file="/some/file.json", level="debug")

Then it will generate /some/file.json. You can attach that to this issue and we can take a look. Also, you can use https://ui.perfetto.dev/ if you want to look at the trace yourself.