lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.96k stars, 219 forks

[bug] Using gpu to train index would raise row_id out of range issue. #1662

Closed jerryyifei closed 11 months ago

jerryyifei commented 11 months ago

Python: 3.8.9 pylance: 0.8.17

  File "create_index.py", line 6, in <module>
    full_ds.create_index(
  File "/usr/local/lib/python3.8/dist-packages/lance/dataset.py", line 992, in create_index
    ivf_centroids = train_ivf_centroids_on_accelerator(
  File "/usr/local/lib/python3.8/dist-packages/lance/vector.py", line 173, in train_ivf_centroids_on_accelerator
    kmeans.fit(ds)
  File "/usr/local/lib/python3.8/dist-packages/lance/torch/kmeans.py", line 135, in fit
    self.total_distance = self._fit_once(
  File "/usr/local/lib/python3.8/dist-packages/lance/torch/kmeans.py", line 181, in _fit_once
    for idx, chunk in enumerate(data):
  File "/usr/local/lib/python3.8/dist-packages/lance/torch/data.py", line 122, in __iter__
    for batch in stream:
  File "/usr/local/lib/python3.8/dist-packages/lance/cache.py", line 68, in __iter__
    for batch in self.stream:
  File "/usr/local/lib/python3.8/dist-packages/lance/sampler.py", line 134, in maybe_sample
    yield from _efficient_sample(dataset, n, columns, batch_size, max_takes)
  File "/usr/local/lib/python3.8/dist-packages/lance/sampler.py", line 74, in _efficient_sample
    dataset.take(
  File "/usr/local/lib/python3.8/dist-packages/lance/dataset.py", line 452, in take
    return pa.Table.from_batches([self._ds.take(indices, columns)])
OSError: Invalid user input: Row index 26514819 is beyond the range of the dataset., /home/runner/work/lance/lance/rust/lance/src/dataset.rs:892:31
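
The final `OSError` says that the sampler asked `take` for row index 26514819 and the dataset rejected it as out of range. One way to narrow this down might be to compare the requested indices against the row count before calling `take`. The helper below is a hypothetical diagnostic in plain Python, not part of Lance; `out_of_range_indices` is a made-up name, and `num_rows` is assumed to come from something like `ds.count_rows()`.

```python
from typing import Iterable, List


def out_of_range_indices(indices: Iterable[int], num_rows: int) -> List[int]:
    """Return the indices that fall outside the valid range [0, num_rows)."""
    return [i for i in indices if i < 0 or i >= num_rows]


# Illustrative values only: the row count is invented for this sketch;
# the last index is the one reported in the error message.
bad = out_of_range_indices([0, 1_000, 26_514_819], num_rows=20_000_000)
print(bad)
```

If the list is non-empty for indices produced by the sampler, the sampling step rather than `take` itself would be the place to look.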

plaggy commented 11 months ago

Likely related: when indexing without an accelerator, I get a similar index-out-of-range error. My guess is that it is related to the "KMeans: cluster 186 is empty" warning, but how do I make sure there are no empty clusters?

[2023-11-26T10:11:45Z WARN  lance_linalg::kmeans] KMeans: cluster 186 is empty
  0%|                                                                                                                             | 0/1000 [00:00<?, ?it/s]thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance_background_thread' panicked at /home/runner/work/lance/lance/rust/lance/src/utils/tokio.rs:30:24:
called `Result::unwrap()` on an `Err` value: RecvError(())
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
thread 'lance_background_thread' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/task/core.rs:375:22:
JoinHandle polled after completion
thread 'lance_background_thread' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/task/core.rs:375:22:
JoinHandle polled after completion
  0%|                                                                                                                             | 0/1000 [00:00<?, ?it/s]
thread 'lance-cpu' panicked at /home/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:116:14:
range end index 229376 out of range for slice of length 228480
Traceback (most recent call last):
  File "/home/ubuntu/benchmarking/benchmark.py", line 219, in <module>
    eval_lance(query, ground_truth, 5, "cosine")
  File "/home/ubuntu/benchmarking/benchmark.py", line 146, in eval_lance
    res = tbl.search(list(q)).metric(metric).limit(k).refine_factor(ref_factor).to_list()
  File "/home/ubuntu/.pyenv/versions/rag/lib/python3.10/site-packages/lancedb/query.py", line 217, in to_list
    return self.to_arrow().to_pylist()
  File "/home/ubuntu/.pyenv/versions/rag/lib/python3.10/site-packages/lancedb/query.py", line 409, in to_arrow
    return self._table._execute_query(query)
  File "/home/ubuntu/.pyenv/versions/rag/lib/python3.10/site-packages/lancedb/table.py", line 970, in _execute_query
    return ds.to_table(
  File "/home/ubuntu/.pyenv/versions/rag/lib/python3.10/site-packages/lance/dataset.py", line 321, in to_table
    ).to_table()
  File "/home/ubuntu/.pyenv/versions/rag/lib/python3.10/site-packages/lance/dataset.py", line 1595, in to_table
    return self.to_reader().read_all()
  File "pyarrow/ipc.pxi", line 757, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Io error: Execution error: External error: Execution error: ExecNode(Take): thread panicked: task 3247216 panicked
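
A quick arithmetic check on the two numbers in the `pq.rs` panic may help narrow this down. The sketch below uses only the values from the log. Reading the result as "a buffer holding 255 rows of width 896 where 256 were expected" is an interpretation, not something the log confirms, and whether the missing row corresponds to the empty cluster is a guess.

```python
# Numbers copied from the panic message in pq.rs.
range_end = 229376  # end index the code tried to slice to
slice_len = 228480  # actual length of the slice

# How far past the end of the slice the requested range reaches.
shortfall = range_end - slice_len
print(shortfall)  # 896

# Both values are exact multiples of the shortfall.
assert range_end % shortfall == 0
assert slice_len % shortfall == 0

rows_expected = range_end // shortfall
rows_present = slice_len // shortfall
print(rows_expected, rows_present)  # 256 255
```

256 is the number of entries an 8-bit code can address, so the numbers are at least consistent with a codebook that is one entry short. That remains an assumption until checked against the source at `pq.rs:116`.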

amulil commented 11 months ago

@wjones127 thanks. When will it be merged into the main branch? Or how can I fix it in the current version?

rok commented 11 months ago

@amulil this was merged to main and we'll probably cut a new release including this fix within a week.