lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.47k stars 180 forks source link

fix: rework decoding to fix bugs in nested struct decoding #2337

Closed westonpace closed 4 weeks ago

westonpace commented 1 month ago

This PR modifies both scheduling and the decode stream. These modifications should mainly affect the logical encodings (list / struct). The changes to the other encodings are fairly minimal.

In the scheduler we now track where we are in the field tree. This is both for debugging purposes (so we can log in trace messages the current path) and because we now need to send the path as part of the message we send to the decode stream.

The decoder changes are more significant. Previously, we combined waiting for I/O and waiting for additional encoders into the same phase. This logic was more complex, but more importantly, it also assumed that the decoder could recreate the order in which the scheduler scheduled pages. For example, if we scan through the columns and encounter one that needs more data, then we grab exactly one page, and continue the column scan, grabbing the next page when we come by in another pass. This works in many cases but doesn't work in others. For example, the scheduler might schedule the many pages in a row for the same column if that column is wide or the page size is small. The decoder would get out of sync in these cases.

Now, the logic is simpler, and more correct. The decoder first waits for enough scheduling to be done that a batch can be delivered. During this pass we drain encoders from the scheduler and insert them into the current decoders not by "current position in the decode" but by the path that is now included with the decoder.

Once enough scheduling has been done we then wait for I/O.

Once I/O is done we then drain the decoders in the same fashion we did before.

westonpace commented 1 month ago

Leaving in draft until #2331 is complete

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 70.46784% with 101 lines in your changes are missing coverage. Please review.

Project coverage is 80.77%. Comparing base (00cda83) to head (7698a10).

Files Patch % Lines
rust/lance-encoding/src/decoder.rs 72.14% 34 Missing and 5 partials :warning:
...-encoding/src/encodings/logical/fixed_size_list.rs 0.00% 21 Missing :warning:
...ust/lance-encoding/src/encodings/logical/struct.rs 82.88% 14 Missing and 5 partials :warning:
rust/lance-encoding/src/encodings/logical/list.rs 65.71% 6 Missing and 6 partials :warning:
...ust/lance-encoding/src/encodings/logical/binary.rs 73.33% 3 Missing and 1 partial :warning:
.../lance-encoding/src/encodings/logical/primitive.rs 66.66% 3 Missing :warning:
rust/lance-file/src/v2/reader.rs 0.00% 1 Missing and 1 partial :warning:
rust/lance-encoding/src/testing.rs 75.00% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2337 +/- ## ========================================== - Coverage 80.80% 80.77% -0.04% ========================================== Files 192 192 Lines 56189 56343 +154 Branches 56189 56343 +154 ========================================== + Hits 45403 45509 +106 - Misses 8117 8169 +52 + Partials 2669 2665 -4 ``` | [Flag](https://app.codecov.io/gh/lancedb/lance/pull/2337/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/lancedb/lance/pull/2337/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb) | `80.77% <70.46%> (-0.04%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.