jacketsj opened this issue 3 weeks ago
The difference here is likely due to the fact that your test is reading the dataset in order without any sharding. This means you're getting the added benefit of batch and fragment read-aheads.

By default, the `LanceDataset` uses `ShardedFragmentSampler`, which has no fragment-level read-ahead. If you set `shard_granularity="batch"`, you'll probably get the same performance under your test. However, in a real-world setting where `shuffle=True`, dataloader `num_workers > 1`, and torch dist workers > 1, the fragment sampler will be faster.
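For concreteness, here is a rough sketch of the two configurations (a sketch only: the dataset path and column are placeholders, and the exact constructor arguments may differ between lance versions, so treat the parameter names as assumptions):

```python
import lance
from lance.torch.data import LanceDataset
from lance.sampler import ShardedFragmentSampler

uri = "/tmp/my_dataset.lance"  # placeholder path

# Roughly the default: fragment-granularity sharding via ShardedFragmentSampler,
# which does no fragment-level read-ahead.
fragment_sharded = LanceDataset(
    uri,
    batch_size=1024,
    columns=["vector"],
    sampler=ShardedFragmentSampler(rank=0, world_size=1),
)

# Batch-granularity sharding, which on a single in-order worker should behave
# much more like a plain sequential scan and benefit from read-ahead.
batch_sharded = LanceDataset(
    uri,
    batch_size=1024,
    columns=["vector"],
    shard_granularity="batch",
)
```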
If you want to compare, I have an experimental fragment-level sharded sampler with read-ahead here: https://gist.github.com/tonyf/7087dd3130ee5df1e93b862d55230f1c
And a row-id-based sharded sampler, which is more strictly correct under sharded scenarios though not as fast: https://gist.github.com/tonyf/d512e26183d97eb4fbae9c0b6abe5072

More on all of this here: https://github.com/lancedb/lance/discussions/2781
Thanks for the idea, although I'm currently convinced that it's from this function, which copies data into numpy as an intermediary (at least in 0.16.1). That's similar to what I was doing with `to_batches` before using `frombuffer`, which obtained near-identical performance to the torch integration above.
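To make the distinction concrete, here is a rough sketch of the two conversions (the column name and dtype are placeholders, and it assumes a fixed-size-list column with no nulls and a zero buffer offset):

```python
import lance
import numpy as np
import torch

ds = lance.dataset("/tmp/my_dataset.lance")  # placeholder path

for batch in ds.to_batches(columns=["vector"], batch_size=1024):
    col = batch.column("vector")  # assumed fixed_size_list<float32> column
    flat = col.values             # flat float32 child array, no copy

    # Fully copying conversion, for contrast: materializes Python/numpy objects
    # before handing them to torch.
    copied = torch.tensor(np.asarray(col.to_pylist(), dtype=np.float32))

    # frombuffer-style conversion: wrap the Arrow data buffer directly and view
    # it as a tensor (assumes no nulls and a zero buffer offset).
    arr = np.frombuffer(flat.buffers()[1], dtype=np.float32, count=len(flat))
    shared = torch.from_numpy(arr).reshape(len(col), -1)
    break
```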
There has been a significant change to that function since 0.16.1, so I need to fix #2803 (likely caused by that same change). These are not urgent atm, but should hopefully be quick to fix once I get to them.
I was writing some GPU code that iterates through an entire dataset, and I found that the torch integration was quite slow. I wrote up a simple `to_batches`-based implementation of what I was working on and found it to be much faster.
It is quite likely there are some cases my implementation does not handle, but at least in this case there is significant performance to be gained.
Here's some code that reproduces the issue (tested on lance 0.16.1, due to #2803):
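A minimal sketch of that kind of comparison (with a placeholder dataset path, column, and batch size rather than the original script) might look like:

```python
import time

import lance
from lance.torch.data import LanceDataset

uri = "/tmp/my_dataset.lance"   # placeholder dataset path
columns = ["vector"]            # placeholder column
batch_size = 1024

# One full pass through the torch integration.
start = time.perf_counter()
for _ in LanceDataset(uri, batch_size=batch_size, columns=columns):
    pass
print(f"torch integration: {time.perf_counter() - start:.2f}s")

# One full pass with a plain to_batches() scan.
ds = lance.dataset(uri)
start = time.perf_counter()
for _ in ds.to_batches(columns=columns, batch_size=batch_size):
    pass
print(f"to_batches scan:   {time.perf_counter() - start:.2f}s")
```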
Output (warnings removed):