Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
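For context, the advertised two-line Parquet conversion looks roughly like this (a minimal sketch; the file paths are hypothetical):

```python
import lance
import pyarrow.dataset as ds

# Hypothetical paths; write_dataset converts the scanned Parquet data to Lance format.
lance.write_dataset(ds.dataset("data.parquet"), "data.lance")
```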
Didn't dig into the cause of this, but the documentation and docstring (below) for `LanceFragment.to_batches` have the wrong function signature and details.
```
Fragment.to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)

Read the fragment as materialized record batches.

Parameters
----------
schema : Schema, optional
    Concrete schema to use for scanning.
columns : list of str, default None
    The columns to project. This can be a list of column names to
    include (order and duplicates will be preserved), or a dictionary
    with {new_column_name: expression} values for more advanced
    projections.

    The list of columns or expressions may use the special fields
    `__batch_index` (the index of the batch within the fragment),
    `__fragment_index` (the index of the fragment within the dataset),
    `__last_in_fragment` (whether the batch is last in fragment), and
    `__filename` (the name of the source file or a description of the
    source fragment).

    The columns will be passed down to Datasets and corresponding data
    fragments to avoid loading, copying, and deserializing columns
    that will not be required further down the compute chain.

    By default all of the available columns are projected. Raises
    an exception if any of the referenced column names does not exist
    in the dataset's Schema.
filter : Expression, default None
    Scan will return only the rows matching the filter.
    If possible the predicate will be pushed down to exploit the
    partition information or internal metadata found in the data
    source, e.g. Parquet statistics. Otherwise filters the loaded
    RecordBatches before yielding them.
batch_size : int, default 131_072
    The maximum row count for scanned record batches. If scanned
    record batches are overflowing memory then this method can be
    called to reduce their size.
batch_readahead : int, default 16
    The number of batches to read ahead in a file. This might not work
    for all file formats. Increasing this number will increase
    RAM usage but could also improve IO utilization.
fragment_readahead : int, default 4
    The number of files to read ahead. Increasing this number will increase
    RAM usage but could also improve IO utilization.
fragment_scan_options : FragmentScanOptions, default None
    Options specific to a particular scan and fragment type, which
    can change between different scans of the same dataset.
use_threads : bool, default True
    If enabled, then maximum parallelism will be used determined by
    the number of available CPU cores.
memory_pool : MemoryPool, default None
    For memory allocations, if required. If not specified, uses the
    default pool.

Returns
-------
record_batches : iterator of RecordBatch
```
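For reference, this is roughly how the method gets called in practice (a minimal sketch; the dataset path is hypothetical, and no keyword arguments are passed since which ones `LanceFragment` actually accepts is the subject of this report):

```python
import lance

dataset = lance.dataset("data.lance")  # hypothetical path
for fragment in dataset.get_fragments():
    # Iterate each fragment's data as pyarrow RecordBatches.
    for batch in fragment.to_batches():
        print(batch.num_rows)
```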
The actual function is different. I won't comment on all the differences, but note, for example, the lack of a `fragment_scan_options` argument.
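One quick way to confirm the mismatch locally (a sketch; this assumes `to_batches` is defined in Python rather than compiled, so `inspect.signature` can introspect it):

```python
import inspect
from lance.fragment import LanceFragment

# Compare the signature the runtime actually exposes against the docstring above.
print(inspect.signature(LanceFragment.to_batches))
print(LanceFragment.to_batches.__doc__)
```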