Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
I get the error shown below when I try to use Polars to read data from Delta Lake. My Delta Lake storage is non-AWS (Ceph-based) S3.
The Parquet file is about 1 GB compressed (3 GB uncompressed). The table was written to Delta Lake using the delta-rs Python binding.
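For context, the table was written along these lines (a minimal sketch, not my exact code; the credential values are placeholders, and the storage_options keys assume delta-rs's AWS-style configuration):

```python
# Sketch of how the table was written with delta-rs.
# Credential values are placeholders, not real secrets.
storage_options = {
    "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",  # Ceph S3 endpoint
    "AWS_ACCESS_KEY_ID": "<access-key>",
    "AWS_SECRET_ACCESS_KEY": "<secret-key>",
}

def write_table(df, table_uri: str) -> None:
    """Write a DataFrame to the Delta table (requires the deltalake package)."""
    from deltalake import write_deltalake  # delta-rs Python binding
    write_deltalake(table_uri, df, storage_options=storage_options)
```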
---------------------------------------------------------------------------
PyDeltaTableError Traceback (most recent call last)
Cell In[6], line 1
----> 1 pl_data = pl.read_delta(source=table_uri, storage_options=storage_options)
2 print(pl_data)
File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:136, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
134 if len(args) > num_allowed_args:
135 warnings.warn(msg, DeprecationWarning, stacklevel=stacklevel)
--> 136 return function(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:37, in deprecated_alias.<locals>.deco.<locals>.wrapper(*args, **kwargs)
34 @functools.wraps(function)
35 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
36 _rename_kwargs(function.__name__, kwargs, aliases, stacklevel=stacklevel)
---> 37 return function(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/polars/io/delta.py:141, in read_delta(source, version, columns, storage_options, delta_table_options, pyarrow_options)
132 resolved_uri = _resolve_delta_lake_uri(source)
134 dl_tbl = _get_delta_lake_table(
135 table_path=resolved_uri,
136 version=version,
137 storage_options=storage_options,
138 delta_table_options=delta_table_options,
139 )
--> 141 return from_arrow(dl_tbl.to_pyarrow_table(columns=columns, **pyarrow_options))
File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
386 def to_pyarrow_table(
387 self,
388 partitions: Optional[List[Tuple[str, str, Any]]] = None,
389 columns: Optional[List[str]] = None,
390 filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
391 ) -> pyarrow.Table:
392 """
393 Build a PyArrow Table using data from the DeltaTable.
394
(...)
398 :return: the PyArrow table
399 """
--> 400 return self.to_pyarrow_dataset(
401 partitions=partitions, filesystem=filesystem
402 ).to_table(columns=columns)
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)
Reading the table works fine when the data is small, for example a few tens of MB; the problem only seems to occur for large data. I get the same error when reading the data directly with delta-rs's to_pandas() and to_pyarrow_dataset() functions.
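The Polars-free path that hits the same 429 looks like this (a sketch; table_uri and storage_options stand in for my real table path and Ceph S3 credentials):

```python
def read_with_delta_rs(table_uri: str, storage_options: dict):
    """Reproduce the failure without Polars (requires the deltalake package)."""
    from deltalake import DeltaTable  # delta-rs Python binding

    dt = DeltaTable(table_uri, storage_options=storage_options)

    # Materializing the data fails with the same PyDeltaTableError
    # (Generic S3 error ... 429 Too Many Requests) for the ~1 GB file:
    table = dt.to_pyarrow_dataset().to_table()
    df = dt.to_pandas()
    return df, table
```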
Environment:
Delta-rs version: 0.8.1
Binding: Python
Docker container Python: 3.10.10
OS: Debian GNU/Linux 11 (bullseye)
S3: Non-AWS (Ceph based)
I have opened the same issue on delta-rs, but no help so far: https://github.com/delta-io/delta-rs/issues/1256
Reproducible example
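A minimal reproduction, as a sketch (the URI matches the path in the traceback; the credential values are placeholders for my Ceph S3 setup):

```python
def reproduce():
    """Trigger the 429 error (requires polars and deltalake)."""
    import polars as pl

    table_uri = "s3://delta-lake-bronze/xxx/yyy/data_3_gb"  # ~1 GB Parquet file
    storage_options = {
        "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",  # Ceph S3 endpoint
        "AWS_ACCESS_KEY_ID": "<access-key>",
        "AWS_SECRET_ACCESS_KEY": "<secret-key>",
    }

    # Raises PyDeltaTableError: Generic S3 error ... 429 Too Many Requests
    pl_data = pl.read_delta(source=table_uri, storage_options=storage_options)
    print(pl_data)
```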
Expected behavior
I expect the data to be read from Delta Lake into a DataFrame. I am able to read the same data with PySpark, which confirms nothing is wrong with my Delta table.
Installed versions