pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PyDeltaTableError: Generic S3 error: Error performing get request #8008

Closed · shazamkash closed this issue 3 weeks ago

shazamkash commented 1 year ago

Polars version checks

Issue description

I get the error shown below when I try to use Polars to read data from Delta Lake. My Delta Lake storage is non-AWS (Ceph based).

The Parquet file is about 1 GB compressed and 3 GB uncompressed. Furthermore, the table was written to Delta Lake using the delta-rs Python binding.

Environment:

- Delta-rs version: 0.8.1
- Binding: Python
- Docker container: Python 3.10.10
- OS: Debian GNU/Linux 11 (bullseye)
- S3: Non-AWS (Ceph based)

```
---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 pl_data = pl.read_delta(source=table_uri, storage_options=storage_options)
      2 print(pl_data)

File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:136, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    134 if len(args) > num_allowed_args:
    135     warnings.warn(msg, DeprecationWarning, stacklevel=stacklevel)
--> 136 return function(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:37, in deprecated_alias.<locals>.deco.<locals>.wrapper(*args, **kwargs)
     34 @functools.wraps(function)
     35 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     36     _rename_kwargs(function.__name__, kwargs, aliases, stacklevel=stacklevel)
---> 37     return function(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/polars/io/delta.py:141, in read_delta(source, version, columns, storage_options, delta_table_options, pyarrow_options)
    132 resolved_uri = _resolve_delta_lake_uri(source)
    134 dl_tbl = _get_delta_lake_table(
    135     table_path=resolved_uri,
    136     version=version,
    137     storage_options=storage_options,
    138     delta_table_options=delta_table_options,
    139 )
--> 141 return from_arrow(dl_tbl.to_pyarrow_table(columns=columns, **pyarrow_options))

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

Reading the table works fine when the data is small, for example a few tens of MB; the problem only seems to occur with larger data. I get the same error when reading the data directly with delta-rs's to_pandas() and to_pyarrow_dataset() functions.
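Since the failure also reproduces through delta-rs directly, a minimal check that bypasses Polars entirely can confirm the 429 originates in the deltalake/object_store request layer rather than in Polars itself. A sketch, assuming the same `table_path` and `storage_options` placeholders as in the reproducible example below:

```python
from deltalake import DeltaTable

# If this fails with the same "Generic S3 error ... 429 Too Many Requests",
# the throttling happens below Polars, in the deltalake/object_store layer.
dt = DeltaTable(table_path, storage_options=storage_options)
pdf = dt.to_pandas()  # same underlying scan path as pl.read_delta
```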

I have opened the same issue on delta-rs, but no help so far: https://github.com/delta-io/delta-rs/issues/1256

Reproducible example

```python
import polars as pl

# `credentials` is provided by the surrounding environment (a Ceph/S3
# credentials object); the endpoint and bucket names are redacted.
storage_options = {
    "AWS_ACCESS_KEY_ID": f"{credentials.access_key}",
    "AWS_SECRET_ACCESS_KEY": f"{credentials.secret_key}",
    "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

table_path = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
pl_data = pl.read_delta(source=table_path, storage_options=storage_options)
```
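One possible mitigation, given that the 429 is a server-side rate limit: build the PyArrow dataset yourself and scan it with reduced readahead and a single thread, so fewer concurrent range GETs hit the Ceph gateway at once. This is only a sketch against the pyarrow 11 / deltalake 0.8 APIs, not a confirmed fix; `batch_readahead` and `fragment_readahead` require a reasonably recent pyarrow:

```python
import polars as pl
from deltalake import DeltaTable

dt = DeltaTable(table_path, storage_options=storage_options)
dataset = dt.to_pyarrow_dataset()

# Throttle the scan: fewer fragments and record batches in flight means
# fewer simultaneous GET requests against the S3-compatible endpoint.
scanner = dataset.scanner(batch_readahead=1, fragment_readahead=1, use_threads=False)
pl_data = pl.from_arrow(scanner.to_table())
```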

Expected behavior

I expect the data to be read from the Delta Lake table into a DataFrame. I am able to read the same data with PySpark, which confirms that nothing is wrong with my delta table.
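For reference, the PySpark check mentioned above looks roughly like this (assuming a Spark session already configured with delta-spark and the same S3A endpoint and credentials):

```python
# Sanity check: Spark's Delta reader handles the same table without a 429,
# which points at request throttling rather than a corrupt table.
df = spark.read.format("delta").load("s3a://delta-lake-bronze/xxx/yyy/data_3_gb")
print(df.count())
```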

Installed versions

```
---Version info---
Polars: 0.16.18
Index type: UInt32
Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.35
Python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
---Optional dependencies---
numpy: 1.23.5
pandas: 2.0.0
pyarrow: 11.0.0
connectorx:
deltalake: 0.8.1
fsspec: 2023.3.0
matplotlib: 3.7.1
xlsx2csv:
xlsxwriter:
```
chitralverma commented 1 year ago

This needs to be tracked on the delta-rs side. Thanks for raising the ticket.

ion-elgreco commented 3 weeks ago

@stinodego you can close this one : ) It's not relevant anymore