pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_avro - exceptions ComputeError: unable to parse time zone: '00:00'. Please check the Time Zone Database #13032

Closed match-gabeflores closed 11 months ago

match-gabeflores commented 11 months ago

Reproducible example

See attached file - rename .txt to .avro.

out.avro.txt

Log output

Traceback (most recent call last):
  File "C:\Users\gabriel.flores\Documents\Match\files\airflow python gabe_scripts\polars poc.py", line 76, in <module>
    df = pl.read_avro(file)
  File "C:\Users\gabriel.flores\Documents\pycharm_venv\venv1\lib\site-packages\polars\io\avro.py", line 40, in read_avro
    return pl.DataFrame._read_avro(source, n_rows=n_rows, columns=columns)
  File "C:\Users\gabriel.flores\Documents\pycharm_venv\venv1\lib\site-packages\polars\dataframe\frame.py", line 902, in _read_avro
    self._df = PyDataFrame.read_avro(source, columns, projection, n_rows)
polars.exceptions.ComputeError: unable to parse time zone: '00:00'. Please check the Time Zone Database for a list of available time zones

Issue description

Similar to #9586, when reading an Avro file that includes time zone information, I get this error:

polars.exceptions.ComputeError: unable to parse time zone: '00:00'. Please check the Time Zone Database for a list of available time zones

The Avro file was generated by Databricks, so it should be valid. I also have no problems reading it via fastavro.

No problems reading via:

Expected behavior

Polars should be able to read an Avro file containing offset information.
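To illustrate what "offset information" means here: the string `'00:00'` in the error is a fixed UTC offset rather than a Time Zone Database name such as `Europe/London`. The following is only a minimal sketch using the Python standard library, not how Polars handles or fixed this; `parse_fixed_offset` is a hypothetical helper, and treating an unsigned offset as positive is an assumption.

```python
import re
from datetime import timedelta, timezone

# Hypothetical helper, not part of Polars. Matches fixed offsets such as
# "+00:00", "-05:30", or an unsigned "00:00".
_OFFSET_RE = re.compile(r"^([+-]?)(\d{2}):(\d{2})$")


def parse_fixed_offset(tz: str) -> timezone:
    """Convert a fixed-offset string into a datetime.timezone."""
    match = _OFFSET_RE.match(tz)
    if match is None:
        raise ValueError(f"not a fixed UTC offset: {tz!r}")
    sign, hours, minutes = match.groups()
    if int(minutes) > 59:
        raise ValueError(f"minutes out of range in offset: {tz!r}")
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    # Assumption: a missing sign (as in '00:00') is treated as positive.
    return timezone(-delta if sign == "-" else delta)
```

With this sketch, `parse_fixed_offset("00:00")` yields a zero-offset time zone that compares equal to `timezone.utc`, while a name like `"Europe/London"` raises `ValueError` because it is not an offset.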

Installed versions

```
--------Version info---------
Polars: 0.19.19
Platform: Windows-10-10.0.19045-SP0
Python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.2.1
connectorx:
deltalake:
fsspec: 2023.9.2
gevent:
matplotlib:
numpy: 1.24.2
openpyxl:
pandas: 2.1.4
pyarrow: 13.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy: 2.0.23
xlsx2csv:
xlsxwriter:
```

MarcoGorelli commented 11 months ago

hi @match-gabeflores - how did you generate the file? is '00:00' part of the dtype?

match-gabeflores commented 11 months ago

Hi @MarcoGorelli, it's generated via DataBricks by a third party. I can ask them.

Also, I updated my issue above with an example file and added more notes.

Note: this example file has only one record. It was generated from the Databricks file by filtering for one row and removing PII data.

match-gabeflores commented 11 months ago

Thanks @MarcoGorelli for the quick fix!

Do you know when this fix will be in a stable release?

MarcoGorelli commented 11 months ago

I don't know, but I'd guess this or next week

thanks for your excellent report, this is how software gets better 🙌