pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Bug on Write CSV (Rust) #15672

Open antonylebechec opened 3 months ago

antonylebechec commented 3 months ago

Reproducible example

import polars as pl

# `d` is a pyarrow RecordBatch and `f` an open text-mode file handle
# Polars write dataframe
pl.from_arrow(d).write_csv(
    file=f,
    separator="\t",
    include_header=False,
    quote_style="never",
)

Log output

thread '<unnamed>' panicked at crates/polars-arrow/src/compute/cast/utf8_to.rs:79:47:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel/bin/howard", line 33, in <module>
    sys.exit(load_entry_point('howard', 'console_scripts', 'howard')())
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/main.py", line 273, in main
    eval(f"{command_function}(args)")
  File "<string>", line 1, in <module>
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/tools/annotation.py", line 70, in annotation
    vcfdata_obj.export_output()
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/variants.py", line 2045, in export_output
    database.export(
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/database.py", line 2640, in export
    pl.from_arrow(d).write_csv(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel/lib/python3.10/site-packages/polars/convert.py", line 612, in from_arrow
    return pl.DataFrame._from_arrow(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel/lib/python3.10/site-packages/polars/dataframe/frame.py", line 591, in _from_arrow
    arrow_to_pydf(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel/lib/python3.10/site-packages/polars/utils/_construction.py", line 1604, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())

Issue description

This Python call usually works, but fails, probably depending on the input data (dataframe). Since the error gives no explanation, I'm not able to dig any deeper...

Expected behavior

A file written...

Installed versions

--------Version info---------
Polars:              0.20.8
Index type:          UInt32
Platform:            macOS-10.16-x86_64-i386-64bit
Python:              3.10.13 (main, Sep 11 2023, 08:39:02) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:         3.0.0
connectorx:
deltalake:
fsspec:              2024.3.0
gevent:
hvplot:
matplotlib:
numpy:               1.26.4
openpyxl:
pandas:              2.1.0
pyarrow:             13.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
owenprough-sift commented 3 months ago

Does the issue occur on the latest version of polars? It looks like you're using polars==0.20.8, but the latest version as of writing is polars==0.20.20. There have been a lot of changes in the past twelve minor versions...
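
One quick way to double-check which version the active environment actually picks up is polars' own version report:

import polars as pl

# prints the polars version plus detected optional dependencies
# for the environment the script is actually running in
pl.show_versions()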

antonylebechec commented 3 months ago

Hi, yes, I am using the latest version, polars==0.20.20. I had printed the versions from another conda environment. My bad. This is the log/error with the latest version:

thread 'polars-5' panicked at crates/polars-arrow/src/compute/cast/utf8_to.rs:112:14:
max string/binary length exceeded: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/bin/howard", line 8, in <module>
    sys.exit(main())
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/main.py", line 273, in main
    eval(f"{command_function}(args)")
  File "<string>", line 1, in <module>
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/tools/annotation.py", line 70, in annotation
    vcfdata_obj.export_output()
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/variants.py", line 2045, in export_output
    database.export(
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/database.py", line 2643, in export
    pl.from_arrow(d).write_csv(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/lib/python3.10/site-packages/polars/convert.py", line 434, in from_arrow
    arrow_to_pydf(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py", line 1121, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
pyo3_runtime.PanicException: max string/binary length exceeded: TryFromIntError(())

I should mention that I process huge files, so the issue is probably resource-related. In particular, when I split the data into smaller chunks the crash does not occur (so it's probably not a problem with the data content).

antonylebechec commented 3 months ago

Oh! I forgot to say thank you for developing polars!!! It's so great and useful! 👍

antonylebechec commented 3 months ago

If it can help, here is the schema of the dataframe:

d.schema=
#CHROM: string
POS: int32
ID: string
REF: string
ALT: string
QUAL: string
FILTER: string
INFO: string
cmdlineluser commented 3 months ago

It seems the error comes from pl.from_arrow, so write_csv is never reached.

https://github.com/pola-rs/polars/blob/8ef2e212cf52c318e18d12d38b02c3ad0a918ab2/crates/polars-arrow/src/compute/cast/utf8_to.rs#L112
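
A minimal way to confirm that, assuming `d` is one of the offending record batches, is to split the two calls (a sketch, not from the original report):

import polars as pl

# if the panic is raised here, write_csv below is never reached
df = pl.from_arrow(d)
df.write_csv("out.tsv", separator="\t", include_header=False, quote_style="never")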

(you may want to re-title the issue for better visibility)

It may need @ritchie46's attention.

ritchie46 commented 3 months ago

We have a maximum string length of 2^32 bytes. That is, a single string element can hold at most 4GB of data.

antonylebechec commented 3 months ago

Thanks for your reply. Well, the full pyarrow dataframe is possibly huge, but it never includes a 4GB string. Moreover, this full dataframe is chunked so it can be written in batches.

antonylebechec commented 3 months ago

I tried to write directly from pyarrow, and it results in a segmentation fault:

import pyarrow as pa
import pyarrow.csv  # makes pa.csv available

# `header`, `export_options`, `d` and `f` come from the surrounding project code
write_options = pa.csv.WriteOptions(
    include_header=header,
    delimiter=export_options.get("delimiter", ""),
    quoting_style="none",
)
pa.csv.write_csv(d, f, write_options=write_options)
ritchie46 commented 3 months ago

It seems they didn't expect such large strings either. :sweat_smile:

antonylebechec commented 3 months ago

So, the full pyarrow dataframe is around 72MB (1,000,000 entries), with an INFO column holding a big string for some entries. That seems far from the 4GB maximum. When I chunk it (100,000 entries), it works. That fixes my script, but it's really strange that it fails for a dataframe that isn't so big...

cmdlineluser commented 3 months ago

Can you provide a runnable example? (along with the chunked version that works)

antonylebechec commented 3 months ago

Unfortunately no. I reproduced the issue only with a huge database (a 150GB parquet file), which is not easily shareable. Moreover, the code is part of a complex project, and I don't know how to easily extract this part.

I can basically explain my code:

import duckdb
import polars as pl

# Create a pyarrow record batch reader with a query on a huge database
# (let's say 10,000,000 rows)
conn = duckdb.connect()
query = "SELECT * FROM read_parquet('huge.parquet')"
chunk_size = 1000000    # reduce/increase the chunk size to make it succeed or fail
df = conn.execute(query).fetch_record_batch(chunk_size)
# For each chunk (pyarrow RecordBatch)
for d in df:
    # Open the output file in append mode
    with open("my_file.tsv", mode="a") as f:
        # Polars write dataframe
        pl.from_arrow(d).write_csv(
            file=f,
            separator="\t",
            include_header=False,
            quote_style="never",
        )
ritchie46 commented 3 months ago

> So, the full pyarrow dataframe is around 72MB (1,000,000 entries), with an INFO column holding a big string for some entries. That seems far from the 4GB maximum.

Are you certain of that? It triggers a panic in Polars that only occurs on strings with a length of 2^32 bytes. The fact that pyarrow also segfaults seems suspicious to me.
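
One way to check, assuming `d` is one of the record batches from the snippet above, would be to sum the UTF-8 byte lengths per string column (a sketch, not part of the original report):

import pyarrow as pa
import pyarrow.compute as pc

# total UTF-8 bytes held by each string column of the record batch `d`
for name, col in zip(d.schema.names, d.columns):
    if pa.types.is_string(col.type):
        total = pc.sum(pc.binary_length(col)).as_py() or 0
        print(f"{name}: {total} bytes{' (over 2^31)' if total >= 2**31 else ''}")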

antonylebechec commented 3 months ago

I mean 72MB max for a value (row/column) in the full pyarrow dataframe. I generated a very big string by concatenating multiple columns and repeating strings, finally obtaining a 72MB string. I'm not sure that a column (especially the big INFO string column) stays below 4GB across all rows (1,000,000), though. Is that what you mean by 2^32? Is it the length of a column, a row, or a single value? Basically, if I chunk (100,000 instead of 1,000,000), it works. So I guess 2^32 applies to a column. Am I right?

cjackal commented 3 months ago

> I mean 72MB max for a value (row/column) in the full pyarrow dataframe. I generated a very big string by concatenating multiple columns and repeating strings, finally obtaining a 72MB string. I'm not sure that a column (especially the big INFO string column) stays below 4GB across all rows (1,000,000), though. Is that what you mean by 2^32? Is it the length of a column, a row, or a single value? Basically, if I chunk (100,000 instead of 1,000,000), it works. So I guess 2^32 applies to a column. Am I right?

I think what you got is right. The Arrow format supports two string dtypes, string and large_string, distinguished by the size of the index/offset (int32 for string, int64 for large_string). In the schema you showed in a previous comment, the data source uses the string dtype, so a column cannot hold more than 2^31 bytes (2GB) of data. It also means you could first cast the string columns (especially the big INFO column) to large_string and then consume them with pl.from_arrow.
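
A sketch of that idea, assuming `d` is a single pyarrow RecordBatch as in the earlier snippet, could look like this:

import pyarrow as pa
import polars as pl

# promote every string column to large_string (64-bit offsets) before handing
# the data to polars; works even when the column names are not known upfront
fields = [
    pa.field(f.name, pa.large_string()) if pa.types.is_string(f.type) else f
    for f in d.schema
]
tbl = pa.Table.from_batches([d]).cast(pa.schema(fields))
df = pl.from_arrow(tbl)

(The cast only widens the offsets; the string data itself is not re-generated, so this should be far cheaper than re-reading the parquet file.)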

antonylebechec commented 3 months ago

Thanks @cjackal! I'll try to cast, or change my schema, before processing it (d['INFO'].cast(pa.large_string())?). It could take a while... However, I do not know my schema before generating the data. It depends on the input data; the columns fluctuate...