antonylebechec opened this issue 3 months ago
Does the issue occur on the latest version of Polars? It looks like you're using `polars==0.20.8`, but the latest version as of writing is `polars==0.20.20`. There have been a lot of changes in the past twelve releases...
Hi,
Yes, I am using the latest version, `polars==0.20.20`. I had printed the versions from another conda environment. My bad.
This is now the log/error with the latest version:
```
thread 'polars-5' panicked at crates/polars-arrow/src/compute/cast/utf8_to.rs:112:14:
max string/binary length exceeded: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/bin/howard", line 8, in <module>
    sys.exit(main())
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/main.py", line 273, in main
    eval(f"{command_function}(args)")
  File "<string>", line 1, in <module>
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/tools/annotation.py", line 70, in annotation
    vcfdata_obj.export_output()
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/variants.py", line 2045, in export_output
    database.export(
  File "/Users/lebechea/BIOINFO/git/HOWARD/howard/objects/database.py", line 2643, in export
    pl.from_arrow(d).write_csv(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/lib/python3.10/site-packages/polars/convert.py", line 434, in from_arrow
    arrow_to_pydf(
  File "/usr/local/Caskroom/miniconda/base/envs/howard_devel_polarsupdate/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py", line 1121, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
pyo3_runtime.PanicException: max string/binary length exceeded: TryFromIntError(())
```
I should mention that I process huge files, so the issue is probably related to resources. In particular, when I split the data, the crash does not occur (so it's probably not a problem with the data content).
Oh! I forgot to say thank you for developing polars!!! It's so great and useful! 👍
If it can help, here is the schema of the dataframe:
```
d.schema=
#CHROM: string
POS: int32
ID: string
REF: string
ALT: string
QUAL: string
FILTER: string
INFO: string
```
It seems that the error comes from `pl.from_arrow`, so `write_csv` is never reached.
(you may want to re-title the issue for better visibility)
It may need @ritchie46's attention.
We have a maximum string length of 2^32 bytes. That is, a single string element can hold a maximum of 4GB of data.
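A quick way to check whether any single value really approaches that limit is to measure the longest string per column on the Arrow side. A minimal sketch (assuming `d` is the pyarrow table from the traceback above and a reasonably recent pyarrow; the helper name is made up):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Print the longest string (in bytes) of every utf8 column, to see whether any
# single element comes anywhere near the 2^32-byte per-element limit.
def longest_string_per_column(data: pa.Table) -> None:
    for name in data.schema.names:
        col = data.column(name)
        if pa.types.is_string(col.type) or pa.types.is_large_string(col.type):
            print(name, pc.max(pc.binary_length(col)).as_py(), "bytes")

# longest_string_per_column(d)
```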
Thanks for your reply. Well, the full pyarrow dataframe is possibly huge, but it never includes a 4GB string. Moreover, this full dataframe is chunked so it can be written in batches.
I tried to write directly from pyarrow, and it returns a segmentation fault:
```python
import pyarrow as pa
import pyarrow.csv

# `header`, `export_options`, `d` and `f` come from the surrounding project code.
write_options = pa.csv.WriteOptions(
    include_header=header,
    delimiter=export_options.get("delimiter", ""),
    quoting_style="none",
)
pa.csv.write_csv(d, f, write_options=write_options)
```
It seems they didn't expect such large strings either. :sweat_smile:
So, the full pyarrow dataframe is around 72MB (1,000,000 entries), with an INFO column holding a big string for some entries. But that seems far from the 4GB maximum. I chunked it (100,000 entries) and it works. That unblocks my script, but it's really strange that it fails for a dataframe that is not so big...
Can you provide a runnable example? (along with the chunked version that works)
Unfortunately no. I reproduced the issue only with a huge database (a 150GB parquet file), which is not easily sharable. Moreover, the code is part of a complex project, and I don't know how to easily extract this part.
I can basically explain my code:
```python
import duckdb
import polars as pl

# Create a pyarrow record batch reader from a query on a huge database (let's say 10,000,000 rows).
conn = duckdb.connect()
query = "SELECT * FROM read_parquet('huge.parquet')"
chunk_size = 1000000  # reduce/increase the chunk size to make it succeed or fail
df = conn.execute(query).fetch_record_batch(chunk_size)

# For each chunk
for d in df:
    # Open the output file in append mode
    with open("my_file.tsv", mode="a") as f:
        # Write the chunk with Polars
        pl.from_arrow(d).write_csv(
            file=f,
            separator="\t",
            include_header=False,
            quote_style="never",
        )
```
> So, the full pyarrow dataframe is around 72MB (1,000,000 entries), with an INFO column holding a big string for some entries. But that seems far from the 4GB maximum.

Are you certain of that? It triggers a panic in Polars that only occurs on strings with a length of 2^32 bytes. The fact that pyarrow segfaults also seems suspicious to me.
I mean 72MB max for a single value (row/column) in the full pyarrow dataframe. I generate a very big string by concatenating multiple columns and repeating strings, finally obtaining a 72MB string.
I'm not sure that a column (especially the big INFO string column) stays below 4GB across all rows (1,000,000). Is that what you mean by 2^32? Is it the length of a column, a row, or a single value?
Basically, if I chunk (100,000 instead of 1,000,000), it works. So I guess 2^32 applies to a column. Am I right?
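One way to settle that empirically would be to sum the byte lengths of the INFO column for one failing chunk and compare the total against both limits. A rough sketch (assuming `d` is one 1,000,000-row chunk and that column lookup by name works on it):

```python
import pyarrow.compute as pc

# Total bytes held by the INFO column of one chunk, compared with 2^31 and 2^32.
# If the total (rather than any single value) is what crosses a limit, the cap
# is effectively per column, not per value.
total_bytes = pc.sum(pc.binary_length(d.column("INFO"))).as_py()
print(total_bytes, total_bytes > 2**31, total_bytes > 2**32)
```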
> I mean 72MB max for a single value (row/column) in the full pyarrow dataframe. I generate a very big string by concatenating multiple columns and repeating strings, finally obtaining a 72MB string. I'm not sure that a column (especially the big INFO string column) stays below 4GB across all rows (1,000,000). Is that what you mean by 2^32? Is it the length of a column, a row, or a single value? Basically, if I chunk (100,000 instead of 1,000,000), it works. So I guess 2^32 applies to a column. Am I right?
I think what you got is right. The Arrow format supports two string dtypes, `string` and `large_string`, distinguished by the size of the index/offset (int32 for `string`, int64 for `large_string`). In the schema that you showed in a previous comment, the data source uses the `string` dtype, so a column cannot hold data larger than 2^31 bytes (2GB). It also means you may first cast the string columns (especially the big `INFO` column) to `large_string` and then consume them with `pl.from_arrow`.
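For reference, a minimal sketch of that suggestion (assuming `d` is a pyarrow Table with the schema shown earlier; with a RecordBatch you may need to convert it to a Table first):

```python
import pyarrow as pa
import polars as pl

# Cast the INFO column from string (int32 offsets) to large_string (int64 offsets)
# before handing the data to Polars, so the 2^31-byte per-column limit no longer applies.
idx = d.schema.get_field_index("INFO")
d = d.set_column(idx, "INFO", d.column("INFO").cast(pa.large_string()))

df = pl.from_arrow(d)
```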
Thanks @cjackal!
I'll try to cast, or change my schema, before processing it (`d['INFO'].cast(pa.large_string())`?). It could take a while...
However, I do not know my schema before generating the data. It depends on the input data; the columns fluctuate...
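Since the columns vary, one option is to derive the target schema from each batch at runtime and upgrade every string field to `large_string` before handing it to Polars. A sketch only, not tested against the real 150GB file, reusing the loop from the earlier comment (`upcast_strings` is a made-up helper):

```python
import duckdb
import pyarrow as pa
import polars as pl

def upcast_strings(batch: pa.RecordBatch) -> pa.Table:
    # Build a target schema where every string field becomes large_string,
    # leaving all other fields untouched, then cast the batch to it.
    fields = [
        pa.field(f.name, pa.large_string()) if pa.types.is_string(f.type) else f
        for f in batch.schema
    ]
    return pa.Table.from_batches([batch]).cast(pa.schema(fields))

conn = duckdb.connect()
query = "SELECT * FROM read_parquet('huge.parquet')"
reader = conn.execute(query).fetch_record_batch(1000000)

# The file is opened once here rather than per chunk; with mode="a" the result is the same.
with open("my_file.tsv", mode="a") as f:
    for batch in reader:
        pl.from_arrow(upcast_strings(batch)).write_csv(
            file=f,
            separator="\t",
            include_header=False,
            quote_style="never",
        )
```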
Checks
Reproducible example
Log output
Issue description
This Python line usually works, but fails, probably depending on the input data (dataframe). As no explanation is provided with this error, I'm not able to dig deeper...
Expected behavior
A file written...
Installed versions