Open ghuls opened 2 years ago
I have double-checked whether we accidentally create a memory buffer in Python, but that is not the case: we open a Rust file handle and dispatch to arrow2.
I've skimmed a bit through the source of the IPC writing and I can tell we write the data to an in-memory buffer. I think we must expose a chunk_size
argument on the IPC writer so that we can control how much memory is used before it is written out.
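The bounded-memory idea above can be sketched roughly as follows. This is not arrow2's actual API; `write_in_chunks` and its `chunk_size` parameter are hypothetical names used to illustrate flushing to the sink once a small buffer fills, so peak memory is bounded by the chunk size rather than the total file size:

```python
import io


def write_in_chunks(pieces, sink, chunk_size=64 * 1024):
    """Hypothetical sketch: accumulate at most ~chunk_size bytes in a
    small buffer, then flush to `sink`, instead of materializing the
    whole file in memory before the first write."""
    buf = io.BytesIO()
    for piece in pieces:
        buf.write(piece)
        if buf.tell() >= chunk_size:
            sink.write(buf.getvalue())
            buf.seek(0)
            buf.truncate()
    if buf.tell():  # flush the remainder
        sink.write(buf.getvalue())
```

With a scheme like this, writing a 135 GB dataframe would only ever hold about one chunk in memory at a time.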
When writing with pyarrow I can see the memory usage going up and down by roughly 1 GB (or in rare cases 2 GB) while it is writing to a Feather file.
Yeap, arrow2 currently has an intermediary write buffer. I have been trying to fix it but haven't been able to yet. Good to know that pyarrow does not use one - it means that it is possible :p
@jorgecarleitao it might have a "small" intermediate buffer of 1 GB (or at least I see allocations and deallocations of 1 GB). Also, arrow2 generates compressed IPC files that pyarrow can't read (arrow2 itself can read them fine) (end of first post).
I am trying to write Feather files from a dataframe that is 135 GB, so it would be nice if writing it to a file did not require another 135 GB.
Yeap, I am also investigating that one. It seems that pyarrow has more requirements than simply "zstd" or "lz4" encoding, but because the arrow project has no integration tests on these, we can't prove a roundtrip.
I am working on the apache/arrow directly to try to improve this situation.
I agree that we should not require an extra buffer here.
For LZ4 the go implementation has this comment: https://github.com/apache/arrow/blob/bcf3d3e5a2ae5e70034b104ce69f774b78bbb4de/go/arrow/ipc/compression.go#L65-L80
arrow-rs hit the same "Lz4 compressed input contains more than one frame" problem: https://github.com/apache/arrow/pull/9137
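For context on the linked comment and PR: the Arrow IPC format prefixes each compressed buffer with an 8-byte little-endian signed integer holding the uncompressed length (with -1 meaning the body is stored uncompressed), and for LZ4 the body must be LZ4 frame format containing a single frame. A minimal sketch of that prefix layout (helper names are illustrative, not a real library API):

```python
import struct


def wrap_compressed(uncompressed_len, body):
    """Prefix `body` with the 8-byte little-endian uncompressed length,
    as the Arrow IPC BodyCompression layout prescribes. Pass -1 as the
    length to signal that `body` is not compressed."""
    return struct.pack("<q", uncompressed_len) + body


def read_prefix(buf):
    """Split an Arrow-style compressed buffer into (uncompressed_len,
    body). A length of -1 means the body is raw, uncompressed bytes."""
    (n,) = struct.unpack_from("<q", buf, 0)
    return n, buf[8:]
```

A reader such as pyarrow that decompresses the body as exactly one LZ4 frame will reject a writer that emitted multiple concatenated frames, which matches the "more than one frame" error arrow-rs hit.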
I finally found the root cause! Fixed in https://github.com/jorgecarleitao/arrow2/pull/840
@jorgecarleitao thanks a lot for all the bugfixes lately.
But it looks like it still isn't fixed completely (file created with polars .to_ipc(..., compression="lz4")):
In [5]: import pyarrow.feather as pf
In [6]: %time a = pf.read_table('tests.v2_lz4.feather')
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<timed exec> in <module>
/staging/leuven/stg_00002/lcb/ghuls/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
246
247 if columns is None:
--> 248 return reader.read()
249
250 column_types = [type(column) for column in columns]
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.read()
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Buffer 6 did not start on 8-byte aligned offset: 3963187
In [7]: %time df.to_ipc('test.v2_zstd.feather', 'zstd')
CPU times: user 1min 54s, sys: 16.2 s, total: 2min 10s
Wall time: 3min
In [8]: %time a = pf.read_table('test.v2_zstd.feather')
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<timed exec> in <module>
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
246
247 if columns is None:
--> 248 return reader.read()
249
250 column_types = [type(column) for column in columns]
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.read()
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Buffer 4 did not start on 8-byte aligned offset: 513398
In [9]: pl.__version__
Out[9]: '0.13.2'
Do you have a minimal example? Asking because as far as I understand, pyarrow is writing unaligned offsets, but apparently it can still read them. Thus, I am misunderstanding the Arrow spec here.
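For context on the alignment errors above: the Arrow columnar format requires each buffer in the message body to start on an 8-byte (ideally 64-byte) aligned offset, so writers pad between buffers. The offsets in the errors (e.g. 3963187) are indeed not multiples of 8. A tiny sketch of the padding computation (the function name is illustrative):

```python
def pad_to_8(offset: int) -> int:
    """Bytes of zero padding needed so the next buffer starts on an
    8-byte boundary (0 if `offset` is already aligned)."""
    return (-offset) % 8
```

A writer would emit `pad_to_8(current_offset)` zero bytes after each buffer before recording the next buffer's offset.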
At the moment I don't have a minimal example (file is 32GB) but I can try to reproduce it with a smaller file.
Small Feather files created by arrow2 (also an uncompressed one, in case you want to generate a compressed version with pyarrow): test.feather_v2.zip
In [19]: df_head100.to_ipc("test.lz4_v2.feather", compression="lz4")
In [20]: df_head100.to_ipc("test.zstd_v2.feather", compression="zstd")
In [21]: a = pf.read_table("test.lz4_v2.feather")
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-21-ebc079c6397a> in <module>
----> 1 a = pf.read_table("test.lz4_v2.feather")
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
246
247 if columns is None:
--> 248 return reader.read()
249
250 column_types = [type(column) for column in columns]
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.read()
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Buffer 4 did not start on 8-byte aligned offset: 242
In [22]: a = pf.read_table("test.zstd_v2.feather")
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-22-456a8d615b60> in <module>
----> 1 a = pf.read_table("test.zstd_v2.feather")
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
246
247 if columns is None:
--> 248 return reader.read()
249
250 column_types = [type(column) for column in columns]
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.read()
/software/miniconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Buffer 4 did not start on 8-byte aligned offset: 213
In [23]: df_head100.to_ipc("test.uncompressed_v2.feather", compression="uncompressed")
@ritchie46 Could you update arrow2 when you make a new release?
Closing this, as https://github.com/jorgecarleitao/arrow2/pull/840 fixes the issue. The fix is available in arrow2 release v0.10.0 (see https://github.com/jorgecarleitao/arrow2/releases), which polars in turn has incorporated in release 0.20.0 (https://github.com/pola-rs/polars/pull/2888).
@zundertj The original issue (writing the whole file to memory first) is not fixed.
My apologies, I thought it was fixed given this conversation and the releases.
Filed upstream: https://github.com/jorgecarleitao/arrow2/issues/928
Are you using Python or Rust?
Python.
What version of polars are you using?
0.13.0
What operating system are you using polars on?
CentOS 7
Describe your bug.
Writing to an IPC file first seems to write to an intermediate in-memory buffer.
What are the steps to reproduce the behavior?
@jorgecarleitao It seems that arrow2 creates Feather files that pyarrow (I used pyarrow 7.0.0) cannot read (if compressed with lz4 or zstd):