
Python: Writing more than 2^31 rows from pandas dataframe causes row count overflow error #1392

Closed by toddfarmer 7 years ago

toddfarmer commented 7 years ago

Note: This issue was originally created as ARROW-1446. Please see the migration documentation for further details.

Original Issue Description:

I have the following code:

import pyarrow
import pyarrow.parquet as pq

# Connect to HDFS and read the source Parquet file into an Arrow table.
client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')
abc_table = client.read_parquet('<source parquet>', nthreads=16)

# Round-trip the table through pandas, then write it back out to HDFS.
abc_df = abc_table.to_pandas()
abc_table = pyarrow.Table.from_pandas(abc_df)
with client.open('<target parquet>', 'wb') as f:
    pq.write_table(abc_table, f)

The source Parquet file contains 2497301128 rows (more than 2^31).

During the write, however, I get the following error:

Traceback (most recent call last):
  File "pyarrow_cluster.py", line 29, in <module>
    main()
  File "pyarrow_cluster.py", line 26, in main
    pq.write_table(nmi_table, f)
  File "/miniconda2/envs/parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 796, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "_parquet.pyx", line 663, in pyarrow._parquet.ParquetWriter.write_table
  File "error.pxi", line 72, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Written rows: -1797666168 != expected rows: 2497301128 in the current column chunk

The reported written-row count suggests a signed 32-bit integer has overflowed.
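
As a quick sanity check (not part of the original report): wrapping the expected row count into a signed 32-bit integer reproduces the negative value in the error message exactly, since 2497301128 - 2^32 = -1797666168.

expected = 2497301128
# Simulate two's-complement wraparound of the count into a signed 32-bit int.
wrapped = (expected + 2**31) % 2**32 - 2**31
print(wrapped)  # -1797666168, matching "Written rows" in the error above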

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): Taking a look. [~xhochy], this could be a parquet-cpp bug, so we should try to resolve it before 1.3.0.

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): PR: https://github.com/apache/arrow/pull/1055. The underlying cause is fixed by PARQUET-1090.
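
For anyone stuck on an affected release, one plausible mitigation (an untested sketch, not suggested in the thread) is to cap the row-group size so that no single column chunk comes anywhere near 2^31 rows. The row_group_size argument is the same parameter visible in the traceback above; the 100-million-row figure below is an arbitrary illustrative choice.

import pyarrow.parquet as pq

# Sketch of a workaround for affected versions: bound rows per row group so
# per-chunk row counters stay well below the signed 32-bit limit.
# abc_table and f are the table and HDFS file handle from the snippet above;
# 100 million rows per group is an illustrative value, not from the report.
pq.write_table(abc_table, f, row_group_size=100 * 1000 * 1000)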

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): Issue resolved by pull request 1055: https://github.com/apache/arrow/pull/1055