vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Exporting to arrow seems to create corrupted or invalid output #2228

Closed · alvations closed this 1 year ago

alvations commented 1 year ago

With these vaex and pyarrow versions:

>>> vaex.__version__
{'vaex': '4.12.0',
 'vaex-core': '4.12.0',
 'vaex-viz': '0.5.3',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.18.0'}

>>> pyarrow.__version__
8.0.0

When reading a TSV file and exporting it to Arrow, the resulting file couldn't be loaded by pyarrow.parquet.read_table(). For example, given a file s2t.tsv built like this:

$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv

Exporting the TSV to Arrow as follows, then reading it back:

import vaex
import pyarrow as pa
import pyarrow.parquet  # needed so that pa.parquet below resolves

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')

pa.parquet.read_table('s2t.parquet')

It throws the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2 
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338 
   2339             self._dataset = ds.FileSystemDataset(
-> 2340                 [fragment], schema=schema or fragment.physical_schema,
   2341                 format=parquet_format,
   2342                 filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
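
For reference, Parquet files start and end with the 4-byte magic b'PAR1', which is what the reader is complaining about. A minimal diagnostic sketch (reusing the s2t.parquet file written above) to peek at what was actually written:

with open('s2t.parquet', 'rb') as f:
    head = f.read(8)

# A real Parquet file always starts (and ends) with b'PAR1'; this one
# does not, which is exactly what the ArrowInvalid error reports.
print(head[:4] == b'PAR1')  # False
print(head)                 # bytes of an Arrow IPC file, not Parquet magic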

Are there additional args/kwargs that should be passed when exporting or reading the parquet files?

Or is exporting to Arrow bugged/broken somehow?

JovanVeljanoski commented 1 year ago

You are using the wrong method.

Basically you need to:

df.export_parquet("file.parquet")

# or 

df.export("file.parquet") # This will auto-use the above method by looking at the extensions specified

In contrast, df.export_arrow("file.arrow") exports to a different (Arrow-native) file format.
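
A minimal sketch of the two matching write/read pairs, continuing from the repro above (file names are illustrative):

import vaex
import pyarrow.parquet as pq

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)

# Parquet out -> Parquet reader back in:
df.export_parquet('s2t.parquet')
table = pq.read_table('s2t.parquet')  # opens cleanly now

# Arrow IPC out -> read back with vaex (or pyarrow's IPC readers),
# not with pq.read_table:
df.export_arrow('s2t.arrow')
df_arrow = vaex.open('s2t.arrow')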

alvations commented 1 year ago

Thanks for the quick reply! Got the right write/read functions and extensions now.