Hello @yorsita
I ran the code you provided and used the standard parquet-tools CLI from https://github.com/apache/parquet-mr and it appears to be able to read the file generated by parquet-go:
$ parquet-tools dump /tmp/file.parquet
row group 0
--------------------------------------------------------------------------------
FirstName: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:58/58/1.00 VC:2 ENC:DELT [more]...
LastName: BINARY UNCOMPRESSED DO:0 FPO:62 SZ:37/37/1.00 VC:2 ENC:DEL [more]...
FirstName TV=2 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:DELTA_LENGTH_BYTE_ARRAY ST:[no stats [more]... VC:2
LastName TV=2 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:DELTA_LENGTH_BYTE_ARRAY ST:[no stats [more]... VC:2
BINARY FirstName
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 V:Bob
value 2: R:0 D:0 V:Alice
BINARY LastName
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 V:
value 2: R:0 D:0 V:
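For reference, a minimal writer consistent with the dump above might look roughly like this (a sketch, not the exact code from this thread; the `Person` struct name is a placeholder I chose, and the rows just match the FirstName values shown above):

```go
package main

import (
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

type Person struct {
	FirstName string
	LastName  string
}

func main() {
	f, err := os.Create("/tmp/file.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The writer derives the parquet schema from the Go struct type
	// of the rows passed to Write.
	w := parquet.NewWriter(f)
	rows := []Person{{FirstName: "Bob"}, {FirstName: "Alice"}}
	for _, row := range rows {
		if err := w.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	// Close flushes buffered rows and writes the parquet footer;
	// skipping it produces an unreadable file.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```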
I tested with the Python program you referenced, and it appears that it does not support the standard DELTA_LENGTH_BYTE_ARRAY encoding:
parquet-tools show /tmp/file.parquet
Traceback (most recent call last):
File "/home/achille_roussel/.local/bin/parquet-tools", line 8, in <module>
sys.exit(main())
File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/cli.py", line 26, in main
args.handler(args)
File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/show.py", line 59, in _cli
with get_datafame_from_objs(pfs, args.head) as df:
File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
return next(self.gen)
File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/utils.py", line 190, in get_datafame_from_objs
df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
File "/usr/lib/python3.9/contextlib.py", line 429, in enter_context
result = _cm_type.__enter__(cm)
File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
return next(self.gen)
File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/utils.py", line 71, in get_dataframe
yield pq.read_table(local_path).to_pandas()
File "/home/achille_roussel/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/achille_roussel/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.
This issue should be raised with the maintainers of the Python parquet-tools program, I believe.
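In the meantime, one way to double-check the contents of a file produced by parquet-go is to read it back with parquet-go itself, since its reader understands the encodings its writer emits. A minimal sketch, assuming the same `Person` struct and file path as in the writer example above:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/segmentio/parquet-go"
)

type Person struct {
	FirstName string
	LastName  string
}

func main() {
	f, err := os.Open("/tmp/file.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The reader maps parquet columns back onto the struct fields by name.
	r := parquet.NewReader(f)
	for {
		var p Person
		if err := r.Read(&p); err != nil {
			if err == io.EOF {
				break // reached the last row
			}
			log.Fatal(err)
		}
		fmt.Printf("%+v\n", p)
	}
}
```

Alternatively, configuring the writer to use encodings that pyarrow already implements (for example plain or dictionary encoding, if your version exposes that through struct tags or writer options) may make the file readable by the Python tooling; check the parquet-go README for the exact knobs available in your version.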
Hi Achille @achille-roussel, thanks so much for the help! I'll note that and use the tool you mentioned.
BTW, @achille-roussel, I am writing data into a parquet file several times. In certain cases I want to append the data to an existing parquet file. I was wondering, is there currently a way to do that? Or alternatively, can I read the parquet file into a buffer, append the data at the end, and flush it back to the same file?
The parquet format does not provide strong guidance on how to implement appending data to an existing parquet file. Regenerating a new file would indeed be the simplest way to achieve this.
I left more details in https://github.com/segmentio/parquet-go/issues/349 which you may find useful.
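To make the "regenerate a new file" suggestion concrete, here is one possible sketch (my own illustration, not an API offered by parquet-go). It reuses the `Person` type and imports from the earlier snippets: read all existing rows, rewrite them to a temporary file together with the rows to append, then swap the files:

```go
// appendByRewrite is a hypothetical helper (not part of parquet-go)
// illustrating the "regenerate the file" approach: copy every row from
// the existing file into a temporary file, write the extra rows after
// them, then replace the original file.
func appendByRewrite(path string, extra []Person) error {
	old, err := os.Open(path)
	if err != nil {
		return err
	}
	defer old.Close()

	tmp, err := os.Create(path + ".tmp")
	if err != nil {
		return err
	}
	defer tmp.Close()

	r := parquet.NewReader(old)
	w := parquet.NewWriter(tmp)

	// Copy the rows that are already in the file.
	for {
		var p Person
		if err := r.Read(&p); err != nil {
			if err == io.EOF {
				break
			}
			return err
		}
		if err := w.Write(p); err != nil {
			return err
		}
	}
	// Append the new rows, then finalize the footer.
	for _, p := range extra {
		if err := w.Write(p); err != nil {
			return err
		}
	}
	if err := w.Close(); err != nil {
		return err
	}
	return os.Rename(path+".tmp", path)
}
```

Note that this rewrites the whole file, so the cost grows with the file size; for large datasets, writing separate files per batch is usually preferable.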
I will close this issue since I believe the original question has been addressed. If you have follow-ups about appending data to a parquet file, feel free to continue the conversation in the issue linked above 👍
Thanks so much
Hi team, I was trying to use the below sample, following the README, to generate a parquet file. I could successfully generate one but failed to read it using the Python version of parquet-tools. I'm new to parquet; could you please tell me if there are any tools I can use to read the parquet file generated by segmentio/parquet-go? Thanks so much.