segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars · 58 forks

Cannot read the Parquet file generated by the sample code in README using parquet-tools #367

Closed yorsita closed 2 years ago

yorsita commented 2 years ago

Hi team, I was trying to use the sample below, following the README, to generate a parquet file. I could generate one successfully, but failed to read it with the Python version of parquet-tools. I'm new to Parquet; could you please tell me whether there are any tools I can use to read the parquet files generated by segmentio/parquet-go? Thanks so much.

package main

import (
    "log"

    "github.com/segmentio/parquet-go"
)

func main() {
    type RowType struct{ FirstName, LastName string }

    if err := parquet.WriteFile("file.parquet", []RowType{
        {FirstName: "Bob"},
        {FirstName: "Alice"},
    }); err != nil {
        log.Println(err)
    }
}
achille-roussel commented 2 years ago

Hello @yorsita

I ran the code you provided and used the standard parquet-tools CLI from https://github.com/apache/parquet-mr and it appears to be able to read the file generated by parquet-go:

$ parquet-tools dump /tmp/file.parquet
row group 0
--------------------------------------------------------------------------------
FirstName:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:58/58/1.00 VC:2 ENC:DELT [more]...
LastName:   BINARY UNCOMPRESSED DO:0 FPO:62 SZ:37/37/1.00 VC:2 ENC:DEL [more]...

    FirstName TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:DELTA_LENGTH_BYTE_ARRAY ST:[no stats  [more]... VC:2

    LastName TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:DELTA_LENGTH_BYTE_ARRAY ST:[no stats  [more]... VC:2

BINARY FirstName
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 V:Bob
value 2: R:0 D:0 V:Alice

BINARY LastName
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 V:
value 2: R:0 D:0 V:

I tested with the Python program you referenced, and it appears that it does not support the standard DELTA_LENGTH_BYTE_ARRAY encoding:

parquet-tools show /tmp/file.parquet
Traceback (most recent call last):
  File "/home/achille_roussel/.local/bin/parquet-tools", line 8, in <module>
    sys.exit(main())
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/show.py", line 59, in _cli
    with get_datafame_from_objs(pfs, args.head) as df:
  File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/utils.py", line 190, in get_datafame_from_objs
    df: Optional[pd.DataFrame] = stack.enter_context(pf.get_dataframe())
  File "/usr/lib/python3.9/contextlib.py", line 429, in enter_context
    result = _cm_type.__enter__(cm)
  File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/parquet_tools/commands/utils.py", line 71, in get_dataframe
    yield pq.read_table(local_path).to_pandas()
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/achille_roussel/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.

I believe this issue should be raised with the maintainers of the Python parquet-tools program.

yorsita commented 2 years ago

Hi Achille @achille-roussel, thanks so much for the help, will note that and use the tool you mentioned!

yorsita commented 2 years ago

BTW, @achille-roussel, I am writing data into a parquet file several times, and in certain cases I want to append the data to an existing parquet file. I was wondering whether there is currently a way to do that? Or, alternatively, can I read the parquet file into a buffer, append the data at the end, and flush it back to the same file?

achille-roussel commented 2 years ago

The parquet format was not designed with appending to existing files in mind (the file metadata lives in the footer at the end of the file, so it would have to be rewritten on every append). Regenerating a new file would indeed be the simplest way to achieve this.

I left more details in https://github.com/segmentio/parquet-go/issues/349 which you may find useful.

I will close this issue since I believe the original question has been addressed. If you have follow-ups about appending data to parquet files, feel free to continue the conversation in the issue linked above 👍

yorsita commented 2 years ago

Thanks so much