segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 102 forks source link

AWS Athena's `=` not working for string #500

Closed mrasu closed 1 year ago

mrasu commented 1 year ago

Hello, When I made a parquet file and uploaded to S3 and run in Athena with following query, Athena returns nothing even it exists in reality.

select id_str from dummy1 where id_str = '1064572a7aa5d023a0455b956668b179'

Is there any way to handle it?

Detail

I made a parquet file with below code.

func main() {
    type segmentio struct {
        Name  string `parquet:"name"`
        IDStr string `parquet:"id_str"`
    }
    ss := []segmentio{
        {Name: "segmentio", IDStr: "1064572a7aa5d023a0455b956668b179"},
    }

    if err := parquet.WriteFile("p_segmentio.bin", ss); err != nil {
        panic(err)
    }
}

And I made a table in AWS Athena (trino compatibl) and uploaded the made file to S3.

CREATE EXTERNAL TABLE `dummy1`(
  `name` string, 
  `id_str` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://${BUCKET}/aaa'

Then, I wanted the where id_str = '1064572a7aa5d023a0455b956668b179' SQL matches one record but it didn't.

xitongsys/parquet-go works

For investigation, I did the same thing with https://github.com/xitongsys/parquet-go with below code,


func main() {
    type xitongsys struct {
        Name  string `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8"`
        IDStr string `parquet:"name=id_str, type=BYTE_ARRAY, convertedtype=UTF8"`
    }

    fw, err := local.NewLocalFileWriter("p_xitongsys.bin")
    if err != nil {
        panic(err)
    }
    pw, err := writer.NewParquetWriter(fw, new(xitongsys), 4)
    if err != nil {
        panic(err)
    }
    xx := xitongsys{Name: "xitongsys", IDStr: "1064572a7aa5d023a0455b956668b179"}
    if err = pw.Write(xx); err != nil {
        panic(err)
    }
    if err = pw.WriteStop(); err != nil {
        panic(err)
    }
    fw.Close()
}

After I uploaded it to the same directory and run the same query, I could get one record. So, I think this is not the problem in Athena but in parquet file.

paquet-cli

I compared metadata of the two files with parquet-mr/parquet-cli and there is difference, I hope it helps you see my situation.

For segmentio/parquet-go,

File path:  p_segmentio.bin
Created by: github.com/segmentio/parquet-go version 0.0.0(build e1109e2f6a20)
Properties: (none)
Schema:
message segmentio {
  required binary name (STRING);
  required binary id_str (STRING);
}

Row group 0:  count: 1  107.00 B records  start: 4  total(compressed): 107 B total(uncompressed):107 B 
--------------------------------------------------------------------------------
        type      encodings count     avg size   nulls   min / max
name    BINARY    _         1         42.00 B            
id_str  BINARY    _         1         65.00 B   

For xitongsys/parquet-go,

File path:  p_xitongsys.bin
Created by: parquet-go version latest
Properties: (none)
Schema:
message parquet_go_root {
  required binary name (STRING) = 0;
  required binary id_str (STRING) = 0;
}

Row group 0:  count: 1  275.00 B records  start: 4  total(compressed): 275 B total(uncompressed):271 B 
--------------------------------------------------------------------------------
        type      encodings count     avg size   nulls   min / max
name    BINARY    S BB_     1         80.00 B    0       "xitongsys" / "xitongsys"
id_str  BINARY    S BB_     1         195.00 B   0       "1064572a7aa5d023a0455b956..." / "1064572a7aa5d023a0455b956..."

Thank you in advance.

bartleyg commented 1 year ago

@mrasu I tried to reproduce this with the latest commit (05ed5ed000b1) and the Athena query was successful for me. I see your file was written with version e1109e2f6a20 of parquet-go which is several versions behind. I recommend updating your go.mod to the latest version of parquet-go (05ed5ed000b1), rewriting and uploading the file to S3, recreating the table, and trying the query again.

I also did another experiment adding ,zstd" to the parquet struct tags, and queried that table fine with Athena too.

Here's the meta output of my file reproducing yours but with the latest build:

$ parquet meta p_segmentio.bin 

File path:  p_segmentio.bin
Created by: github.com/segmentio/parquet-go version 0.0.0(build 05ed5ed000b1)
Properties: (none)
Schema:
message segmentio {
  required binary name (STRING);
  required binary id_str (STRING);
}

Row group 0:  count: 1  107.00 B records  start: 4  total(compressed): 107 B total(uncompressed):107 B 
--------------------------------------------------------------------------------
        type      encodings count     avg size   nulls   min / max
name    BINARY    _         1         42.00 B            "segmentio" / "segmentio"
id_str  BINARY    _         1         65.00 B            "1064572a7aa5d023a0455b956..." / "1064572a7aa5d023a0455b956..."
mrasu commented 1 year ago

@bartleyg After upgrading to the latest version, it works. Thank you!

bartleyg commented 1 year ago

@mrasu awesome! 🎉