segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Enable backwards compatible reads on maps #415

Open parsnips opened 1 year ago

parsnips commented 1 year ago

I have a Parquet file that I ran PrintSchema on, and for one of the maps it prints the following schema snippet:

message Foo {
    // Other fields

    optional group new {
        repeated group map {
            required binary key (STRING);
            optional group value {
                optional binary b (STRING);
                optional binary n (STRING);
            }
        }
    }
}

This appears to be a deprecated form of a map (output by the AWS Kinesis Firehose Parquet transformer). I tried to model it using the following HasMap struct:

type AV struct {
    B string `parquet:"b,optional"`
    N string `parquet:"n,optional"`
}

type HasMap struct {
    New map[string]AV `parquet:"new,"`
}

However, when I read the rows from my Parquet file, the Go maps are empty. Printing the schema for the struct, I noticed it was significantly different:

message HasMap {
    required group new (MAP) {
        repeated group key_value {
            required binary key (STRING);
            required group value {
                optional binary b (STRING);
                optional binary n (STRING);
            }
        }
    }
}

Any advice on building structs that work with repeated group map?
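
To make the difference between the two schemas concrete, here is a minimal sketch that lists the leaf column paths each one produces. It uses only the standard library; the `node` type and the `leafPaths` and `mapGroup` helpers are hypothetical names for illustration, not part of parquet-go. Whether the path mismatch is what leaves the maps empty is an assumption, not something confirmed here. The two schemas also differ in repetition (optional vs. required on `new` and `value`), which this sketch ignores.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, minimal stand-in for a Parquet schema node.
// It is not a parquet-go type.
type node struct {
	name     string
	children []node
}

// leafPaths returns the dotted path of every leaf column under n.
func leafPaths(n node, prefix []string) []string {
	path := append(append([]string{}, prefix...), n.name)
	if len(n.children) == 0 {
		return []string{strings.Join(path, ".")}
	}
	var out []string
	for _, c := range n.children {
		out = append(out, leafPaths(c, path)...)
	}
	return out
}

// mapGroup builds the "new" group from the schemas above, using the given
// name for the repeated group ("map" in the file, "key_value" in the struct).
func mapGroup(repeated string) node {
	return node{name: "new", children: []node{{
		name: repeated,
		children: []node{
			{name: "key"},
			{name: "value", children: []node{{name: "b"}, {name: "n"}}},
		},
	}}}
}

func main() {
	file := leafPaths(mapGroup("map"), nil)         // schema found in the file
	goType := leafPaths(mapGroup("key_value"), nil) // schema derived from HasMap
	for i := range file {
		fmt.Println(file[i], "vs", goType[i])
	}
}
```

Every leaf path differs in its second component (`map` vs. `key_value`), so a reader that matches columns strictly by path would find none of the struct's map columns in the file.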

parsnips commented 1 year ago

Failing test here: https://github.com/parsnips/parquet-go/commit/7aff1dac1e6cd5037d352705847d9b0bb2369aef

I've stepped through with a debugger a number of times, and it looks like reconstructFuncOfMap is never called even though it is set up on the schema.

When it reaches those map columns, the value always seems to be set to the zero value:

https://github.com/segmentio/parquet-go/blob/bcdc4570dd49ca4e1f2da368ee338a7b6ed9a2f4/row.go#L559-L562

parsnips commented 1 year ago

Note that this doesn't work with Parquet files that have the (MAP) annotations either:

optional group new (MAP) {
    repeated group map (MAP_KEY_VALUE) {
        required binary key (UTF8);
        optional group value {
            optional binary b (UTF8);
            optional binary n (UTF8);
        }
    }
}
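
For what a backwards-compatible read might involve, here is a rough sketch that rewrites both legacy layouts shown in this thread into the `key_value` form that the struct schema uses. It is a heuristic under stated assumptions, not parquet-go's implementation: the `field` type and the `looksLikeMap`, `normalizeMaps`, and `legacyMap` helpers are hypothetical, and the detection rule (a single repeated child group holding `key` then `value`) is my own simplification.

```go
package main

import "fmt"

// field is a hypothetical, minimal schema node; it is not a parquet-go type.
type field struct {
	Repetition string // "required", "optional", or "repeated"
	Name       string
	Annotation string // e.g. "MAP", "MAP_KEY_VALUE", or ""
	Children   []*field
}

// looksLikeMap is a heuristic (an assumption of this sketch): a group with a
// single repeated child group holding exactly "key" followed by "value".
func looksLikeMap(f *field) bool {
	if len(f.Children) != 1 {
		return false
	}
	kv := f.Children[0]
	if kv.Repetition != "repeated" || len(kv.Children) != 2 {
		return false
	}
	named := kv.Name == "map" || kv.Name == "key_value"
	annotated := f.Annotation == "MAP" || kv.Annotation == "MAP_KEY_VALUE"
	return (named || annotated) &&
		kv.Children[0].Name == "key" && kv.Children[1].Name == "value"
}

// normalizeMaps rewrites legacy map groups in place to the standard layout:
// the outer group is annotated MAP and the repeated group is named key_value.
func normalizeMaps(f *field) {
	if looksLikeMap(f) {
		f.Annotation = "MAP"
		f.Children[0].Name = "key_value"
		f.Children[0].Annotation = ""
	}
	for _, c := range f.Children {
		normalizeMaps(c)
	}
}

// legacyMap builds the "new" group from this thread with the given outer and
// inner annotations (both empty for the first file schema).
func legacyMap(outer, inner string) *field {
	return &field{Repetition: "optional", Name: "new", Annotation: outer, Children: []*field{{
		Repetition: "repeated", Name: "map", Annotation: inner,
		Children: []*field{
			{Repetition: "required", Name: "key"},
			{Repetition: "optional", Name: "value", Children: []*field{
				{Repetition: "optional", Name: "b"},
				{Repetition: "optional", Name: "n"},
			}},
		},
	}}}
}

func main() {
	for _, f := range []*field{legacyMap("", ""), legacyMap("MAP", "MAP_KEY_VALUE")} {
		normalizeMaps(f)
		fmt.Println(f.Name, f.Annotation, f.Children[0].Name) // new MAP key_value
	}
}
```

The sketch only renames and re-annotates groups; it leaves repetition untouched, so the optional vs. required difference on `new` and `value` would still need separate handling.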