segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Is it possible to create a Schema at runtime dynamically? #311

Closed by slim-bean 2 years ago

slim-bean commented 2 years ago

I'd like to generate parquet files dynamically at runtime.

Example: I have a []map[string]string and I'd like to turn the map keys into columns.

From the slice of maps I can derive a consistent set of keys to use as columns. Each map in the slice then becomes a row: I pull its values out in column order and write them to that row.
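A minimal sketch of the column/row preparation described above, using only the standard library. The helper names `columnsOf` and `rowOf` are hypothetical, and sorting the keys is an assumption made here to get a stable column order; missing keys are represented as nil so they can later map to optional (null) values.

```go
package main

import (
	"fmt"
	"sort"
)

// columnsOf returns the sorted union of all keys found in records.
// (Hypothetical helper; sorting is one way to get a stable column order.)
func columnsOf(records []map[string]string) []string {
	seen := make(map[string]struct{})
	for _, rec := range records {
		for k := range rec {
			seen[k] = struct{}{}
		}
	}
	cols := make([]string, 0, len(seen))
	for k := range seen {
		cols = append(cols, k)
	}
	sort.Strings(cols)
	return cols
}

// rowOf lays one record out in column order.
// A key that is absent from the record yields a nil cell.
func rowOf(cols []string, rec map[string]string) []*string {
	row := make([]*string, len(cols))
	for i, c := range cols {
		if v, ok := rec[c]; ok {
			v := v // copy so each cell has its own address
			row[i] = &v
		}
	}
	return row
}

func main() {
	records := []map[string]string{
		{"host": "a", "level": "info"},
		{"host": "b", "msg": "hello"},
	}
	cols := columnsOf(records)
	fmt.Println(cols)
	for _, rec := range records {
		for i, cell := range rowOf(cols, rec) {
			if cell == nil {
				fmt.Printf("%s=<null> ", cols[i])
			} else {
				fmt.Printf("%s=%q ", cols[i], *cell)
			}
		}
		fmt.Println()
	}
}
```

The resulting column list is what a dynamically built schema would be derived from; the rows still have to be written through whatever writer API is used.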

I tried building a struct at runtime using a whole mess of reflect code, but this was kind of gnarly and also didn't work. (The SchemaOf methods also do a lot of reflection, and I couldn't get anything working that way either.)

The Schema type has a really limited set of constructors. What are your thoughts on supporting this kind of functionality, perhaps through some new Schema constructors that let you set the column information directly?

Pryz commented 2 years ago

https://github.com/polarsignals/frostdb/tree/main/dynparquet might be what you are looking for?

sdressler commented 2 years ago

To get a dynamic (simple) schema, I am currently using a snippet like this:

structFields := []reflect.StructField{}
for _, field := range fields {
    tag := fmt.Sprintf(`parquet:"%v,optional,plain"`, field.name)

    var tp reflect.Type
    switch field.typ { // renamed from "type", which is a Go keyword and cannot be a field name
    case "int":
        x := int64(0)
        tp = reflect.TypeOf(&x)

    case "string":
        [...]
    }

    structFields = append(structFields, reflect.StructField{
        Name: strings.ToUpper(field.name),
        Type: tp,
        Tag:  reflect.StructTag(tag),
    })
}

structType := reflect.StructOf(structFields)
structElem = reflect.New(structType)

schema = parquet.SchemaOf(structElem.Interface())

HTH

slim-bean commented 2 years ago

Thanks for the helpful references @Pryz and @sdressler!

I was able to make this work!

I was really close before; I got tripped up by not passing the .Interface() value to parquet.SchemaOf().