segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Write Structs to ColumnBuffer #385

Open hhoughgg opened 2 years ago

hhoughgg commented 2 years ago
type ParquetTestSchemaObj struct {
    Field1      string
    Field2      int
    Nested      Nested
    NestedSlice []Nested
}
type Nested struct {
    Name string
    Age  int
}

nestedCol := []Nested{{"jimmy", 50}, {"billy", 10}, {"bobby", 99}, {"tommy", 78}}
strCol := []string{"a", "b", "c", "d"}
intCol := []int{1, 2, 3, 4}

strColByteArray := make([]byte, 8)

ps := parquet.SchemaOf(ParquetTestSchemaObj{})
b := parquet.NewBuffer(ps)
pCols := b.ColumnBuffers()

if _, err := pCols[0].(parquet.ByteArrayWriter).WriteByteArrays(ConvertSliceStringToParquetByteArray(strCol)); err != nil {
    return err
}

I am trying to write parquet files column by column; previously I was writing row by row using structs. For columns that are nested structs, I saw that the row write path runs a deconstruct function to produce the []parquet.Value that builds the nested rows. These functions all look to be private, unless I am missing something? Right now my data for such a column looks like []Struct or [][]Struct.

Does it make sense to make some of these public? In my use case I will have to convert to structs manually anyway, since the data already has that shape. It would be easier to detect the situations where a column is a struct, convert it to [][]parquet.Value or something similar, and write each one. Obviously the columnar performance benefit is lost, but that seems fine when it's only 1 column out of, say, 20. Hopefully I am not misunderstanding how this works!
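
For reference, Schema.Deconstruct appears to be exported and fills a parquet.Row (a []parquet.Value) from a Go value, so a row-at-a-time fallback for struct-shaped data could look like this sketch (the objs slice is made up for illustration):

schema := parquet.SchemaOf(ParquetTestSchemaObj{})
buffer := parquet.NewBuffer(schema)

var row parquet.Row
for _, obj := range objs { // objs is a hypothetical []ParquetTestSchemaObj
    // Deconstruct flattens obj into one parquet.Value per leaf column,
    // with repetition and definition levels already set for nested fields.
    row = schema.Deconstruct(row[:0], obj)
    if _, err := buffer.WriteRows([]parquet.Row{row}); err != nil {
        return err
    }
}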

achille-roussel commented 2 years ago

Hello @hhoughgg, thanks for starting this conversation!

Would this code snippet be helpful to highlight how to write struct values to a parquet buffer?

rows := []ParquetTestSchemaObj{
  ...
}

buffer := parquet.NewGenericBuffer[ParquetTestSchemaObj]()
buffer.Write(rows)
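
If it helps, a fuller sketch of that approach, reusing the ParquetTestSchemaObj and Nested types from the first comment and flushing the buffered rows to a file with a GenericWriter (the output path and row values are made up, and this assumes GenericBuffer exposes its rows via Rows()):

package main

import (
    "log"
    "os"

    "github.com/segmentio/parquet-go"
)

func main() {
    rows := []ParquetTestSchemaObj{
        {Field1: "a", Field2: 1, Nested: Nested{Name: "jimmy", Age: 50}},
        {Field1: "b", Field2: 2, Nested: Nested{Name: "billy", Age: 10}},
    }

    buffer := parquet.NewGenericBuffer[ParquetTestSchemaObj]()
    if _, err := buffer.Write(rows); err != nil {
        log.Fatal(err)
    }

    f, err := os.Create("test.parquet") // made-up output path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Copy the buffered rows into a file writer; Close flushes the footer.
    writer := parquet.NewGenericWriter[ParquetTestSchemaObj](f)
    if _, err := parquet.CopyRows(writer, buffer.Rows()); err != nil {
        log.Fatal(err)
    }
    if err := writer.Close(); err != nil {
        log.Fatal(err)
    }
}
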
hhoughgg commented 2 years ago

Edit: Are you suggesting using multiple buffers? Perhaps I can just pull the columns out of each one and concatenate them into a final buffer? The struct type is built with reflection, so in my case I won't have a concrete type to use with generics.

Ah sorry, I think my explanation was poor. I have some structs I want to write as a parquet group type (multiple columns), along with other columns that are plain types such as int64. Some of my custom types can be written directly and others cannot, such as MyCustomColumn3.

The columns below would all end up in one parquet file:

MyCustomColumn1 []int64
MyCustomColumn2 []string
MyCustomColumn3 []struct{ A string; B int }

After SchemaOf, I would have a column buffer with 4 columns, but I only have three input columns in this case. Is there some existing function to convert MyCustomColumn3 into the two columns I need to write? My assumption was that the row/struct writer has some deconstruct step that converts []struct into multiple columns.
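
For reference, a sketch of how that deconstruct step could be reproduced with the exported API, dispatching each deconstructed value to its leaf column buffer by index (the Row type and the myCustomColumn* variables are illustrative, and this assumes ColumnBuffer implements parquet.ValueWriter):

type Row struct {
    MyCustomColumn1 int64
    MyCustomColumn2 string
    MyCustomColumn3 struct {
        A string
        B int
    }
}

schema := parquet.SchemaOf(Row{})
buffer := parquet.NewBuffer(schema)
columns := buffer.ColumnBuffers() // 4 leaf columns: MyCustomColumn3 contributes two

var row parquet.Row
for i := range myCustomColumn1 { // assumes the three input slices share a length
    row = schema.Deconstruct(row[:0], Row{
        MyCustomColumn1: myCustomColumn1[i],
        MyCustomColumn2: myCustomColumn2[i],
        MyCustomColumn3: myCustomColumn3[i],
    })
    for _, v := range row {
        // v.Column() reports the leaf column index the value belongs to,
        // so the nested struct's fields land in their own column buffers.
        if _, err := columns[v.Column()].WriteValues([]parquet.Value{v}); err != nil {
            return err
        }
    }
}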