Open hhoughgg opened 2 years ago
Hello @hhoughgg, thanks for starting this conversation!
Would this code snippet be helpful to highlight how to write struct values to a parquet buffer?
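rows := []ParquetTestSchemaObj{
...
}
buffer := parquet.NewGenericBuffer[ParquetTestSchemaObj]()
// Write returns the number of rows written and an error.
if _, err := buffer.Write(rows); err != nil {
// handle the write error
}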
Edit: Are you suggesting that I use multiple buffers? Perhaps I could pull the columns out of each one and concatenate them into a final buffer? The struct type is built with reflection, so in my case I won't have the concrete type to use with generics.
Ah, sorry, I think my explanation was poor. I have some structs that I want to write as a parquet group type (multiple columns), alongside other columns that are plain types such as int64. Some of my custom types can be written directly, while others, such as MyCustomColumn3, cannot.
The columns below would all end up in one parquet file:
MyCustomColumn1 []int64
MyCustomColumn2 []string
MyCustomColumn3 []struct {
	A string
	B int
}
After calling schemaOf I would have a parquet column buffer with 4 columns, but in this case I only have three columns of data. Is there an existing function that converts MyCustomColumn3 into the two columns I need to write? My assumption was that the row/struct writer has some deconstruction functions that convert a []struct into multiple columns.
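To make the question concrete, here is a minimal sketch of the kind of conversion being asked about, written with only the standard library's reflect package. It is not part of any parquet library: splitStructColumn is a hypothetical helper, and it assumes a flat struct with no nested, optional, or repeated fields. It turns a []struct into one slice of values per exported field.

```go
package main

import (
	"fmt"
	"reflect"
)

// splitStructColumn is a hypothetical helper (not a parquet library function).
// It takes a slice of structs and returns, for each exported field in
// declaration order, the field name and a slice holding that field's value
// for every row.
func splitStructColumn(column any) (names []string, columns [][]any, err error) {
	v := reflect.ValueOf(column)
	if v.Kind() != reflect.Slice || v.Type().Elem().Kind() != reflect.Struct {
		return nil, nil, fmt.Errorf("expected a slice of structs, got %T", column)
	}
	elem := v.Type().Elem()
	for i := 0; i < elem.NumField(); i++ {
		field := elem.Field(i)
		if !field.IsExported() {
			continue // unexported fields cannot be read through reflection
		}
		values := make([]any, v.Len())
		for row := 0; row < v.Len(); row++ {
			values[row] = v.Index(row).Field(i).Interface()
		}
		names = append(names, field.Name)
		columns = append(columns, values)
	}
	return names, columns, nil
}

func main() {
	// Same shape as MyCustomColumn3 in the question.
	col3 := []struct {
		A string
		B int
	}{{"x", 1}, {"y", 2}}

	names, columns, err := splitStructColumn(col3)
	if err != nil {
		panic(err)
	}
	for i, name := range names {
		fmt.Println(name, columns[i])
	}
}
```

Each resulting slice would still need to be converted into whatever value type the column writer expects; that step is left out here because it depends on the library.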
I am trying to write parquet files column by column; previously I was writing row by row using structs. For nested structs, I saw that the row writer runs the deconstruct function to get the []parquet.Value needed to build the nested rows. These functions all appear to be private, unless I am missing something? Right now the data for the column looks like []Struct{} or [][]Struct.
Does it make sense to make some of these public? In my use case I will have to convert the structs manually anyway, since the data is already in that form. It would be easier to detect the cases where the column is a struct, convert it to [][]parquet.Value or something similar, and write each one. Obviously the performance benefit is lost, but that seems OK when it's only 1 column out of, say, 20. Hopefully I am not misunderstanding how this works!
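For the [][]Struct case, a manual conversion also has to track repetition and definition levels. The sketch below illustrates this with standard-library Go only. The names leafValue, item, and flattenLists are hypothetical, and it assumes the column is a repeated group with two required fields (so the maximum repetition and definition levels are both 1). It is not the library's internal deconstruct logic.

```go
package main

import "fmt"

// leafValue is a hypothetical stand-in for a column value that carries
// its repetition and definition levels.
type leafValue struct {
	Value any
	Rep   int // 0 starts a new row, 1 continues the list within the same row
	Def   int // 0 means the list is empty (null slot), 1 means a value is present
}

// item mirrors the struct{ A string; B int } shape from the question.
type item struct {
	A string
	B int
}

// flattenLists converts one list of structs per row into two leaf columns,
// one for field A and one for field B.
func flattenLists(rows [][]item) (colA, colB []leafValue) {
	for _, list := range rows {
		if len(list) == 0 {
			// An empty list still occupies one slot per leaf column.
			colA = append(colA, leafValue{Value: nil, Rep: 0, Def: 0})
			colB = append(colB, leafValue{Value: nil, Rep: 0, Def: 0})
			continue
		}
		for i, it := range list {
			rep := 0
			if i > 0 {
				rep = 1
			}
			colA = append(colA, leafValue{Value: it.A, Rep: rep, Def: 1})
			colB = append(colB, leafValue{Value: it.B, Rep: rep, Def: 1})
		}
	}
	return colA, colB
}

func main() {
	rows := [][]item{
		{{"x", 1}, {"y", 2}}, // row 0: two elements
		{},                   // row 1: empty list
		{{"z", 3}},           // row 2: one element
	}
	colA, colB := flattenLists(rows)
	fmt.Printf("%+v\n%+v\n", colA, colB)
}
```

Optional fields or deeper nesting would raise the maximum levels and need extra bookkeeping, which is the main reason reusing the library's own deconstruction would be preferable.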