segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

definitionLevel redux #279

Closed vc42 closed 2 years ago

vc42 commented 2 years ago

If we want to support a slice of pointers, we need to fix excessive in this case def level due to the fact that it's incremented twice, as a slice member and as a pointer(optional):

test case:

//
package main

import (
    "bytes"
    "fmt"
    //"github.com/davecgh/go-spew/spew"
    //"io"

    "github.com/segmentio/parquet-go"
)

type T2 struct {
    Id        int    `parquet:",plain,optional"`
    Name      string `parquet:",plain,optional"`
    //Timestamp string `json:"timestamp,omitempty" parquet:"timestamp,plain,timestamp,optional"`
    //Date      string `json:"timestamp,omitempty" parquet:"date,plain,date,optional"`
}

type T1 struct {
    TA []*T2
}

type T struct {
    T1 *T1
}

const nRows = 1

func main() {

    t := T{
        T1: &T1{
            TA: []*T2{&T2{
                Id: 43,
                Name:      "john",
                //Timestamp: "2022-04-12 19:51:52",
                //Date:      "2022-04-12",
            },
            },
        },
    }

    buffer := new(bytes.Buffer)

    rows := make([]T, nRows)
    for i := 0; i < nRows; i++ {
        t.T1.TA[0].Id = i+40
        rows[i] = t
    }

    fmt.Printf("bad:\n")
    gw := parquet.NewGenericWriter[T](buffer)
    _, err := gw.Write(rows)
    if err != nil {
        panic(err)
    }
    gw.Close()

    fmt.Printf("good:\n")
    w := parquet.NewWriter(buffer)
    for i := 0; i < nRows; i++ {
        err := w.Write(&t)
        if err != nil {
            panic(err)
        }
    }
    w.Close()

    //r := bytes.NewReader(buffer.Bytes())
    //rows2, err := parquet.Read[T](r, int64(buffer.Len()))
    //if err != io.EOF {
    //  panic(err)
    //}
    //spew.Dump(rows2)
}

Output: bad:

col: T1.TA.Id max_def 3 def [4]
col: T1.TA.Name max_def 3 def [4]

good:

col: T1.TA.Id max_def 3 def [3]
col: T1.TA.Name max_def 3 def [3]

Possible fix (not to increment if the element is a pointer, not sure if there's a better way):

diff --git a/column_buffer_go18.go b/column_buffer_go18.go
index 5f5d7f7..f0cb8d6 100644
--- a/column_buffer_go18.go
+++ b/column_buffer_go18.go
@@ -250,7 +250,9 @@ func writeRowsFuncOfSlice(t reflect.Type, schema *Schema, path columnPath) write
                        elemLevels := levels
                        if a.Len() > 0 {
                                b = a.Slice(0, 1)
-                               elemLevels.definitionLevel++
+                               if elemType.Kind() != reflect.Pointer {
+                                       elemLevels.definitionLevel++
+                               }
                        }

                        if err := writeRows(columns, b, elemLevels); err != nil {
vc42 commented 2 years ago

Again formatting is broken. Suggestion:

Replace:

elemLevels.definitionLevel++

with:

                if elemType.Kind() != reflect.Pointer {
                    elemLevels.definitionLevel++
                }