parquet-go / parquet-go

High-performance Go package to read and write Parquet files
https://pkg.go.dev/github.com/parquet-go/parquet-go
Apache License 2.0
250 stars 41 forks source link

Writing structs with embedded pointer types causes nils to be converted to zero-value #163

Open mikemherron opened 3 weeks ago

mikemherron commented 3 weeks ago

I think there may be an issue when writing structs with embedded pointer values. It seems like on write (and then read) any embedded pointers will have values of nil replaced with the zero-value for the pointer type (I've only tested with int and string but assume this to be the case with other types). Pointers that are defined on the top-level struct don't seem to have this issue. While trying to debug this, I also noticed that the struct passed to writer.Write actually has the embedded pointer values mutated to the zero-value, which I'm assuming is not intended.

Both issues can be seen in the below repro:

package main

import (
    "bytes"
    "fmt"
    "log"

    "github.com/parquet-go/parquet-go"
    "github.com/parquet-go/parquet-go/compress/zstd"
)

type someStruct struct {
    embeddedStruct
    InlineIntPointer    *int
    InlineStringPointer *string
    Name                string
}

type embeddedStruct struct {
    // The bug seems to be when parquet-go serialises the embedded struct, the pointers
    // are set to the zero-value for the pointer type, instead of nil.
    EmbeddedIntPointer    *int
    EmbeddedStringPointer *string
}

func (p someStruct) String() string {
    return fmt.Sprintf(` %s: 
  inline int pointer: %s
  inline string pointer: %s
  embedded int pointer: %s
  embedded string pointer: %s`,
        p.Name,
        intPtrVal(p.InlineIntPointer),
        strPtrVal(p.InlineStringPointer),
        intPtrVal(p.EmbeddedIntPointer),
        strPtrVal(p.EmbeddedStringPointer))
}

func main() {
    values := []*someStruct{
        {
            embeddedStruct: embeddedStruct{
                EmbeddedIntPointer:    nil,
                EmbeddedStringPointer: nil,
            },
            InlineIntPointer:    nil,
            InlineStringPointer: nil,
            Name:                "Both should be nil",
        },
        {
            embeddedStruct: embeddedStruct{
                EmbeddedIntPointer:    intPtr(12),
                EmbeddedStringPointer: strPtr("12"),
            },
            InlineIntPointer:    intPtr(12),
            InlineStringPointer: strPtr("12"),
            Name:                "Both should be 12",
        },
        {
            embeddedStruct: embeddedStruct{
                EmbeddedIntPointer:    intPtr(0),
                EmbeddedStringPointer: strPtr("0"),
            },
            InlineIntPointer:    intPtr(0),
            InlineStringPointer: strPtr("0"),
            Name:                "Both should be 0",
        },
    }

    fmt.Println("Original values:")
    for _, v := range values {
        fmt.Println(v.String())
    }

    buff := &bytes.Buffer{}
    writer := parquet.NewWriter(buff, parquet.Compression(&zstd.Codec{}))
    for _, row := range values {
        if err := writer.Write(row); err != nil {
            log.Fatalf("unable to write parquet row: %v", err)
        }
    }

    err := writer.Close()
    if err != nil {
        log.Fatalf("unable to close parquet writer: %v", err)
    }

    fmt.Println("\nValues after using writer.Write:")
    for _, v := range values {
        fmt.Println(v.String())
    }

    bytesReader := bytes.NewReader(buff.Bytes())
    reader := parquet.NewGenericReader[someStruct](bytesReader)
    valuesRead := make([]someStruct, reader.NumRows())
    _, err = reader.Read(valuesRead)
    if err != nil {
        log.Fatalf("unable to read parquet rows: %v", err)
    }

    fmt.Println("\nValues read from reader.Read:")
    for _, p := range valuesRead {
        fmt.Println(p.String())
    }
}

func intPtr(i int) *int {
    return &i
}

func strPtr(s string) *string {
    return &s
}

func intPtrVal(i *int) string {
    if i == nil {
        return "nil"
    }
    return fmt.Sprintf("%d", *i)
}

func strPtrVal(s *string) string {
    if s == nil {
        return "nil"
    }
    return fmt.Sprintf("\"%s\"", *s)
}

This outputs:

Original values:
 Both should be nil: 
  inline int pointer: nil
  inline string pointer: nil
  embedded int pointer: nil
  embedded string pointer: nil
 Both should be 12: 
  inline int pointer: 12
  inline string pointer: "12"
  embedded int pointer: 12
  embedded string pointer: "12"
 Both should be 0: 
  inline int pointer: 0
  inline string pointer: "0"
  embedded int pointer: 0
  embedded string pointer: "0"

Values after using writer.Write:
 Both should be nil: 
  inline int pointer: nil
  inline string pointer: nil
  embedded int pointer: 0
  embedded string pointer: ""
 Both should be 12: 
  inline int pointer: 12
  inline string pointer: "12"
  embedded int pointer: 12
  embedded string pointer: "12"
 Both should be 0: 
  inline int pointer: 0
  inline string pointer: "0"
  embedded int pointer: 0
  embedded string pointer: "0"

Values read from reader.Read:
 Both should be nil: 
  inline int pointer: nil
  inline string pointer: nil
  embedded int pointer: 0
  embedded string pointer: ""
 Both should be 12: 
  inline int pointer: 12
  inline string pointer: "12"
  embedded int pointer: 12
  embedded string pointer: "12"
 Both should be 0: 
  inline int pointer: 0
  inline string pointer: "0"
  embedded int pointer: 0
  embedded string pointer: "0"

It's the "both should be nil" test case that is interesting here - you can see in the original output, both embedded pointers were nil. In the final output, they are 0 and "" (empty string). You can also see after performing the write, the embedded pointer values have changed already.

mikemherron commented 3 weeks ago

Updating the repro to use parquet.NewGenericWriter rather than parquet.NewWriter solves the issue of the struct values being changed after the write, but they still have the embedded nils changed to the zero-value on read.

mikemherron commented 3 weeks ago

I've created a test for the issue here, I'll look in to it a bit and see if I can make some progress towards a fix.

mikemherron commented 3 weeks ago

OK, I think the values are correctly written as nil, but turn back to the zero value on read. I traced it down to these lines in schema.go - if I comment out the following code my test case passes:

 func fieldByIndex(v reflect.Value, index []int) reflect.Value {
        for _, i := range index {
                if v = v.Field(i); v.Kind() == reflect.Ptr || v.Kind() == reflect.Interface {
-                       if v.IsNil() {
-                               v.Set(reflect.New(v.Type().Elem()))
-                               v = v.Elem()
-                               break
-                       } else {
-                               v = v.Elem()
-                       }
+                       //if v.IsNil() {
+                       //      v.Set(reflect.New(v.Type().Elem()))
+                       //      v = v.Elem()
+                       //      break
+                       //} else {
+                       //      v = v.Elem()
+                       //}
                }
        }
        return v

It looks like this code is intentionally setting the nil pointer to the zero value 🤔 Pointers on the top-level struct don't go down this code path at all, I think due to the if statement here. I don't have any prior knowledge of the code so I'm probably getting to the limit of what I can contribute towards a fix - I'll leave this for now and hopefully someone with more context can pick this up from here :)

achille-roussel commented 1 week ago

Hello @mikemherron

I would try to run the test suite with the code change you made, if none of the tests break then the behavior wasn't tested and it seems fine to change.

If some of the tests break, then it's worth digging a bit and figure out how to retain the behavior while fixing the bug.

I would encourage you to submit a pull request with your code snippet converted to a test so we can at least validate that the issue is being fixed and that we won't introduce regressions in the future.

Thanks a lot for your contribution!