segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

Optional Repeated Elements Panics on Deconstruct #324

Closed abraithwaite closed 1 year ago

abraithwaite commented 2 years ago

Simple scenario where deconstruct/reconstruct panics unexpectedly. This should be a valid configuration according to the Parquet spec, but our implementation does not support optional repeated elements strongly yet.

--- FAIL: TestNestedDeconstrcutReconstruct (0.00s)
    --- FAIL: TestNestedDeconstrcutReconstruct/handle_nested_repeated_elements (0.00s)
panic: cannot create parquet value of type BYTE_ARRAY from go value of type []string [recovered]
        panic: cannot create parquet value of type BYTE_ARRAY from go value of type []string

goroutine 21 [running]:
testing.tRunner.func1.2({0x102c0d140, 0x140000a03c0})
        /opt/homebrew/Cellar/go/1.19/libexec/src/testing/testing.go:1396 +0x1c8
testing.tRunner.func1()
        /opt/homebrew/Cellar/go/1.19/libexec/src/testing/testing.go:1399 +0x378
panic({0x102c0d140, 0x140000a03c0})
        /opt/homebrew/Cellar/go/1.19/libexec/src/runtime/panic.go:884 +0x204
github.com/segmentio/parquet-go.makeValue(0x6?, {0x102c05980?, 0x140000b2100?, 0x102cd8cc0?})
        /Users/abraithwaite/Projects/parquet-go/value.go:267 +0x920
github.com/segmentio/parquet-go.deconstructFuncOfLeaf.func1({0x140000a6600, 0x2, 0x2}, {0x88?, 0xc8?, 0xae?}, {0x102c05980?, 0x140000b2100?, 0x102b041a2?})
        /Users/abraithwaite/Projects/parquet-go/row.go:495 +0x68
github.com/segmentio/parquet-go.deconstructFuncOfOptional.func1({0x140000a6600?, 0x2?, 0x2?}, {0x8?, 0x1?, 0xb?}, {0x102c05980?, 0x140000b2100?, 0x14000119cd8?})
        /Users/abraithwaite/Projects/parquet-go/row.go:395 +0xdc
github.com/segmentio/parquet-go.deconstructFuncOfGroup.func1({0x140000b0108?, 0x140000a63d0?, 0xd9bb1a010c810607?}, {0x0?, 0x80?, 0xd?}, {0x102c77660?, 0x140000b20f0?, 0x102c773e0?})
        /Users/abraithwaite/Projects/parquet-go/row.go:473 +0x12c
github.com/segmentio/parquet-go.deconstructFuncOfRepeated.func1({0x140000b0108?, 0x1, 0x1}, {0xa0?, 0x0, 0x0}, {0x102c03600?, 0x140000a63d0?, 0x102c7bee0?})
        /Users/abraithwaite/Projects/parquet-go/row.go:411 +0x194
github.com/segmentio/parquet-go.deconstructFuncOfGroup.func1({0x0?, 0x140000a4180?, 0x102be4619?}, {0x88?, 0xfc?, 0xcf?}, {0x102c773e0?, 0x140000a63c0?, 0x102cffc88?})
        /Users/abraithwaite/Projects/parquet-go/row.go:473 +0x12c
github.com/segmentio/parquet-go.(*Schema).Deconstruct(0x140000a4180, {0x0?, 0x0, 0x0}, {0x102c773e0?, 0x140000a63c0?})
        /Users/abraithwaite/Projects/parquet-go/schema.go:232 +0x18c
github.com/segmentio/parquet-go_test.TestNestedDeconstrcutReconstruct.func1(0x14000092820)
        /Users/abraithwaite/Projects/parquet-go/row_test.go:62 +0x58
testing.tRunner(0x14000092820, 0x140000a0240)
        /opt/homebrew/Cellar/go/1.19/libexec/src/testing/testing.go:1446 +0x10c
created by testing.(*T).Run
        /opt/homebrew/Cellar/go/1.19/libexec/src/testing/testing.go:1493 +0x300
exit status 2
FAIL    github.com/segmentio/parquet-go 0.659s
abraithwaite commented 2 years ago

Some progress debugging this. I know the root cause is due to the optional tag, and I can see the way the node changes when that's enabled as demonstrated here:

(dlv) p name
"phoneNumbers"
(dlv) call node.GoType().Elem().Kind()
> github.com/segmentio/parquet-go.makeNodeOf() ./schema.go:857 (PC: 0x102f08e5c)
Values returned:
        ~r0: String (24)

   852:                         panic("unhandled nested slice on parquet schema without list tag")
   853:                 }
   854:         }
   855:
   856:         if optional {
=> 857:                 node = Optional(node)
   858:         }
   859:
   860:         return node
   861: }
   862:
(dlv) n
> github.com/segmentio/parquet-go.makeNodeOf() ./schema.go:860 (PC: 0x102f08e78)
   855:
   856:         if optional {
   857:                 node = Optional(node)
   858:         }
   859:
=> 860:         return node
   861: }
   862:
   863: func forEachTagOption(tags []string, do func(option, args string)) {
   864:         for _, tag := range tags {
   865:                 _, tag = split(tag) // skip the field name
(dlv) call node.GoType().Elem().Kind()
> github.com/segmentio/parquet-go.makeNodeOf() ./schema.go:860 (PC: 0x102f08e78)
Values returned:
        ~r0: Slice (23)

   855:
   856:         if optional {
   857:                 node = Optional(node)
   858:         }
   859:
=> 860:         return node
   861: }

That Kind is ultimately what drives the logic in the Construct functions in makeValue, which is where the panic is being thrown:

https://github.com/segmentio/parquet-go/blob/5bd5f6114638a749b9326aaf4ea5a6ea90cc9cf4/value.go#L203-L206 ... skipped code, same function: ... https://github.com/segmentio/parquet-go/blob/5bd5f6114638a749b9326aaf4ea5a6ea90cc9cf4/value.go#L260-L268

Ultimately, we want v.Type() to be string here, not []string.

I get that Optional() wraps the GoType() in a reflect.PtrTo here, called from schema.go:857 during schema construction.

https://github.com/segmentio/parquet-go/blob/5bd5f6114638a749b9326aaf4ea5a6ea90cc9cf4/node.go#L157

However, removing that reflection wrapper in optional.GoType() doesn't seem to affect the behavior of the code. That is, we still see []string in makeValue when we're expecting just string.

I'm currently looking for where the value being passed into makeValue is being unwrapped and/or set in the first place.