segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

Different types returned in same scenario by Writer and Buffer #312

Closed asubiotto closed 2 years ago

asubiotto commented 2 years ago

In a simple scenario where I create a schema with a single bytes column, read the written page and attempt to create a dictionary from the observed type in the column index, I will get a panic depending on which type of writer I used to write the data. My expectation is that regardless of the writer used, I will observe the same behavior.

Here is a test illustrating what I mean:

func TestWriter(t *testing.T) {
    node := parquet.String()
    node = parquet.Encoded(node, &parquet.RLEDictionary)
    g := parquet.Group{}
    g["mystring"] = node
    schema := parquet.NewSchema("test", g)

    rows := []parquet.Row{[]parquet.Value{parquet.ValueOf("hello").Level(0, 0, 0)}}

    var storage bytes.Buffer

    tests := []struct {
        name        string
        getRowGroup func(t *testing.T) parquet.RowGroup
    }{
        {
            name: "NormalWriter",
            getRowGroup: func(t *testing.T) parquet.RowGroup {
                t.Helper()

                w := parquet.NewWriter(&storage, schema)
                _, err := w.WriteRows(rows)
                require.NoError(t, err)
                require.NoError(t, w.Close())

                r := bytes.NewReader(storage.Bytes())
                f, err := parquet.OpenFile(r, int64(storage.Len()))
                require.NoError(t, err)
                return f.RowGroups()[0]
            },
        },
        {
            name: "Buffer",
            getRowGroup: func(t *testing.T) parquet.RowGroup {
                t.Helper()

                b := parquet.NewBuffer(schema)
                _, err := b.WriteRows(rows)
                require.NoError(t, err)
                return b
            },
        },
    }

    for _, testCase := range tests {
        t.Run(testCase.name, func(t *testing.T) {
            rowGroup := testCase.getRowGroup(t)

            chunk := rowGroup.ColumnChunks()[0]
            idx := chunk.ColumnIndex()
            val := idx.MinValue(0)
            columnType := chunk.Type()

            _ = columnType.NewDictionary(0, 1, columnType.NewValues(val.Bytes(), []uint32{0, uint32(len(val.Bytes()))}))
        })
    }
}

In the Buffer case, the test panics because NewValues creates int32 values since columnType is actually an indexedType (https://github.com/segmentio/parquet-go/blob/7efc157d28afda607e07e1f003e3c2c6922932df/dictionary.go#L1229), not a vanilla stringType as is the case with the Writer. However, NewDictionary passes through to the underlying string type, causing a type assertion error (again, only in the Buffer case):

panic: cannot convert values of type INT32 to type BYTE_ARRAY [recovered]
        panic: cannot convert values of type INT32 to type BYTE_ARRAY

goroutine 21 [running]:
testing.tRunner.func1.2({0x10524bfc0, 0x140002f8310})
        /opt/homebrew/opt/go/libexec/src/testing/testing.go:1396 +0x1c8
testing.tRunner.func1()
        /opt/homebrew/opt/go/libexec/src/testing/testing.go:1399 +0x378
panic({0x10524bfc0, 0x140002f8310})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:884 +0x204
github.com/segmentio/parquet-go/encoding.(*Values).assertKind(...)
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/parquet-go@v0.0.0-20220802221544-d84ed320251d/encoding/values.go:56
github.com/segmentio/parquet-go/encoding.(*Values).ByteArray(0x14000080d28?)
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/parquet-go@v0.0.0-20220802221544-d84ed320251d/encoding/values.go:109 +0xb4
github.com/segmentio/parquet-go.newByteArrayDictionary({0x1053250c0?, 0x1057d2778}, 0x0, 0x0?, {0x2, 0x0, {0x1400019ff80, 0x5, 0x8}, {0x0, ...}})
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/parquet-go@v0.0.0-20220802221544-d84ed320251d/dictionary.go:680 +0x50
github.com/segmentio/parquet-go.(*stringType).NewDictionary(0x0?, 0x0, 0x1, {0x2, 0x0, {0x1400019ff80, 0x5, 0x8}, {0x0, 0x0, ...}})
        /Users/asubiotto/go/pkg/mod/github.com/segmentio/parquet-go@v0.0.0-20220802221544-d84ed320251d/type.go:915 +0xb8
Pryz commented 2 years ago

Hi @asubiotto ! Thanks for opening this issue. I believe you have the right expectation and we should ensure the same behavior between the two type of writers.

Let us know if that is something you want to help us fix ! Otherwise we will try to work on it soon (next week if I have to guess).

asubiotto commented 2 years ago

Hi @Pryz, definitely happy to take a stab at it since it should be a good way to learn more about the parquet library (thanks for writing + maintaining btw). What would be most helpful to me would be to: 1) Understand the why behind indexedType. Why is it used and only in the Buffer case? Is it used anywhere else that I should be aware of? This is to help me understand whether any modifications I make here are within spec/desired behavior. 2) Where it makes sense to put a regression test for this bug. 3) If this issue is straightforward for you, what fix you would write for this bug based on your understanding of what's going on.

Thanks!

asubiotto commented 2 years ago

Friendly ping