asubiotto closed this issue 2 years ago
Hi @asubiotto! Thanks for opening this issue. I believe you have the right expectation, and we should ensure the same behavior between the two types of writers.
Let us know if that is something you want to help us fix! Otherwise we will try to work on it soon (next week, if I have to guess).
Hi @Pryz, definitely happy to take a stab at it since it should be a good way to learn more about the parquet library (thanks for writing + maintaining btw). What would be most helpful to me would be to:
1) Understand the why behind `indexedType`. Why is it used, and only in the `Buffer` case? Is it used anywhere else that I should be aware of? This is to help me understand whether any modifications I make here are within spec/desired behavior.
2) Where it makes sense to put a regression test for this bug.
3) If this issue is straightforward for you, what fix you would write for this bug based on your understanding of what's going on.
Thanks!
Friendly ping
In a simple scenario where I create a schema with a single bytes column, read the written page and attempt to create a dictionary from the observed type in the column index, I will get a panic depending on which type of writer I used to write the data. My expectation is that regardless of the writer used, I will observe the same behavior.
Here is a test illustrating what I mean:
In the `Buffer` case, the test panics because `NewValues` creates int32 values since `columnType` is actually an `indexedType` (https://github.com/segmentio/parquet-go/blob/7efc157d28afda607e07e1f003e3c2c6922932df/dictionary.go#L1229), not a vanilla `stringType` as is the case with the `Writer`. However, `NewDictionary` passes through to the underlying string type, causing a type assertion error (again, only in the `Buffer` case):