mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License
283 stars 65 forks

Favoring custom encoder over feasible derivation through nested case classes #247

Closed vreuter closed 2 years ago

vreuter commented 2 years ago

Intro I just ran into an issue trying to use a custom encoder I'd written for the type of one of the fields of a case class. When I declared that field's type as an ordinary class rather than a case class, though, my encoder was (seemingly, anyway) picked up and used in the derivation of the encoder for the aggregating case class.

Question I'm happy to describe the scenario in a bit more detail and/or produce a small example, but I'm wondering whether this is already a known issue, or whether there's a hypothesis about what I could have done, besides switching from case class to class, to get my encoder used in the derived encoder. Note: I defined the encoder for the subfield's type in the companion object of the subfield type, and then used a wildcard import from that companion object at the use site of ParquetWriter, where I call ParquetWriter.of[MyAggregateCaseClass].build(...)

Further description More specifically, my bigger/"outer" case class that represents the records I'm encoding has, among other fields, two fields that are themselves case classes. One wraps a single String field directly, and the other wraps a refinement type based on String. The refinement one was encoded fine (at least judging by the schema of the generated output), but the column corresponding to the simpler one (a case class wrapping a plain String) came out as a struct, despite the fact that I'd written the RequiredValueEncoder and TypedSchemaDef for the two types exactly analogously, with analogous placement (in the respective companion objects). It seems like the recursive derivation through case classes and supported types may have caused my encoder for the simpler field to be bypassed, whereas the inability to derive an encoder for the refined type forced my custom encoder to be picked up in the derivation?
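The setup described above can be modeled with a toy type class. This is only a sketch with illustrative names (`Enc`, `Study`); it is not parquet4s's actual API, just the pattern of defining an instance in the wrapper's companion object:

```scala
// Toy stand-in for parquet4s's encoder type class; all names here
// are illustrative, not the library's actual API.
trait Enc[A]:
  def schema: String

// A case class wrapping a plain String, with a custom "encoder"
// defined in its companion object, as in the issue description.
final case class Study(get: String)

object Study:
  given Enc[Study] with
    def schema = "string"

object DemoA:
  def main(args: Array[String]): Unit =
    // The companion-object given is found via the implicit scope of
    // Enc[Study], so the wrapper should encode as a plain string column.
    println(summon[Enc[Study]].schema) // prints "string"
```

With this placement, no wildcard import should even be needed at the use site, since companions of the involved types are part of the implicit scope.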

mjakubowski84 commented 2 years ago

It looks like the compiler was unable to find the encoder for your case class and picked up the default product encoder that creates records (structs). It is not a known issue, and it is hard to tell why it happens for you. Can you please provide a code snippet and the version of Scala that you use?
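Under standard Scala 3 given resolution, a specific companion-object instance should win over a generic product fallback, which is why the observed struct output points at a bug. A minimal sketch of that expectation, with toy names (not parquet4s internals):

```scala
// Toy model of the resolution at play; names are illustrative,
// not parquet4s internals.
trait Enc[A]:
  def schema: String

object Enc:
  // Generic fallback for any case class, analogous to the default
  // product encoder that produces records (structs).
  given product[A <: Product]: Enc[A] with
    def schema = "struct"

final case class Study(get: String)

object Study:
  // User-supplied instance in the companion object.
  given Enc[Study] with
    def schema = "string"

object DemoB:
  def main(args: Array[String]): Unit =
    // Both Enc.product and Study's given are eligible candidates;
    // standard resolution prefers the more specific companion instance.
    println(summon[Enc[Study]].schema) // prints "string"
```

If a library's derivation instead recurses through the product's fields directly, without first summoning an instance for the field type, the user-supplied given is never consulted and the struct fallback wins.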

vreuter commented 2 years ago

Hi @mjakubowski84 thanks a lot for the reply. Sure, I'll put together a minimal (or at least reduced) example and post it here once I do. I'm using Scala 3.1.0 and parquet4s 2.1.0.

vreuter commented 2 years ago

Hi @mjakubowski84 I've now created https://github.com/vreuter/parq4s_encode as a tiny repo to demo this issue. When that project is run with sbt run, it produces two output files, base.out.pqt and case.out.pqt, corresponding to the use of an ordinary class versus a case class, respectively. When I then check the schemas, I can see the difference, e.g. using pyarrow (sorry, I wasn't sure how to easily display a schema with parquet4s):

~/code/parq4s_encode$ ipython
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow.parquet as pqt

In [2]: tab1 = pqt.read_table("base.out.pqt")

In [3]: tab1.schema
Out[3]: 
key: int32 not null
  -- field metadata --
  PARQUET:field_id: '1'
study: string not null
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
MadeBy: 'https://github.com/mjakubowski84/parquet4s'

In [4]: tab2 = pqt.read_table("case.out.pqt")

In [5]: tab2.schema
Out[5]: 
key: int32 not null
  -- field metadata --
  PARQUET:field_id: '1'
study: struct<get: string>
  child 0, get: string
    -- field metadata --
    PARQUET:field_id: '3'
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
MadeBy: 'https://github.com/mjakubowski84/parquet4s'

vreuter commented 2 years ago

@mjakubowski84 please let me know if you have a chance to try out the tiny demo repo and run into any trouble, or if there's any more context I should provide about the issue. As in that demo repo's build.sbt, this is with Scala 3.1.0 and version 2.1.0 of this library.

mjakubowski84 commented 2 years ago

Thank you @vreuter. That is very helpful. I might not have time to investigate before the weekend, but I should have something next week.

mjakubowski84 commented 2 years ago

Oh, that was a challenging bug to fix! Making it work the same in all supported Scala versions without breaking the API was a nightmare. But I think I nailed it.

But as I said, it was a nasty bug, and the reason for it was missing tests. Custom types have been a feature of Parquet4s from the very beginning, but in the meantime both the library and Scala evolved, and the feature stopped working correctly.

vreuter commented 2 years ago

Awesome! Thanks a ton @mjakubowski84