jerolba / parquet-carpet

Java Parquet serialization and deserialization library using Java 17 Records
Apache License 2.0

Proposal for new functionality - Annotation to set a null marker value for numeric types #37

Closed javafanboy closed 2 months ago

javafanboy commented 2 months ago

In many cases it would be valuable to be able to mark numeric attributes (= Parquet columns) as nullable in an efficient way.

Since Carpet uses a "record" to ingest data, one way to specify that a numeric value should be nullable would be an annotation that also specifies the value used to represent "NULL". This will not work in all situations, but in quite a lot of cases one may know, for instance, that only non-negative values will be used, and then -1 could be specified to represent "NULL". When the marker is found, Carpet would write NULL into the Parquet file.

Another way (already supported by Carpet) is to use the wrapper classes Integer, Short, etc., but these are expensive to create (allocated on the heap, for starters) and are not preferred for performance-sensitive use cases, so a marker annotation would be preferable.
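The marker idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `@NullMarker` annotation, the `Measurement` record, and the `valueOrNull` helper are hypothetical names invented for this example and are not part of Carpet's API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.RecordComponent;

public class NullMarkerSketch {

    // Hypothetical annotation (not part of Carpet): names the sentinel that stands for NULL.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.RECORD_COMPONENT)
    @interface NullMarker {
        long value();
    }

    // Example record: readings are assumed to be non-negative, so -1 can mean "no value".
    record Measurement(String sensor, @NullMarker(-1) int reading) {}

    // Returns null when the component holds its declared sentinel, otherwise the value itself.
    static Object valueOrNull(Object rec, RecordComponent component) {
        try {
            Object value = component.getAccessor().invoke(rec);
            NullMarker marker = component.getAnnotation(NullMarker.class);
            if (marker != null && value instanceof Number n && n.longValue() == marker.value()) {
                return null;
            }
            return value;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RecordComponent reading = Measurement.class.getRecordComponents()[1];
        System.out.println(valueOrNull(new Measurement("a", 42), reading)); // prints 42
        System.out.println(valueOrNull(new Measurement("b", -1), reading)); // prints null
    }
}
```

A writer built along these lines would skip the column value whenever the helper returns null. Note that the value still passes through a boxed `Number` here, because it is read reflectively.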

jerolba commented 2 months ago

I have implemented Carpet to make it as fast as possible, but it has a problem: it uses reflection to access Java record components.

When using reflection to deal with primitive values, you always get the boxed instance. I couldn't find a way to avoid this without generating bytecode. I found that using method handles is the fastest way to perform reflection, but in any case, you still end up with a wrapped instance.

As a result, in Carpet, there is no performance difference between using primitive or boxed values, because at some point, Carpet uses a boxed instance (which is immediately discarded and collected by the GC in the young generation).
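The boxing behaviour described above can be reproduced with a small standalone example. This is a sketch, not Carpet's actual code: the `Point` record and both helper methods are made up for illustration. It reads an `int` record component once through `Method.invoke` and once through a `MethodHandle` called with a generic `Object` result, and in both cases the value arrives as an `Integer`.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Method;

public class BoxedAccessSketch {

    // Illustrative record with primitive components.
    record Point(int x, int y) {}

    // Core reflection: Method.invoke returns Object, so the int is boxed.
    static Object readWithReflection(Point p) {
        try {
            Method accessor = Point.class.getRecordComponents()[0].getAccessor();
            return accessor.invoke(p);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Method handle: asking for an Object result makes the handle box the int as well.
    static Object readWithMethodHandle(Point p) {
        try {
            Method accessor = Point.class.getRecordComponents()[0].getAccessor();
            MethodHandle handle = MethodHandles.lookup().unreflect(accessor);
            Object value = handle.invoke(p);
            return value;
        } catch (Throwable e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println(readWithReflection(p).getClass().getSimpleName());   // prints Integer
        System.out.println(readWithMethodHandle(p).getClass().getSimpleName()); // prints Integer
    }
}
```

A call site that knows the record type at compile time can write `(int) handle.invokeExact(p)` and receive a primitive, but a generic library generally cannot write such a call site for arbitrary user records, which is consistent with the limitation described in this comment.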

In my tests, Carpet's serialization performance is better than that of Parquet Avro and Parquet Protobuf:

If you find a faster way to serialize Parquet primitive data in Java, we can compare it with Carpet and assess the margin for improvement if I manage to work with primitives directly.

javafanboy commented 2 months ago

Ok, good call-out. I can then use the wrapper types without further performance loss to mark up a few additional columns I would like to make nullable.

And yes, the only way to make Carpet radically faster while keeping it as "user friendly" would be dynamic bytecode generation of the serialization code based on the classes, with something like ASM (https://asm.ow2.io/). That is probably not a "trivial" thing to implement.

Maybe Parquet generation is in itself so slow that the extra overhead of reflection here is minimal, so it is not worth the added complexity to increase performance further.

On Sun, Aug 18, 2024 at 8:56 PM Jeronimo López wrote:

In my tests https://www.jeronimo.dev/working-with-parquet-files-in-java-using-carpet/#performance, Carpet's serialization performance is better than that of Parquet Avro and Parquet Protobuf:

  • Parquet Avro: 15,381 ms
  • Parquet Protocol Buffers: 16,174 ms
  • Carpet: 12,769 ms
