jerolba / parquet-carpet

Java Parquet serialization and deserialization library using Java 17 Records
Apache License 2.0

Carpet for DataFrame to Parquet (de)serialization? #33

Open andrus opened 1 week ago

andrus commented 1 week ago

Hi @jerolba. First, I wanted to thank you for the great series of blog posts on Parquet and Java! It is the best piece of information on the topic I have found so far.

I have a question about the Carpet project... We (the DFLib Java DataFrame project) are looking to serialize/deserialize DataFrames to/from Parquet, and were planning to use parquet-avro (especially since we already support compiling DataFrames to Avro schemas). At the same time, Carpet looks appealing, as it uses the "normal" java.io classes and appears to have better performance. But it only targets a specific scenario: schemas based on Java records. So I wonder whether it can be easily extended to dynamic schemas (so we can accommodate DataFrames with arbitrary columns), or if that is even a scenario you'd care about?

Even if the answer is "no", I see a lot of value in Carpet as a guide for DFLib's own implementation (Hadoop excludes, java.io adapters, etc.), but I wanted to explore all the possibilities 🙂

jerolba commented 1 week ago

Hi @andrus. Thanks for your feedback. I struggled a lot trying to work with and understand Parquet, and I felt that better documentation was needed 🙂.

The Carpet library supports reading a Parquet file into Maps:

```java
List<Map> data = new CarpetReader<>(new File("my_file.parquet"), Map.class).toList();
```

I think this feature can help you implement the deserialization easily, but not the serialization. Carpet doesn't support serializing a collection of maps to Parquet because it doesn't know the schema in advance (and I don't want to reinvent a schema specification).
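For illustration, a minimal sketch of what that map-based read path could look like on the consumer side; the file and column names here are hypothetical:

```java
import java.io.File;
import java.util.List;
import java.util.Map;

import com.jerolba.carpet.CarpetReader;

public class MapReadSketch {
    public static void main(String[] args) throws Exception {
        // Read every row as a Map keyed by column name
        List<Map> rows = new CarpetReader<>(new File("my_file.parquet"), Map.class).toList();
        for (Map<?, ?> row : rows) {
            // The column names ("id", "name") are made up; a DataFrame loader
            // would discover the real ones from the keys of the first row
            Object id = row.get("id");
            Object name = row.get("name");
            System.out.println(id + " -> " + name);
        }
    }
}
```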

Does DFLib support nested data structures? Most of Carpet's code deals with the logic of record mapping consistency and recursive data structures (other records, maps, and collections). If your DataFrame is, for the moment, a flat table, I think it's easier to create your own implementation that wraps the parquet-mr library than to try to adapt Carpet to your requirements.
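For illustration, a minimal sketch of what wrapping parquet-mr directly could look like for a flat table, using its example Group API; the schema, file name, and column names below are made up:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class FlatWriteSketch {
    public static void main(String[] args) throws Exception {
        // A flat schema with one column per primitive type of interest
        MessageType schema = MessageTypeParser.parseMessageType(
            "message dataframe { "
            + "required int32 id; "
            + "required double score; "
            + "optional binary name (UTF8); "
            + "}");

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("df.parquet"))
                .withType(schema)
                .build()) {
            SimpleGroupFactory groups = new SimpleGroupFactory(schema);
            // One Group per DataFrame row
            writer.write(groups.newGroup()
                .append("id", 1)
                .append("score", 0.5)
                .append("name", "first"));
        }
    }
}
```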

Let me do a spike to see how I can implement, based on DFLib interfaces, a new implementation that reads/writes a Parquet file with int, double, and String values. With those insights, I can give you more feedback about the challenges, or even create a PR to start with.
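As a rough illustration of the read side of that spike, a minimal sketch using parquet-mr's example Group API; the df.parquet file and column names are hypothetical, matching the write sketch above:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class FlatReadSketch {
    public static void main(String[] args) throws Exception {
        try (ParquetReader<Group> reader = ParquetReader
                .builder(new GroupReadSupport(), new Path("df.parquet"))
                .build()) {
            Group row;
            while ((row = reader.read()) != null) {
                // Index 0 because each flat field holds a single value;
                // a real loader would also handle missing optional values
                int id = row.getInteger("id", 0);
                double score = row.getDouble("score", 0);
                String name = row.getString("name", 0);
                System.out.println(id + " " + score + " " + name);
            }
        }
    }
}
```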

andrus commented 1 week ago

> Does DFLib support nested data structures?

For most use cases it is flat, but since DataFrame columns can hold any objects, I have recently received some requests for nesting support on the Avro side. Some folks have suggested they may have cases where they'd even put DataFrames in the column cells of a DataFrame. We may be looking in that direction eventually.

> If your DataFrame is, for the moment, a flat table, I think it's easier to create your own implementation that wraps the parquet-mr library than to try to adapt Carpet to your requirements.

Yeah, flat would be a good start. We already have flat schema support in the Avro serializer: https://github.com/dflib/dflib/blob/main/dflib-avro/src/main/java/org/dflib/avro/schema/AvroSchemaCompiler.java (that may eventually be extended to nested structures). Maybe it will be of use with Parquet as well.
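For illustration, a hypothetical Parquet counterpart of that schema compiler could build a flat MessageType programmatically with parquet-mr's Types builder; the column names and types here are made up:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class ParquetSchemaCompilerSketch {
    // Hypothetical analogue of AvroSchemaCompiler: derive a flat Parquet
    // MessageType from column names and types instead of an Avro Schema
    static MessageType compileFlatSchema() {
        return Types.buildMessage()
            .required(PrimitiveTypeName.INT32).named("id")
            .required(PrimitiveTypeName.DOUBLE).named("score")
            .optional(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType()).named("name")
            .named("dataframe");
    }
}
```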

> Let me do a spike to see how I can implement, based on DFLib interfaces, a new implementation that reads/writes a Parquet file with int, double, and String values. With those insights, I can give you more feedback about the challenges, or even create a PR to start with.

Much appreciated!