jerolba / parquet-carpet

Java Parquet serialization and deserialization library using Java 17 Records
Apache License 2.0

Carpet for DataFrame to Parquet (de)serialization? #33

Closed · andrus closed this 3 months ago

andrus commented 5 months ago

Hi @jerolba . First, I wanted to thank you for the great series of blog posts on Parquet and Java! It is the best piece of information on the topic I've found so far.

I have a question about the Carpet project... We (the DFLib Java DataFrame project) are looking to serialize/deserialize DataFrames to/from Parquet, and were planning to use parquet-avro (especially since we already support compiling DataFrames to Avro schemas). At the same time, Carpet looks appealing, as it uses the "normal" java.io classes and appears to have better performance. But it only targets a specific scenario: schemas based on Java records. So I wonder if it can easily be extended to dynamic schemas (so we can accommodate DataFrames with arbitrary columns), or whether that's even a scenario you'd care about?

Even if the answer is "no", I see a lot of value in Carpet as a guide for DFLib's own implementation (Hadoop excludes, java.io adapters, etc.), but I wanted to explore all the possibilities 🙂

jerolba commented 5 months ago

Hi @andrus. Thanks for your feedback. I struggled a lot trying to work with and understand Parquet, and I felt that better documentation was needed 🙂.

The Carpet library supports reading a Parquet file into Maps:

List<Map> data = new CarpetReader<>(new File("my_file.parquet"), Map.class).toList();

I think this feature can help you implement the deserialization easily, but not the serialization. Carpet doesn't support serializing a collection of maps to Parquet because it doesn't know the schema in advance (and I don't want to reinvent a schema specification).
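
For the read side, something like this could already get you to per-column data (the column names here are invented, and the plain lists are just a stand-in for DFLib's own column structures):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import com.jerolba.carpet.CarpetReader;

// Read every row as a Map, then pivot the rows into per-column lists.
List<Map> rows = new CarpetReader<>(new File("my_file.parquet"), Map.class).toList();
List<Object> ids = new ArrayList<>();
List<Object> names = new ArrayList<>();
for (Map<?, ?> row : rows) {
    ids.add(row.get("id"));      // "id" and "name" are illustrative column names
    names.add(row.get("name"));
}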

Does DFLib support nested data structures? Most of Carpet's code deals with the logic of Record mapping consistency and recursive data structures (other records, maps, and collections). If your DataFrame is, for the moment, a flat table, I think it's easier to create your own implementation that wraps the parquet-mr library than to adapt Carpet to your requirements.

Let me do a spike to see how, based on DFLib interfaces, I can implement something that reads/writes a Parquet file with int, double, and String values. With those insights, I can give you more feedback about the challenges, or even create a PR to start with.
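
To make the idea concrete, here is a rough sketch of what the flat write could look like on top of parquet-mr's example Group API (the schema, file name, and values are invented, and a real implementation would probably plug DFLib into a custom WriteSupport instead of using Group):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Flat schema with the three value types mentioned above (UTF8 marks a String column).
MessageType schema = MessageTypeParser.parseMessageType(
        "message row { required int32 id; required double score; required binary name (UTF8); }");
SimpleGroupFactory groups = new SimpleGroupFactory(schema);
try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("rows.parquet"))
        .withType(schema)
        .build()) {
    // One Group per row; append matches fields by name.
    Group row = groups.newGroup();
    row.append("id", 1).append("score", 0.5).append("name", "foo");
    writer.write(row);
}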

andrus commented 5 months ago

Does DFLib support nested data structures?

For most use cases it is flat, but since DataFrame columns can hold any objects, I recently got some requests for nesting support on the Avro side. Some folks have suggested they may even have cases where they'd put DataFrames in the column cells of a DataFrame. We may be heading in that direction eventually.

If your DataFrame is, for the moment, a flat table, I think it's easier to create your own implementation that wraps the parquet-mr library than to adapt Carpet to your requirements.

Yeah, flat would be a good start. We already have flat schema support in the Avro serializer - https://github.com/dflib/dflib/blob/main/dflib-avro/src/main/java/org/dflib/avro/schema/AvroSchemaCompiler.java (that may get extended to nested structures eventually). Maybe it will be of use with Parquet as well.

Let me do a spike to see how, based on DFLib interfaces, I can implement something that reads/writes a Parquet file with int, double, and String values. With those insights, I can give you more feedback about the challenges, or even create a PR to start with.

Much appreciated!

andrus commented 3 months ago

Hi @jerolba, I wonder if you've had any success with the experiment above? I was planning to dedicate some time to DFLib Parquet work in the next few weeks. Of course, it wouldn't be a problem to start from scratch, but I didn't want any effort on your side to go to waste :)

jerolba commented 3 months ago

Hi @andrus

I implemented some code without using Carpet (just the parquet-mr library), but I left it half done because of vacations. Give me a couple of days to clean it up and add some tests.

andrus commented 3 months ago

Great, looking forward to it! Yeah, after spending a few hours yesterday poking around the Parquet code, I also feel we should go low-level in DFLib, as we'll need to manually control all aspects of (de)serialization (schema reading, column selection, value conversions, etc.). I tried Parquet Java 1.14.1 (which presumably does not require Hadoop dependencies), but quickly ran into issues with missing classes... from Hadoop. So I'll hold off until I see your code.
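
To illustrate the kind of control I mean, here is a rough sketch of a projected read with parquet-mr's example Group API (the read schema and column names are invented, and error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

Configuration conf = new Configuration();
// Column selection: request only a subset of the file schema.
conf.set(ReadSupport.PARQUET_READ_SCHEMA,
        "message row { required int32 id; required binary name (UTF8); }");
try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path("rows.parquet"))
                .withConf(conf)
                .build()) {
    for (Group row; (row = reader.read()) != null; ) {
        int id = row.getInteger("id", 0);        // index 0 = first repetition of the field
        String name = row.getString("name", 0);  // value conversion happens here
    }
}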

Also noticed that the Parquet project says parquet-mr is now "Parquet Java" (https://github.com/apache/parquet-java). Which version of Parquet libs are you using?

jerolba commented 3 months ago

Yes, internally there is still a dependency on a Hadoop Configuration class that forces you to include Hadoop, even if you don't use it.

They recently renamed the root project, but the Maven dependency names have not changed.

I'm using version 1.14.1. To deal with the Hadoop transitive dependency hell, I'm excluding all the unneeded dependencies in the pom.xml, as I did in Carpet.
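
For illustration, the shape of those exclusions in the pom.xml (the versions and the wildcard exclusion are illustrative; this assumes hadoop-common is enough for the Configuration class, and Carpet's own pom is the reference for a known-good set):

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.14.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.4.0</version>
    <exclusions>
        <!-- Drop everything transitive; add pieces back only if something breaks at runtime -->
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>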

jerolba commented 3 months ago

@andrus FYI: https://github.com/dflib/dflib/pull/315

We can continue the conversation in the PR. I tried to follow your code style.