tlabs-data / tablesaw-parquet

Parquet IO for Tablesaw
Apache License 2.0

Reading row groups #83

Open tischi opened 2 weeks ago

tischi commented 2 weeks ago

Hi,

Thanks a lot for developing and maintaining this super useful library!

I was wondering about reading "row groups"; is that possible?

The way I understand it, this should allow one to lazy-load a subset of rows from a large dataset?

ccleva commented 1 week ago

Hi @tischi, thanks for your kind words!

As far as I know it is not possible directly: row groups are an internal parquet structure used for predicate pushdown. Predicate pushdown is not yet implemented (and there is no plan to implement it in the near future).

If your use case is to read only the first N rows we could add that feature in the options, let me know if this would fit your needs.
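For reference, today the whole file is read in a single call. Below is a minimal sketch, assuming the reader and options classes described in this project's README; the commented-out line only illustrates what a hypothetical first-N-rows option could look like, it does not exist yet:

```java
import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import tech.tablesaw.api.Table;

public class ReadWholeFile {
    public static void main(String[] args) {
        // Reads the entire parquet file into memory as a Tablesaw Table.
        Table table = new TablesawParquetReader()
            .read(TablesawParquetReadOptions.builder("data.parquet").build());
        System.out.println(table.shape());

        // Hypothetical option (not implemented): only read the first N rows.
        // TablesawParquetReadOptions.builder("data.parquet").withMaxRows(1000).build();
    }
}
```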

tischi commented 1 week ago

Hi @ccleva,

Let me give you some context:

Currently I am not using Parquet, but I am exploring the options.

My main application, ideally, would be lazy loading from a big S3-hosted table into a Java JTable, such that data is only fetched while people move around in the JTable UI, loading both rows and columns on demand from the object store. It would become part of this Java application: https://www.nature.com/articles/s41592-023-01776-4

There, we can already lazy-load the image data very well using an S3 Zarr backend, but the tables are currently stored as CSV on GitHub and must be downloaded as a whole (we use Tablesaw and JTable for reading and visualising the tables).
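To make the idea more concrete, here is roughly the Swing-side pattern I have in mind; it is only a sketch, and RowBlockSource is a hypothetical interface that would need to be backed by whatever remote access turns out to be possible:

```java
import javax.swing.table.AbstractTableModel;
import java.util.HashMap;
import java.util.Map;

/** Sketch of an on-demand JTable model: rows are fetched in blocks only when the UI asks for them. */
public class LazyTableModel extends AbstractTableModel {

    /** Hypothetical supplier of row blocks, e.g. backed by partitioned remote files. */
    public interface RowBlockSource {
        int rowCount();
        String[] columnNames();
        Object[][] fetchBlock(int firstRow, int lastRowExclusive);
    }

    private static final int BLOCK_SIZE = 1024;
    private final RowBlockSource source;
    private final Map<Integer, Object[][]> cache = new HashMap<>();

    public LazyTableModel(RowBlockSource source) {
        this.source = source;
    }

    @Override public int getRowCount() { return source.rowCount(); }
    @Override public int getColumnCount() { return source.columnNames().length; }
    @Override public String getColumnName(int col) { return source.columnNames()[col]; }

    @Override
    public Object getValueAt(int row, int col) {
        int block = row / BLOCK_SIZE;
        // Fetch a block the first time any of its cells is displayed, then keep it cached.
        Object[][] rows = cache.computeIfAbsent(block, b -> source.fetchBlock(
                b * BLOCK_SIZE, Math.min((b + 1) * BLOCK_SIZE, source.rowCount())));
        return rows[row - block * BLOCK_SIZE][col];
    }
}
```

A JTable built on such a model (new JTable(new LazyTableModel(source))) would only trigger fetches for the row blocks that actually scroll into view.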

Do you know whether such lazy-loading of rows and columns from Parquet/S3 into a JTable should, in principle, be possible?

ccleva commented 1 week ago

Thanks for the context, I understand better.

I don't think it would work out of the box with just parquet files on S3 and Tablesaw: parquet file metadata is stored after the data and needs to be read first, so remote files are in practice fully loaded locally before reading.

In the usual architecture there would be a middleware system (Spark, Dremio, AWS Athena, etc.) that lets you efficiently query the data stored in parquet and returns only the requested subset to your local client.

You could (and should) partition the data for faster access, but then you will need to implement the partition access logic yourself.
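As a minimal sketch of what such partition access logic could look like, assuming one parquet file per fixed-size row range (the file layout and PARTITION_SIZE are only illustrative assumptions, not a prescribed scheme) and the reader classes from this project:

```java
import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import tech.tablesaw.api.Table;

public class PartitionedTableAccess {

    // Assumed layout: rows [0, PARTITION_SIZE) live in part-00000.parquet,
    // rows [PARTITION_SIZE, 2 * PARTITION_SIZE) in part-00001.parquet, and so on.
    private static final int PARTITION_SIZE = 100_000;
    private final String baseDir;

    public PartitionedTableAccess(String baseDir) {
        this.baseDir = baseDir;
    }

    /** Loads only the partition file that contains the requested global row index. */
    public Table loadPartitionContaining(int globalRowIndex) {
        int partition = globalRowIndex / PARTITION_SIZE;
        String file = String.format("%s/part-%05d.parquet", baseDir, partition);
        return new TablesawParquetReader()
            .read(TablesawParquetReadOptions.builder(file).build());
    }
}
```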

Having said that, reading data from parquet files will be faster than reading from CSV files, in particular for remote files, thanks to the much smaller file size.

Hope this helps, let me know if you need more information. And kudos for MoBIE!

tischi commented 1 week ago

Thank you!

That's a bit of a bummer ;-)

I actually hoped that Parquet would "natively" support chunked loading and writing (from S3...).

I googled a bit and it seems that the Python pyarrow library does support both chunked loading and chunked saving?!

https://blog.clairvoyantsoft.com/efficient-processing-of-parquet-files-in-chunks-using-pyarrow-b315cc0c62f9

For the saving, as you mentioned, one ends up with a number of files, but the library seems to help quite a bit with creating those chunk files, so that would be fine for me.

And for the loading, I would hope that parquet_file.read_row_group(i, i + chunk_size) would be smart enough to only touch the files that contain relevant data?

What do you think? Is it that the parquet file format in principle would support chunking, but maybe the current Java implementations don't have all the features?

ccleva commented 1 week ago

Yes, pyarrow supports chunking, partitioning, predicate pushdown, and reading from S3. There are some nice examples here.

Other libraries might also provide this (like awswrangler), but it is not a "native" feature of parquet; it is rather a feature of the library reading and writing the files.

tischi commented 1 week ago

I see, and there is no Java equivalent to pyarrow...?

ccleva commented 1 week ago

The Java Arrow implementation (Arrow is an in-memory data format) includes support for reading parquet files, but some of the features for manipulating the data afterwards are not pure Java implementations.

This is why I started this project: I wanted a pure-Java dataset implementation (Tablesaw is kind of a pandas equivalent) that can be used with parquet files...