mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Why reading a lot of partitioned files is slow #358

Closed egorsmth closed 1 month ago

egorsmth commented 2 months ago

I have a directory of partitioned Parquet files: 60 folders with 1400 files each. The files are small, about 100 KB each. Reading them, even a single partition (one folder), takes so long that it exceeds the timeout on my server (several minutes).

Another case: partitioning into 60 folders with 10 files of 6 MB each per folder takes 3 minutes to read.

Why is it so slow? Does it go through every file even though I only need one partition? And is there any way to make it faster?

Partitioning into bigger files solves the problem, but in my situation I can't do that because of another service in the department. I guess I need to look into some database or maybe an Apache Hudi solution, but maybe I could solve it somehow with Parquet.

mjakubowski84 commented 2 months ago

You do not mention what kind of file system you are using, but it seems to be a typical problem with file-based storage.

  1. Your files are too small
    • In order to benefit from Parquet at all, files must be big enough, at least 100-200 MB. Check the docs at https://parquet.apache.org/ and read about the concepts.
  2. There are too many partitions
    • Too granular partitioning leads to too many small files
  3. In effect, there are too many storage objects (directories and files) compared to the amount of data you want to read.
    • Listing and fetching metadata of storage objects is usually an expensive operation (often more expensive than reading the file's content!). This becomes a prominent issue with a distributed file system: since information about each object can be stored on a different machine or even in a different data center, listing a huge file tree can take many minutes.

What you can do:

  1. Make your files bigger by:
    • having fewer partitions and relying more on filtering by Parquet metadata (during reading)
    • introducing a compaction job which periodically merges many small files into fewer, bigger ones (see the sketch after this list)
  2. Look into the settings of your file system (server & client). Solutions like GCS or S3 have options to make IO ops faster (e.g. for more money).
  3. File storage is meant to serve as cold storage for huge amounts of data. Your use case seems to be the opposite: you require fast(er) reads of tiny files. Indeed, some kind of database might be a better option.
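
For the compaction idea above, a minimal sketch with the parquet4s core module could look like the following. The `Event` case class and the paths are hypothetical placeholders; substitute your own schema and directory layout.

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, ParquetWriter, Path}

// Hypothetical record type - replace it with your own schema.
case class Event(id: String, value: Double)

// Reads all the small files under one partition directory and rewrites
// them as a single bigger file. Run it periodically, partition by partition.
def compactPartition(partitionDir: Path, targetFile: Path): Unit = {
  val records = ParquetReader.as[Event].read(partitionDir)
  try ParquetWriter.of[Event].writeAndClose(targetFile, records)
  finally records.close()
}
```

After the new file is written you still need to remove or swap out the old small files; how to do that safely depends on your storage and on whether readers run concurrently.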
mjakubowski84 commented 2 months ago

To answer the question "Does it go through every file even though I only need one partition?":

If you specify a filter which matches a Hive partition value, then the content of other partitions is neither listed nor read. This functionality was improved in https://github.com/mjakubowski84/parquet4s/releases/tag/v2.17.0
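
For example, a sketch with the core API (the `year` partition column and the record fields are assumptions; substitute your actual partition column and schema):

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

// The Hive partition column shows up as a (String) field of the record.
case class Event(id: String, value: Double, year: String)

// Only the year=2024 subtree is listed and read; other partitions are skipped.
val events = ParquetReader
  .as[Event]
  .filter(Col("year") === "2024")
  .read(Path("file:///data/events"))

try events.foreach(println)
finally events.close()
```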

egorsmth commented 2 months ago

Thank you for the response. I am trying to debug it, because right now it can't read a large number of small files, but at least it can read a small number of small files as fast as a bigger Parquet file. I will let you know if I find anything. Unfortunately, in my case I have to handle a lot of small Parquet files.

egorsmth commented 1 month ago

Seems like the issue was in my code.