Closed by egorsmth 1 month ago
You do not mention what kind of file system you are using, but it seems to be a typical problem with file-based storage.
What you can do:
To answer the question "Does it go through each file even though I only need one partition?":
If you specify a filter that matches a value of the Hive partition, then the content of the other partitions is neither listed nor read. This functionality was improved in https://github.com/mjakubowski84/parquet4s/releases/tag/v2.17.0
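A minimal sketch of such a filtered read, assuming parquet4s 2.x and a Hive-style layout such as `/data/events/day=2024-01-01/...`. The `Event` case class, the `day` partition column and the `readDay` helper are hypothetical names used only for illustration; adjust them to your own schema and paths.

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

// Hypothetical record type. `day` is assumed not to be stored in the files;
// it is taken from directory names such as day=2024-01-01 and read as a String.
case class Event(id: Long, payload: String, day: String)

object ReadOnePartition {
  // Reads only the rows whose partition value matches `day`.
  // The filter on the partition column is what lets the reader skip
  // the directories of the other partitions.
  def readDay(root: String, day: String): Vector[Event] = {
    val events = ParquetReader
      .as[Event]
      .filter(Col("day") === day)
      .read(Path(root))
    // The returned iterable holds open file handles, so close it when done.
    try events.toVector
    finally events.close()
  }
}
```

For example, `ReadOnePartition.readDay("/data/events", "2024-01-01")` would be expected to touch only the `day=2024-01-01` folder. The files inside that folder are still opened one by one, so a very large number of tiny files per partition can remain slow.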
Thank you for the response. I am trying to debug it, because right now it can't read a large number of small files. But at least it could read a small number of small files as fast as a bigger Parquet file. I will let you know if I find anything. Unfortunately, in my case I have to handle a lot of small Parquet files.
It seems the issue was in my code.
I have a directory of partitioned Parquet files: 60 folders with 1400 files each. The files are small, about 100 KB each. Reading them, even just one partition (one folder), takes so long that it goes beyond the timeout on my server (several minutes).
Another case: partitioning into 60 folders with 10 files of 6 MB each per folder takes 3 minutes.
Why is it so slow? Does it go through each file even though I only need one partition? And is there any way to make it faster?
Partitioning into bigger files solves the problem, but in my situation I can't do that because of another service in my department. I guess I need to look into some database or maybe an Apache Hudi solution, but maybe I could solve it somehow with Parquet.