mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/

Efficient way to read big files? #347

Closed by egorsmth 6 months ago

egorsmth commented 6 months ago

I need to read files in a paginated way. I tried 2 options:

1) `parquetReader.iterator.slice(offset, offset + limit)`
2) `RecordFilter(index => index >= offset && index < offset + limit)`

The first option is pretty fast at the beginning of the file but slows down as we move towards the end; overall it is rather slow in my case. The second option reads each "page" in a consistent time, but each read is rather slow compared with the first option's reads at the beginning of the file.
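For reference, here is roughly how the two approaches look end to end (a minimal sketch assuming parquet4s 2.x; the `Event` case class, file path, and page values are placeholders):

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path, RecordFilter}

// Placeholder record type and file path, for illustration only.
case class Event(id: Long, payload: String)
val path = Path("events.parquet")
val (offset, limit) = (100000, 1000)

// Option 1: open the file once and skip records by advancing the iterator.
val iterable = ParquetReader.as[Event].read(path)
try iterable.iterator.slice(offset, offset + limit).foreach(println)
finally iterable.close()

// Option 2: push the pagination into the reader via RecordFilter.
val page = ParquetReader.as[Event]
  .filter(RecordFilter(i => i >= offset && i < offset + limit))
  .read(path)
try page.foreach(println)
finally page.close()
```

Note that `Iterator.slice` only advances the iterator, so every record before `offset` is still read and decoded, which is consistent with later pages costing more in option 1.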

What is the right way to read big files?

mjakubowski84 commented 6 months ago

I am not sure what could be causing iterator + slice to get slower over time, especially since I do not know the rest of your code. Maybe you are loading the whole file into memory.

The second option can be quite slow in general because you are opening the file each time.

To avoid memory issues while keeping performance high, I recommend using one of the reactive integrations that Parquet4S supports: Akka, Pekko, or FS2.
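For example, a minimal FS2 sketch, assuming the parquet4s-fs2 module and cats-effect 3 (the `Event` case class, file path, and page values are placeholders):

```scala
import cats.effect.{IO, IOApp}
import com.github.mjakubowski84.parquet4s.Path
import com.github.mjakubowski84.parquet4s.parquet.fromParquet

// Placeholder record type, for illustration only.
case class Event(id: Long, payload: String)

object PaginatedRead extends IOApp.Simple {
  val (offset, limit) = (100000L, 1000L)

  val run: IO[Unit] =
    fromParquet[IO]
      .as[Event]
      .read(Path("events.parquet")) // Stream[IO, Event], read lazily in chunks
      .drop(offset)                 // skip to the requested page
      .take(limit)                  // stop after one page
      .evalMap(event => IO.println(event))
      .compile
      .drain
}
```

The stream is consumed lazily and in constant memory, and `take(limit)` terminates it early, so the file is released as soon as the page has been emitted.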

egorsmth commented 6 months ago

Yep, I guess I have some problem with the whole file being loaded each time. I will try FS2, thanks.