hyparam / hyparquet

parquet file parser for javascript
MIT License

More efficient row filtering #20

Open andynsd opened 1 month ago

andynsd commented 1 month ago

I am using rowStart and rowEnd to filter rows, which works as advertised, but I am seeing some performance problems. It looks like the library assembles all of the data from the relevant row groups and then slices off the undesired portion after the fact. If I just want a single row but my row group size is relatively large (e.g. 1 GB), the heap size still gets very large. There doesn't seem to be much benefit to using rowStart or rowEnd.

Looking through the code, it seems like the library could avoid holding onto the rows that fall outside of the requested row window. Does this problem resonate at all? I wonder if there are any plans to make this more efficient. I might be able to get some bandwidth to help with a fix if it seems doable/useful.
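
To make the idea concrete, here is a rough sketch of what I mean by not holding onto rows outside the window. It is not hyparquet's actual code: the `pages` array, `numRows`, and `decode()` are hypothetical stand-ins, and it assumes a flat column where each page knows how many rows it holds.

```javascript
// Hypothetical model, not hyparquet internals: a column chunk is a list of
// pages, each with a row count and a decode() that materializes its values.
// Assumes a flat column, so one value corresponds to one row.
function readWindow(pages, rowStart, rowEnd) {
  const out = []
  let pageStart = 0 // row index of the first row in the current page
  for (const page of pages) {
    const pageEnd = pageStart + page.numRows
    if (pageStart >= rowEnd) break // past the window: stop early
    if (pageEnd > rowStart) {
      // Page overlaps the window: decode it and keep only the overlap
      const values = page.decode()
      const from = Math.max(rowStart - pageStart, 0)
      const to = Math.min(rowEnd - pageStart, page.numRows)
      for (let i = from; i < to; i++) out.push(values[i])
    }
    // Pages entirely before the window are never decoded
    pageStart = pageEnd
  }
  return out
}
```

With this shape, a single-row read only ever materializes one page's worth of values rather than the whole row group.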

platypii commented 1 month ago

This is absolutely something that I would like to see improved! There is already a rowLimit parameter to the readColumn function, which helps to stop parsing early if not all the rows are needed. But I agree that it could be improved.

One thing to be careful of is that raw column data may have a different length than the actual row start and end, because it gets assembled into lists and structs. That being said, I'm pretty sure that clever tricks could save significantly on heap size.
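
As a small illustration of that length mismatch, here is a sketch that relies only on the Parquet convention that a repetition level of 0 marks the start of a new row. The function name and return shape are hypothetical, not part of hyparquet. It maps a row window onto a range of level entries for a repeated column.

```javascript
// Hypothetical helper: find which repetition-level entries belong to rows
// [rowStart, rowEnd). A level of 0 starts a new row; levels > 0 continue
// the current row's list.
function levelRangeForRows(repetitionLevels, rowStart, rowEnd) {
  const n = repetitionLevels.length
  let row = -1
  let from = n // stays n if rowStart is beyond the last row
  let to = n // stays n if the window runs to the end of the data
  for (let i = 0; i < n; i++) {
    if (repetitionLevels[i] === 0) row++
    if (row === rowStart && from === n) from = i
    if (row === rowEnd) { to = i; break }
  }
  return { from, to }
}

// Three rows: [a, b, c], [d], [e, f] -> six level entries
const levels = [0, 1, 1, 0, 0, 1]
console.log(levelRangeForRows(levels, 1, 2)) // { from: 3, to: 4 }
```

Note that this indexes level entries, not stored values: nulls and empty lists have level entries but no stored value, so mapping to the values array would also need the definition levels.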

Contributions are most welcome! Happy to further discuss strategies here too.