More efficient row filtering

hyparam / hyparquet

parquet file parser for javascript

MIT License


I am using rowStart and rowEnd to filter rows, which works as advertised, but I am seeing performance problems. It looks like the library assembles all of the data from every relevant row group and only then slices off the undesired portion. If I want just a single row but my row group size is relatively large (e.g. 1 GB), the heap still grows very large, so there doesn't seem to be much memory benefit to using rowStart or rowEnd.

Looking through the code, it seems like the library could avoid holding onto rows that fall outside the requested row window. Does this problem resonate at all? I wonder if there are any plans to make this more efficient. I might be able to get some bandwidth to help with a fix if it seems doable/useful.
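
To make the idea concrete, here is a rough sketch of what I have in mind. It is not hyparquet's actual code: planRowWindow and takeWindow are hypothetical helpers, and I am assuming rowEnd is exclusive. The first function maps the global row window onto per-row-group local ranges, and the second drops out-of-window rows while they are being produced instead of slicing a fully assembled array afterwards.

```javascript
// Hypothetical helper, not part of hyparquet: map a global [rowStart, rowEnd)
// window onto per-row-group local ranges. Row groups that do not overlap the
// window are left out of the plan entirely.
function planRowWindow(rowGroupSizes, rowStart, rowEnd) {
  const plan = []
  let groupStart = 0 // global index of the first row in the current group
  for (let i = 0; i < rowGroupSizes.length; i++) {
    const groupEnd = groupStart + rowGroupSizes[i]
    if (groupEnd > rowStart && groupStart < rowEnd) {
      plan.push({
        rowGroup: i,
        // Offsets relative to the start of this row group
        localStart: Math.max(rowStart - groupStart, 0),
        localEnd: Math.min(rowEnd, groupEnd) - groupStart,
      })
    }
    groupStart = groupEnd
  }
  return plan
}

// Hypothetical helper: yield only rows in [localStart, localEnd) as they are
// assembled, and stop early once the window has been passed.
function* takeWindow(rows, localStart, localEnd) {
  let index = 0
  for (const row of rows) {
    if (index >= localEnd) return
    if (index >= localStart) yield row
    index++
  }
}
```

With three row groups of 100 rows each, planRowWindow([100, 100, 100], 150, 151) selects only the second group with a local range of 50 to 51, so only one row from that group would need to be retained. The column chunks still have to be decoded up to that point, but the rows before and after the window would not need to stay on the heap.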


More efficient row filtering #20