kylebarron opened this issue 2 years ago
The JS / TypeScript version of the loader has not yet been optimized. The batches are read out row-by-row by a "row iterator" and then concatenated.
This can easily be made much faster. A good WASM loader can probably be faster than JS, but given Parquet's block-based memory loading model, I doubt the performance difference between the two implementations would be significant. JS may also have a smaller overall bundle size, though that is less of an issue if the code is loaded dynamically.
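As a rough illustration of the difference (hypothetical types and function names, not the actual loaders.gl internals):

```ts
// Sketch only: contrasts a row-iterator + concatenation pattern with a
// block/columnar copy. All names here are illustrative.

type Batch = {ids: Int32Array; values: Float64Array};

// Row-iterator style: yield one JS object per row, then concatenate.
// Every row allocates an object and copies each cell individually.
function* rowIterator(batch: Batch): Generator<{id: number; value: number}> {
  for (let i = 0; i < batch.ids.length; i++) {
    yield {id: batch.ids[i], value: batch.values[i]};
  }
}

function readRowByRow(batches: Batch[]): {id: number; value: number}[] {
  const rows: {id: number; value: number}[] = [];
  for (const batch of batches) {
    for (const row of rowIterator(batch)) {
      rows.push(row); // one allocation + per-cell copy for every row
    }
  }
  return rows;
}

// Block/columnar style: one bulk typed-array copy per column per batch.
function readColumnar(batches: Batch[]): Batch {
  const total = batches.reduce((n, b) => n + b.ids.length, 0);
  const ids = new Int32Array(total);
  const values = new Float64Array(total);
  let offset = 0;
  for (const batch of batches) {
    ids.set(batch.ids, offset);
    values.set(batch.values, offset);
    offset += batch.ids.length;
  }
  return {ids, values};
}
```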
I was trying to do a simple benchmark of the JS parquet library in `modules/parquet`. With this example Parquet file (1 million rows, 1 row group, no compression) I got a `Maximum call stack size exceeded` error (traceback below).

I figured this might have something to do with having 1 million rows in a single row group, so I tried the same file with 20 row groups (i.e. 50,000 rows in each row group). This file worked, but took 29.949s; for comparison, a benchmark with the same file using the wasm loader took around 62ms (both in Node v16.14.0).
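For reference, a minimal timing harness along these lines (assuming the standard `load` / `ParquetLoader` entry points from `@loaders.gl/core` and `@loaders.gl/parquet`; the file path is a placeholder) would look roughly like:

```ts
import {load} from '@loaders.gl/core';
import {ParquetLoader} from '@loaders.gl/parquet';

// Times a single load of a Parquet file with the JS loader.
// Running this in Node may require polyfills (see the Blob issue noted below).
async function benchmark(path: string): Promise<void> {
  const start = process.hrtime.bigint();
  const data = await load(path, ParquetLoader);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`Loaded ${path} in ${elapsedMs.toFixed(1)} ms`, data);
}

benchmark('data/1m-rows-20-row-groups.parquet').catch(console.error);
```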
Given these results, I'd like to get the wasm parquet loader in https://github.com/visgl/loaders.gl/pull/2103 cleaned up sometime soon.
I couldn't get the ParquetLoader to work in a standalone NPM project; even after installing polyfills I kept getting `Blob is not defined` errors. The only way I could get the ParquetLoader to work was in the existing test cases, so I just modified one of the existing tests to load these new files. The call stack error occurs when loading the single-row-group Parquet file.