visgl / loaders.gl

Loaders for big data visualization. Website:
https://loaders.gl

Parquet `Maximum call stack size exceeded` error & simple wasm benchmark #2144

Open kylebarron opened 2 years ago

kylebarron commented 2 years ago

I was trying to do a simple benchmark of the JS Parquet library in modules/parquet. With this example Parquet file (1 million rows, 1 row group, no compression) I got a `Maximum call stack size exceeded` error (stack trace below).

I figured this might have something to do with having 1 million rows in a single row group, so I tried the same file with 20 row groups (i.e. with 50,000 rows in each row group). This file worked, but took 29.949s; for comparison a benchmark with the same file using the wasm loader took around 62ms (both in Node v16.14.0).
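For reference, the timings above came from `console.time`; a minimal Node helper that also returns the duration (a sketch for illustration, not part of loaders.gl) could look like:

```javascript
// Hypothetical async timing helper. The test below uses
// console.time/console.timeEnd, which prints but doesn't return the value;
// process.hrtime.bigint() gives a monotonic nanosecond timestamp.
async function timeIt(label, fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)}ms`);
  return {result, ms};
}

// Usage (assuming the loaders.gl `load` call from the tests below):
// const {ms} = await timeIt('load Parquet', () => load(url, ParquetLoader, {worker: false}));
```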

Given these results, I'd like to get the wasm parquet loader in https://github.com/visgl/loaders.gl/pull/2103 cleaned up sometime soon.

I couldn't get the ParquetLoader to work in a standalone NPM project; even after installing polyfills I kept getting `Blob is not defined` errors. The only way I could get the ParquetLoader to work was in the existing test cases, so I just modified one of the existing tests to load these new files:

test.only('load file', async (t) => {
  const url = '@loaders.gl/parquet/test/data/20-partition-none.parquet';
  console.time('load Parquet');
  const data = await load(url, ParquetLoader, {parquet: {url}, worker: false});
  console.timeEnd('load Parquet');

  t.equal(data.length, 1000000);
  t.end();
});

Call stack error when trying to load this Parquet file using the code:

test.only('load file', async (t) => {
  const url = '@loaders.gl/parquet/test/data/1-partition-none.parquet';
  const data = await load(url, ParquetLoader, {parquet: {url}, worker: false});
  t.end();
});
not ok 1 RangeError: Maximum call stack size exceeded
  ---
    operator: error
    expected: |-
      undefined
    actual: |-
      [RangeError: Maximum call stack size exceeded]
    at: bound (/Users/kyle/github/mapping/loaders.gl/node_modules/onetime/index.js:30:12)
    stack: |-
      RangeError: Maximum call stack size exceeded
          at Object.decodeValues (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/codecs/rle.ts:95:14)
          at decodeValues (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:216:35)
          at decodeDataPage (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:276:15)
          at decodePage (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:105:20)
          at decodeDataPages (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:58:24)
          at ParquetEnvelopeReader.readColumnChunk (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-envelope-reader.ts:140:18)
          at processTicksAndRejections (node:internal/process/task_queues:96:5)
          at ParquetEnvelopeReader.readRowGroup (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-envelope-reader.ts:81:43)
          at ParquetCursor.next (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-cursor.ts:48:25)
          at parseParquetFileInBatches (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/lib/parse-parquet.ts:20:22)
  ...
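For what it's worth, a common cause of this exact `RangeError` in JS decoders (an assumption about `rle.ts`, not verified against the source) is spreading a large decoded array into a variadic call such as `values.push(...decoded)`, which places every element on the call stack. A plain loop keeps stack usage constant regardless of array size:

```javascript
// Hypothetical illustration of the pitfall; not the actual rle.ts code.
// `target.push(...source)` passes each element as a call argument, and for
// roughly 100k+ elements this can throw
// "RangeError: Maximum call stack size exceeded".
// Appending in a loop avoids the variadic call entirely:
function appendAll(target, source) {
  for (let i = 0; i < source.length; i++) {
    target.push(source[i]);
  }
  return target;
}

const values = [];
appendAll(values, new Array(1_000_000).fill(0));
console.log(values.length); // 1000000
```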
ibgreen commented 2 years ago

The JS / TypeScript version of the loader has not yet been optimized. The batches are read out row-by-row by a "row iterator" and then concatenated.

This can easily be made much faster. A good WASM loader can probably beat JS, but given Parquet's block memory loading model, I doubt the performance difference between the two implementations would be significant.

Overall JS may also have a smaller bundle size, but that can be less of an issue if the code is loaded dynamically.
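To illustrate the row-iterator vs. columnar contrast (a sketch under my own assumptions, not the actual loaders.gl internals): materializing one object per row means a million allocations plus concatenation, while returning batches column-wise keeps the decoded typed arrays intact:

```javascript
// Hypothetical contrast between the two strategies.
// Row iterator: one object allocation per row, later concatenated.
function toRows(column) {
  const rows = [];
  for (let i = 0; i < column.length; i++) {
    rows.push({value: column[i]});
  }
  return rows;
}

// Columnar batch: return the typed array as-is; zero per-row allocation,
// and consumers can slice row groups without copying element by element.
function toColumnarBatch(column) {
  return {length: column.length, columns: {value: column}};
}

const column = Float64Array.from({length: 5}, (_, i) => i);
console.log(toRows(column).length);          // 5
console.log(toColumnarBatch(column).length); // 5
```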