luizperes / simdjson_nodejs

Node.js bindings for the simdjson project: "Parsing gigabytes of JSON per second"
https://arxiv.org/abs/1902.08318
Apache License 2.0

Expose the underlying tapes as ArrayBuffer #34

Open nojvek opened 4 years ago

nojvek commented 4 years ago

Reading the code, I see that the fast part of simdjson is parsing the JSON bytes and creating two buffers (tapes). One is the JSON tape, which marks the start, end, and type of each element. The other is a string tape, which contains the parsed strings in UTF-8.

JavaScript offers fast buffer indexing and value access via TypedArrays and ArrayBuffers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays
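As a minimal sketch of what that indexing could look like, the snippet below decodes 64-bit tape words through a `BigUint64Array`. It assumes each word keeps its type tag in the most significant 8 bits and its payload in the low 56 bits, which matches how the dumper later in this thread reads byte 7 of each word. The two words are hand-made for illustration, and `decodeWord` is a hypothetical helper, not part of simdjson or of these bindings.

```javascript
// Assumed layout: type tag in the top 8 bits, payload in the low 56 bits.
const TYPE_SHIFT = 56n;
const PAYLOAD_MASK = (1n << 56n) - 1n;

const buffer = new ArrayBuffer(16); // room for two tape words
const words = new BigUint64Array(buffer);

// Hand-made words for illustration only; a real tape would come from the native parser.
words[0] = (BigInt('['.charCodeAt(0)) << TYPE_SHIFT) | 2n;
words[1] = (BigInt(']'.charCodeAt(0)) << TYPE_SHIFT) | 0n;

// Hypothetical helper: splits one word into its type character and payload.
function decodeWord(word) {
  return {
    type: String.fromCharCode(Number(word >> TYPE_SHIFT)),
    payload: word & PAYLOAD_MASK,
  };
}

const decoded = Array.from(words, decodeWord);
console.log(decoded);
```

Typed arrays use the platform's byte order, so for bytes that arrive from a file or over the network a `DataView` with an explicit little-endian flag is probably the safer choice.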

This would mean that the iteration step, getting values out, could be done in pure JS. That is, it would be technically possible to stream the buffers as binary data to the browser and have the JSON iteration work there too.

Or one could dump the tapes to files and get zero-cost parsing by simply mmapping a file and iterating over gigabytes of JSON tape, like FlatBuffers.

https://google.github.io/flatbuffers/
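A rough sketch of the dump-and-reload idea in Node.js follows. The file name and the two hand-made tape words are assumptions for illustration. Note that `fs.readFileSync` copies the file into memory rather than memory-mapping it; Node.js has no mmap in its standard library, so a true zero-copy version would need a native addon.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Stand-in for a tape produced by the native parser: two 64-bit words.
const tape = new DataView(new ArrayBuffer(16));
tape.setBigUint64(0, 0x5b00000000000002n, true); // '[' tag (0x5b) with payload 2
tape.setBigUint64(8, 0x5d00000000000000n, true); // ']' tag (0x5d) with payload 0

// Dump the raw bytes to a file (hypothetical file name).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'tape-'));
const file = path.join(dir, 'tape.buffer');
fs.writeFileSync(file, new Uint8Array(tape.buffer));

// Load it back. This copies the bytes into memory; it is not mmap.
const loaded = fs.readFileSync(file);
// A Buffer may be a slice of a larger pooled ArrayBuffer, so pass byteOffset and length.
const view = new DataView(loaded.buffer, loaded.byteOffset, loaded.length);

const firstType = String.fromCharCode(view.getUint8(7));
const firstPayload = view.getUint32(0, true);
console.log(firstType, firstPayload);

fs.rmSync(dir, { recursive: true });
```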

I also don't think lazyParse as the only function is a great interface. The underlying simdjson has a concept of elements and iterators, and JavaScript has a similar concept of iterators too. Without that, one would need to resort to Proxy hacks, which are a bit too magical. I think we can expose a much nicer object/array iterator-based interface for the underlying tape.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Iterators_and_Generators

https://codeburst.io/a-simple-guide-to-es6-iterators-in-javascript-with-examples-189d052c3d8e

This would mean there'd be two submodules. One that is a fast jsonStr -> {jsonTape, strsTape}

The other that takes {jsonTape, strsTape} -> elemIterator.
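The second submodule could be sketched as a generator, as below. This is only an illustration under assumptions: the tape layout is the one the dumper later in this thread relies on (8-byte little-endian words, type tag in byte 7, string records stored as a uint32 length followed by UTF-8 bytes), `buildTape` and `buildStrings` are hypothetical helpers that fabricate input so the example runs without the native addon, and the structural payloads are placeholder values.

```javascript
// Hypothetical helper: builds a tape from [typeChar, payload] pairs.
function buildTape(words) {
  const view = new DataView(new ArrayBuffer(words.length * 8));
  words.forEach(([type, payload], i) => {
    view.setUint32(i * 8, payload, true);         // low 32 bits of the payload
    view.setUint8(i * 8 + 7, type.charCodeAt(0)); // type tag in the top byte
  });
  return view;
}

// Hypothetical helper: builds [uint32 length][utf-8 bytes] string records.
function buildStrings(strings) {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));
  const total = encoded.reduce((n, bytes) => n + 4 + bytes.length, 0);
  const view = new DataView(new ArrayBuffer(total));
  const offsets = [];
  let pos = 0;
  for (const bytes of encoded) {
    offsets.push(pos);
    view.setUint32(pos, bytes.length, true);
    new Uint8Array(view.buffer, pos + 4, bytes.length).set(bytes);
    pos += 4 + bytes.length;
  }
  return { view, offsets };
}

// {jsonTape, strsTape} -> iterator of {type, value} elements.
function* elemIterator({ jsonTape, strsTape }) {
  const decoder = new TextDecoder();
  for (let i = 0; i < jsonTape.byteLength; i += 8) {
    const type = String.fromCharCode(jsonTape.getUint8(i + 7));
    if (type === '"') {
      const strIdx = jsonTape.getUint32(i, true);
      const strLen = strsTape.getUint32(strIdx, true);
      const start = strsTape.byteOffset + strIdx + 4;
      yield { type, value: decoder.decode(new Uint8Array(strsTape.buffer, start, strLen)) };
    } else if (type === 't' || type === 'f') {
      yield { type, value: type === 't' };
    } else if (type === 'n') {
      yield { type, value: null };
    } else if (type === 'l' || type === 'u' || type === 'd') {
      i += 8; // numbers keep their value in the following 64-bit word
      const value = type === 'd' ? jsonTape.getFloat64(i, true)
        : type === 'l' ? jsonTape.getBigInt64(i, true)
        : jsonTape.getBigUint64(i, true);
      yield { type, value };
    } else {
      // Structural markers (r, [, ], {, }): expose the raw payload.
      yield { type, value: jsonTape.getUint32(i, true) };
    }
  }
}

// Fabricated tapes roughly shaped like the document ["hi", true].
const { view: strsTape, offsets } = buildStrings(['hi']);
const jsonTape = buildTape([
  ['r', 5], ['[', 4], ['"', offsets[0]], ['t', 0], [']', 1], ['r', 0],
]);
const elems = [...elemIterator({ jsonTape, strsTape })];
console.log(elems);
```

A flat stream of elements is the simplest shape to show here; an object/array-aware iterator would sit on top of it and use the structural payloads to skip over nested containers.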

Hopefully I’m making sense.

I’m happy to write the JS part of the code. I just need to figure out how to export the buffers using N-API.

luizperes commented 4 years ago

Thanks @nojvek, that does make sense! I've been meaning to add the rest of the API (but failed to document that in an issue).

The iterator part, in my opinion, should stay on the C++ side, since simdjson already has an API for "JSON Pointer" and support for iterators (https://github.com/simdjson/simdjson/blob/master/doc/basics.md#json-pointer). If we reimplemented it in this library, we would have to update our "internal" API (or, depending on the change, rewrite it) every time upstream changed its own API, just to stay up to date.

nojvek commented 4 years ago

Not opposed to the idea of using the existing C++ interface.

I wrote a pure JS tape dumper because I wanted to understand the underlying mechanics of simdjson. Some neat ideas.

const fs = require(`fs`);
const carsTapeBuffer = fs.readFileSync(`${__dirname}/tape.buffer`);
const carsStrBuffer = fs.readFileSync(`${__dirname}/str.buffer`);

const TapeType = {
  ROOT: 'r'.charCodeAt(0),
  START_ARRAY: '['.charCodeAt(0),
  START_OBJECT: '{'.charCodeAt(0),
  END_ARRAY: ']'.charCodeAt(0),
  END_OBJECT: '}'.charCodeAt(0),
  STRING: '"'.charCodeAt(0),
  INT64: 'l'.charCodeAt(0),
  UINT64: 'u'.charCodeAt(0),
  DOUBLE: 'd'.charCodeAt(0),
  TRUE_VALUE: 't'.charCodeAt(0),
  FALSE_VALUE: 'f'.charCodeAt(0),
  NULL_VALUE: 'n'.charCodeAt(0),
};

/**
 * @param {DataView} tapeBufView
 * @param {DataView} strBufView
 */
function dumpTape(tapeBufView, strBufView) {
  console.log(tapeBufView);
  console.log(strBufView);
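  const size64 = 8; // sizeof(uint64_t): every tape word is 8 bytes
  const size32 = 4; // sizeof(uint32_t): length prefix of each string record
  const textDecoder = new TextDecoder();

  for (let tapeIdx = 0, len = tapeBufView.byteLength; tapeIdx < len; tapeIdx += size64) {
    // The type tag is the most significant byte of the little-endian 64-bit word.
    const elemType = tapeBufView.getUint8(tapeIdx + 7);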
    switch (elemType) {
      case TapeType.ROOT:
      case TapeType.START_ARRAY:
      case TapeType.START_OBJECT:
      case TapeType.END_ARRAY:
      case TapeType.END_OBJECT: {
        // Only the low 32 bits of the payload are read here.
        const offset = tapeBufView.getUint32(tapeIdx, true);
        console.log(String.fromCharCode(elemType), offset);
        break;
      }
      case TapeType.TRUE_VALUE:
      case TapeType.FALSE_VALUE:
      case TapeType.NULL_VALUE: {
        console.log(String.fromCharCode(elemType));
        break;
      }
      case TapeType.STRING: {
        const strIdx = tapeBufView.getUint32(tapeIdx, true); // offset into the string buffer
        const strLen = strBufView.getUint32(strIdx, true); // uint32 length prefix
        const str = textDecoder.decode(new DataView(strBufView.buffer, strBufView.byteOffset + strIdx + size32, strLen));
        console.log(String.fromCharCode(elemType), str);
        break;
      }
      case TapeType.INT64: {
        tapeIdx += size64; // the value is stored in the next 64-bit word
        const val = tapeBufView.getBigInt64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), val);
        break;
      }
      case TapeType.UINT64: {
        tapeIdx += size64; // the value is stored in the next 64-bit word
        const elemVal = tapeBufView.getBigUint64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      case TapeType.DOUBLE: {
        tapeIdx += size64; // the value is stored in the next 64-bit word
        const elemVal = tapeBufView.getFloat64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      default: {
        throw new Error(`unknown type ${elemType}, this should never happen`);
      }
    }
  }
}

dumpTape(
  new DataView(carsTapeBuffer.buffer, carsTapeBuffer.byteOffset, carsTapeBuffer.length),
  new DataView(carsStrBuffer.buffer, carsStrBuffer.byteOffset, carsStrBuffer.length)
);

luizperes commented 4 years ago

I see. With your example, I re-read what you wrote, and it makes a lot of sense. So would it suffice if I exposed the two available tapes? (I only need to check whether they can be accessed from C++, preferably without modifying the headers.)

nojvek commented 4 years ago

That would be great if they're easy to expose without a big perf impact.

luizperes commented 4 years ago

Hi @nojvek,

I implemented what you asked on the buffers branch. Creating ArrayBuffers seems to have some overhead in N-API, and it doesn't actually improve performance as we initially thought it would. The funny thing is that if I replace it with an External (keeping a C++ pointer to the data), we then see good results (> 1 GB/s).

Do you know if there is a way of converting an external object into an array buffer?