manzt / numcodecs.js

Buffer compression and transformation codecs for use in Zarr.js and beyond...
MIT License

Add Blosc #6

Closed manzt closed 4 years ago

manzt commented 4 years ago

I think this is really close to being complete. I compiled c-blosc to the LLVM blosc_static library with emcmake, then used emcc to generate the wasm and JavaScript glue code for blosc.c. I then wrap the glue code in the Blosc codec. I use the WASM linear memory (HEAP8) to copy buffers between WebAssembly and JavaScript. Unfortunately, I don't think there is a way to do this without copying.
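
To make the copying step concrete, here is a minimal sketch of moving bytes into and out of linear memory. It uses a bare `WebAssembly.Memory` rather than the real Emscripten module, and the fixed `ptr` offset is a made-up stand-in for a `_malloc` call, so treat it as an illustration of the idea rather than the codec's actual code.

```javascript
// Sketch only: `memory` stands in for the memory an Emscripten module exposes,
// and `ptr` is a hypothetical address that a real `_malloc` would return.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page
const HEAPU8 = new Uint8Array(memory.buffer);

function copyIn(bytes, ptr) {
  // JS -> wasm: writes a copy of `bytes` starting at byte offset `ptr`
  HEAPU8.set(bytes, ptr);
}

function copyOut(ptr, length) {
  // wasm -> JS: slice() allocates a fresh buffer, so the result stays valid
  // even if the region is later freed or overwritten
  return HEAPU8.slice(ptr, ptr + length);
}

const input = new Uint8Array([1, 2, 3, 250]);
const ptr = 1024; // hypothetical address
copyIn(input, ptr);
const output = copyOut(ptr, input.length);
console.log(output);
```

Each direction costs one copy: the input has to live inside the wasm heap for the C code to see it, and the output has to be copied out before the heap region is released.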

Since all wasm-linked code has to be loaded at runtime, I have the codecs return a promise for encoding/decoding. I think this will fit nicely into zarr.js since all get and set methods are async, and once the module has been resolved it will return immediately when awaited in subsequent calls. I am roughly basing this implementation on what I've seen in squoosh, where I found this elegant method for instantiating Emscripten modules.
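
The "resolve once, reuse afterwards" behaviour can be sketched as a cached promise. `loadModule` below is a hypothetical stand-in for the Emscripten factory (it only resolves a plain object after a short delay); the point is that every caller shares one promise, so initialization runs a single time and later awaits settle on the next microtask instead of waiting on a new load.

```javascript
// Hypothetical stand-in for the Emscripten module factory
let initCount = 0;
function loadModule() {
  initCount += 1;
  return new Promise((resolve) => {
    setTimeout(() => resolve({ name: 'fake-blosc' }), 10);
  });
}

let modulePromise; // shared by every call

function getModule() {
  // Store the promise itself (not the resolved value) so that concurrent
  // callers arriving before initialization finishes do not trigger a second load
  if (!modulePromise) {
    modulePromise = loadModule();
  }
  return modulePromise;
}

async function main() {
  const [a, b] = await Promise.all([getModule(), getModule()]);
  const c = await getModule();
  console.log(a === b && b === c, initCount); // logs: true 1
}

main();
```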

As a side note, I think allowing encoding/decoding to be async has the added benefit of allowing for off-main-thread decoding (using workers or pool of workers) if someone wants to configure that for a particular use case. The real benefit here is that zarr.js is async in nature, so anything within ZarrArray.get and ZarrArray.set can be awaited :)

Unfortunately, I am having trouble with building for multiple targets as well as testing, and could really use your insight/help @gzuidhof (if you have the time!). Anyways, you should be able to check out the branch and open codecs/blosc/example.html to see a working example!


~/GitHub/manzt/numcodecs.js/codecs/blosc
$ ls -lh
total 712
-rw-r--r--   1 trevormanz  staff   997B May  8 10:45 blosc_codec.c
-rw-r--r--   1 trevormanz  staff   568B May  8 09:48 blosc_codec.d.ts
-rw-r--r--   1 trevormanz  staff    16K May  8 11:39 blosc_codec.js
-rw-r--r--   1 trevormanz  staff   301K May  8 11:38 blosc_codec.wasm
drwxr-xr-x  11 trevormanz  staff   352B May  8 11:38 build
-rwxr-xr-x   1 trevormanz  staff   855B May  8 11:38 build.sh
drwxr-xr-x  29 trevormanz  staff   928B May  8 10:00 c-blosc
-rw-r--r--   1 trevormanz  staff   3.1K May  8 10:40 example.html

Total wasm size is ~300K, plus 16K of unminified glue code (100K and 5K gzipped, respectively).

manzt commented 4 years ago

Overall I'd be interested in trying to write our own glue code using ES6 modules so that rollup has a much easier time and we can use @rollup/plugin-wasm to inline the wasm as base64. This increases the wasm size, but it is very portable and will make sharing much easier.

However, my implementation does rely on some things provided by the Module from emscripten. I can't even seem to get the webassembly to load (even with the -s STANDALONE_WASM flag), so getting this to work would be great but would require some significant effort. Perhaps I can circle back next week.

manzt commented 4 years ago

Example using async decoding API, importing Blosc as an es6 module:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <script type="module">
      import Blosc from './blosc.js';
      (async () => {
        const size = 100000;
        const arr = new Uint32Array(size);
        for (let i = 0; i < size; i++) {
          arr[i] = i;
        }
        console.log('Original:', arr);

        const bytes = new Uint8Array(arr.buffer);
        console.log('Bytes:', bytes);

        const codec = new Blosc();
        const cBytes = await codec.encode(bytes);
        console.log('Compressed:', cBytes);

        const dBytes = await codec.decode(cBytes);
        console.log('Decompressed:', new Uint32Array(dBytes.buffer));
      })();
    </script>
  </body>
</html>
blosc.js code

```javascript
import blosc_codec from './blosc_codec.js'; // emscripten generated module

function initEmscriptenModule(moduleFactory) {
  return new Promise((resolve) => {
    const module = moduleFactory({
      // Just to be safe, don't automatically invoke any wasm functions
      noInitialRun: true,
      onRuntimeInitialized() {
        // An Emscripten module is a then-able that resolves with itself, causing an infinite loop when you
        // wrap it in a real promise. Deleting the `then` prop solves this for now.
        // https://github.com/kripken/emscripten/issues/5820
        delete module.then;
        resolve(module);
      },
    });
  });
}

const BLOSC_MAX_OVERHEAD = 16;

const COMPRESSOR_MAP = new Map()
  .set('blosclz', 0)
  .set('lz4', 1)
  .set('lz4hc', 2)
  .set('snappy', 3)
  .set('zlib', 4)
  .set('zstd', 5);

let emscriptenModule;

class Blosc {
  constructor(clevel = 5, cname = 'lz4', shuffle = 1, blocksize = 0) {
    if (clevel < 0 || clevel > 9) {
      throw new Error(
        `Invalid blosc compression 'clevel', it should be between 0 and 9`,
      );
    }
    if (!COMPRESSOR_MAP.has(cname)) {
      throw new Error(
        `Invalid compression name '${cname}', it should be one of 'blosclz', 'lz4', 'lz4hc','snappy', 'zlib', 'zstd'.`,
      );
    }
    this.blocksize = blocksize;
    this.clevel = clevel;
    this.cname = cname;
    this.shuffle = shuffle;
  }

  static fromConfig({ blocksize, clevel, cname, shuffle }) {
    return new Blosc(clevel, cname, shuffle, blocksize);
  }

  async encode(data) {
    if (!emscriptenModule) {
      emscriptenModule = initEmscriptenModule(blosc_codec);
    }
    const module = await emscriptenModule;
    const { _b_compress: compress, _malloc, _free, HEAP8 } = module;
    // One allocation holds both the source bytes and the compressed output
    const ptr = _malloc(data.byteLength + data.byteLength + BLOSC_MAX_OVERHEAD);
    const destPtr = ptr + data.byteLength;
    HEAP8.set(data, ptr);
    const cBytes = compress(
      ptr,
      destPtr,
      this.clevel,
      this.shuffle,
      this.blocksize,
      data.length,
      COMPRESSOR_MAP.get(this.cname),
    );
    // check compression was successful
    if (cBytes <= 0) {
      throw Error(`Error during blosc compression: ${cBytes}`);
    }
    const resultView = new Uint8Array(HEAP8.buffer, destPtr, cBytes);
    // Copy out of the wasm heap before freeing
    const result = new Uint8Array(resultView);
    _free(ptr);
    return result;
  }

  async decode(data, out) {
    if (!emscriptenModule) {
      emscriptenModule = initEmscriptenModule(blosc_codec);
    }
    const module = await emscriptenModule;
    const {
      _b_decompress: decompress,
      _get_nbytes: getNbytes,
      _malloc,
      _free,
      HEAP8,
    } = module;
    // Allocate memory to copy source array
    const sourcePtr = _malloc(data.byteLength);
    HEAP8.set(data, sourcePtr);
    // Determine size of uncompressed array and allocate
    const nBytes = getNbytes(sourcePtr);
    const destPtr = _malloc(nBytes);
    const ret = decompress(sourcePtr, destPtr);
    if (ret <= 0) {
      throw Error(`Error during blosc decompression: ${ret}`);
    }
    const resultView = new Uint8Array(HEAP8.buffer, destPtr, nBytes);
    // Copy out of the wasm heap before freeing
    const result = new Uint8Array(resultView);
    _free(sourcePtr);
    _free(destPtr);
    if (out !== undefined) {
      out.set(result);
      return out;
    }
    return result;
  }
}

export default Blosc;
```
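
One detail in the listing above: when compression or decompression fails, the `throw` happens before the `_free` calls, so the wasm-side allocation is never released. A possible variant wraps the work in `try`/`finally`. The sketch below demonstrates this against a fake module; `createFakeModule`, its bump allocator, and the three-argument `compress` are all hypothetical and exist only so the example runs without the real wasm.

```javascript
// Hypothetical stand-in for the Emscripten module: a bump allocator that
// tracks outstanding allocations, plus a `compress` that can be made to fail.
function createFakeModule({ fail }) {
  const HEAPU8 = new Uint8Array(1024);
  const live = new Set(); // pointers that have been allocated but not freed
  let next = 8;
  return {
    HEAPU8,
    live,
    _malloc(size) {
      const ptr = next;
      next += size;
      live.add(ptr);
      return ptr;
    },
    _free(ptr) {
      live.delete(ptr);
    },
    // Pretend compression: copies the input unchanged, or returns an error code
    compress(src, dest, nbytes) {
      if (fail) return -1;
      HEAPU8.copyWithin(dest, src, src + nbytes);
      return nbytes;
    },
  };
}

function encodeWith(module, data) {
  const { HEAPU8, _malloc, _free, compress } = module;
  const ptr = _malloc(data.byteLength * 2);
  try {
    const destPtr = ptr + data.byteLength;
    HEAPU8.set(data, ptr);
    const cBytes = compress(ptr, destPtr, data.byteLength);
    if (cBytes <= 0) {
      throw new Error(`Error during compression: ${cBytes}`);
    }
    // slice() copies the result out of the heap before it is freed
    return HEAPU8.slice(destPtr, destPtr + cBytes);
  } finally {
    _free(ptr); // runs on both the success path and the error path
  }
}
```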

manzt commented 4 years ago

An alternate pattern I've seen:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <script type="module">
      import Blosc from './blosc.js';
      (async () => {
        const size = 100000;
        const arr = new Uint32Array(size);
        for (let i = 0; i < size; i++) {
          arr[i] = i;
        }
        console.log('Original:', arr);

        const bytes = new Uint8Array(arr.buffer);
        console.log('Bytes:', bytes);

        const codec = new Blosc();
        await codec.ready; // Promise that resolves once the wasm has been initialized
        // Other instances of Blosc can then call encode/decode immediately without awaiting

        const cBytes = codec.encode(bytes);
        console.log('Compressed:', cBytes);

        const codec2 = new Blosc();
        const dBytes = codec2.decode(cBytes);
        console.log('Decompressed:', new Uint32Array(dBytes.buffer));
      })();
    </script>
  </body>
</html>
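
For comparison, a codec exposing a `ready` promise with synchronous `encode`/`decode` might look roughly like the sketch below. `ReadyCodec` and `loadModule` are hypothetical names, and the "compression" is just a byte reversal so the example runs without any wasm; only the shape of the API is the point.

```javascript
// Hypothetical loader; a real one would instantiate the wasm module
function loadModule() {
  return Promise.resolve({
    // Fake transform: reversing bytes makes encode and decode the same operation
    transform: (bytes) => Uint8Array.from(bytes).reverse(),
  });
}

let sharedModule; // set once loading has finished
let sharedReady; // a single promise shared by all instances

class ReadyCodec {
  constructor() {
    if (!sharedReady) {
      sharedReady = loadModule().then((m) => {
        sharedModule = m;
      });
    }
    this.ready = sharedReady;
  }

  encode(bytes) {
    if (!sharedModule) {
      throw new Error('Codec not ready: await codec.ready first');
    }
    return sharedModule.transform(bytes);
  }

  decode(bytes) {
    if (!sharedModule) {
      throw new Error('Codec not ready: await codec.ready first');
    }
    return sharedModule.transform(bytes);
  }
}

async function demo() {
  const codec = new ReadyCodec();
  await codec.ready;
  const encoded = codec.encode(new Uint8Array([1, 2, 3]));
  // A second instance can decode synchronously because the module is shared
  return new ReadyCodec().decode(encoded);
}
```

The trade-off is that callers must remember to await `ready` at least once; calling `encode` too early fails, whereas the async `encode`/`decode` API hides initialization entirely.
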
manzt commented 4 years ago

Re comments in #2: I agree that adding blosc to a main distribution for convenience would be the most developer friendly. In this regard, I think we should look into inlining the wasm as a base64-encoded string and writing some custom glue code. This way the main distribution of zarr could be a single, tree-shakeable module. Inlining the wasm typically increases wasm size by ~33%, but I think this is worth it in this context, until there is a better way to share Emscripten WebAssembly as modules.

However, if we go this route, we should add a named submodule to zarr.js's package exports which does not include any codecs. This would be for power users who are concerned about bundle sizes for particular applications.
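
As a rough illustration of the base64 inlining idea and where the ~33% figure comes from: base64 emits 4 characters for every 3 input bytes. The sketch below round-trips the smallest valid wasm binary (just the magic number and version) in place of the real codec; `toBase64`, `fromBase64`, and `base64Length` are illustrative helper names, not part of any library.

```javascript
// Smallest valid wasm binary: the "\0asm" magic number plus version 1.
// It stands in for the real codec's wasm bytes.
const wasmBytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// Build step (sketch): binary -> base64 string that can be embedded in a JS module
function toBase64(bytes) {
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Load step (sketch): base64 string -> bytes that WebAssembly can consume
function fromBase64(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// base64 emits 4 characters per (padded) group of 3 input bytes
function base64Length(nBytes) {
  return 4 * Math.ceil(nBytes / 3);
}

const inlined = toBase64(wasmBytes);
const decoded = fromBase64(inlined);
console.log(inlined); // "AGFzbQEAAAA="
console.log(WebAssembly.validate(decoded)); // true
console.log(base64Length(300000)); // 400000, i.e. about 33% larger
```

The decoded bytes can then be passed to `WebAssembly.instantiate`, which avoids a separate network request for the `.wasm` file at the cost of the larger payload.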

gzuidhof commented 4 years ago

I really don't have much useful to say about the C and wasm glue code, I have not played with it before, I am looking forward to it though! I can help you set up automated testing. Maybe we can find somebody more knowledgeable to review this code and give pointers?

For the build system, maybe we can consider switching to webpack, I always end up falling back to that for more complex build pipelines, it allows you to do more custom (&hacky) things which we may need here. What I propose is that instead of making numcodecs fit to what Rollup wants, we can use a webpack script that we can make fit to what we need.

For distribution: I agree let's put this in a string. We can compress it and decompress on load (convenient that we literally ship a bunch of ways to do that ^^), and base64/base85 it. I think in the end it's a win anyway because it will save a request, even if it's a bit larger in binary size.

manzt commented 4 years ago

Thanks for the comments! Before we have someone else look at it, I'm going to try to write the wasm source in C++ since there is better support for defining javascript bindings with embind, and I think there is a more elegant (and performant) way to encode/decode without just copying buffers and passing pointers. See here.

Once I'm done with that, I'm going to take a stab at writing the JavaScript glue code by hand. There is a lot there, but it's pretty straightforward. Some of the issues with bundling for multiple targets have been due to that code being designed to target a lot of different environments (which is hard for the bundler to make sense of). This way we will have better control over how the wasm is getting instantiated, and webpack can take care of creating different bundles. I don't have much experience with webpack so I'll really appreciate the help there!

I'm going to be busy this week with my term ending and finals, but I should have more to share (and more time on my hands) by the end of the week!

gzuidhof commented 4 years ago

Good luck with your finals! :)

manzt commented 4 years ago

@gzuidhof OK, so I think I have rollup working. I refactored the C to C++, and the interface is much cleaner (and nicer to use). The hardest part remains trying to connect the JavaScript glue code from Emscripten. I think there is a lot of unnecessary code there, but it would require a significant amount of time to write it in modern ES by hand (and leave it to the bundler to output for different targets). I feel good with where it is at right now though.

I added a step to codecs/blosc/build.sh that base64 encodes the wasm as a string in an es module (see blosc_codec_wasm.js). Now, the final export is a single javascript file (~432kb unminified). I got rid of the umd build for the time being (since I think these make the most sense as modules), but we can use the module export in zarr.

Right now numcodecs is npm install-able (try running npm run build && npm pack and then installing that tarball in another directory). The last thing to do is get the testing environment working. I decided to make encode/decode async rather than setting a ready prop on the codec.

manzt commented 4 years ago

I think I got tests working! We could ask others to review, but I'm feeling pretty good about this now. Probably should focus on incorporating with zarr.js somehow.

manzt commented 4 years ago

I opened a branch (numcodecs) in zarr.js which just takes the current release here and creates a registry with getCodec. I hope that's alright -- I won't open a PR until we release v0.1.0 for numcodecs, which I plan to do once this PR is merged (and also make the repo public). Allowing for async decodeChunk and encodeChunk was really all it took, and there were no changes to zarr's public API :) As such, I didn't need to change any tests! I just added an end-to-end blosc example.

I think I'd like to explore a more dynamic registry in zarr.js, but I think for V1 having a release with all codecs bundled is probably desirable.
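
A more dynamic registry could look something like the sketch below, where each codec id maps to an async loader so that a codec's code (and any wasm it carries) is only pulled in on first use. Only the name `getCodec` comes from the discussion above; `addCodec`, the `Identity` codec, and the loader shape are assumptions for illustration and not zarr.js's actual API.

```javascript
// Hypothetical registry: codec id -> async loader returning a codec class
const registry = new Map();

function addCodec(id, loader) {
  registry.set(id, loader);
}

async function getCodec(config) {
  const loader = registry.get(config.id);
  if (!loader) {
    throw new Error(`Codec '${config.id}' is not registered`);
  }
  const Codec = await loader();
  return Codec.fromConfig(config);
}

// Toy codec used only to exercise the registry; it returns its input unchanged
class Identity {
  static fromConfig() {
    return new Identity();
  }
  async encode(data) {
    return data;
  }
  async decode(data) {
    return data;
  }
}

addCodec('identity', async () => Identity);
// In a bundled app the loader could be a dynamic import instead (assumption):
// addCodec('blosc', () => import('./blosc.js').then((m) => m.default));

async function roundTrip(bytes) {
  const codec = await getCodec({ id: 'identity' });
  return codec.decode(await codec.encode(bytes));
}
```

Because `encode` and `decode` are already async, awaiting the loader inside `getCodec` adds no new constraints for callers.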