Closed manzt closed 4 years ago
Overall, I'd be interested in trying to write our own glue code using ES6 modules so that rollup has a much easier time and we can use `@rollup/plugin-wasm` to inline the wasm as base64. This increases the wasm size, but it is very portable and will make sharing much easier.
However, my implementation does rely on some things provided by the `Module` from emscripten. I can't even seem to get the WebAssembly to load (even with the `-s STANDALONE_WASM` flag), so getting this to work would be great but would require some significant effort. Perhaps I can circle back next week.
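To make the base64 inlining idea concrete, here is a minimal sketch of decoding an inlined string and instantiating it without any Emscripten glue. The tiny hand-assembled `add` module and the helper names `base64ToBytes` and `loadInlinedWasm` are illustrative assumptions, not part of this PR; the real blosc wasm would also need the imports its glue code normally provides.

```javascript
// Tiny hand-assembled wasm module exporting add(i32, i32) -> i32, base64-encoded.
// It stands in for the real (much larger) blosc wasm binary.
const WASM_BASE64 = 'AGFzbQEAAAABBwFgAn9/AX8DAgEABwcBA2FkZAAACgkBBwAgACABags=';

// Decode a base64 string into bytes. `atob` exists in browsers and in Node 16+.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Instantiate the decoded bytes and resolve with the module's exports.
// `imports` is empty here because the demo module imports nothing.
async function loadInlinedWasm(b64, imports = {}) {
  const { instance } = await WebAssembly.instantiate(base64ToBytes(b64), imports);
  return instance.exports;
}

loadInlinedWasm(WASM_BASE64).then((exports) => {
  console.log(exports.add(2, 3)); // 5
});
```

Because the bytes travel inside the JavaScript source, no separate request for a `.wasm` file is needed, which is what makes this approach portable across bundlers.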
Example using the async decoding API, importing `Blosc` as an ES6 module:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <script type="module">
      import Blosc from './blosc.js';
      (async () => {
        const size = 100000;
        const arr = new Uint32Array(size);
        for (let i = 0; i < size; i++) {
          arr[i] = i;
        }
        console.log('Original:', arr);
        const bytes = new Uint8Array(arr.buffer);
        console.log('Bytes:', bytes);
        const codec = new Blosc();
        const cBytes = await codec.encode(bytes);
        console.log('Compressed:', cBytes);
        const dBytes = await codec.decode(cBytes);
        console.log('Decompressed:', new Uint32Array(dBytes.buffer));
      })();
    </script>
  </body>
</html>
```
```javascript
import blosc_codec from './blosc_codec.js'; // emscripten generated module

function initEmscriptenModule(moduleFactory) {
  return new Promise((resolve) => {
    const module = moduleFactory({
      // Just to be safe, don't automatically invoke any wasm functions
      noInitialRun: true,
      onRuntimeInitialized() {
        // An Emscripten module is a then-able that resolves with itself, causing an infinite
        // loop when you wrap it in a real promise. Deleting the `then` prop solves this for now.
        // https://github.com/kripken/emscripten/issues/5820
        delete module.then;
        resolve(module);
      },
    });
  });
}

const BLOSC_MAX_OVERHEAD = 16;
const COMPRESSOR_MAP = new Map()
  .set('blosclz', 0)
  .set('lz4', 1)
  .set('lz4hc', 2)
  .set('snappy', 3)
  .set('zlib', 4)
  .set('zstd', 5);

// Promise for the initialized module, shared by all codec instances
let emscriptenModule;

class Blosc {
  constructor(clevel = 5, cname = 'lz4', shuffle = 1, blocksize = 0) {
    if (clevel < 0 || clevel > 9) {
      throw new Error(
        `Invalid blosc compression 'clevel', it should be between 0 and 9`,
      );
    }
    if (!COMPRESSOR_MAP.has(cname)) {
      throw new Error(
        `Invalid compression name '${cname}', it should be one of 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'.`,
      );
    }
    this.blocksize = blocksize;
    this.clevel = clevel;
    this.cname = cname;
    this.shuffle = shuffle;
  }

  static fromConfig({ blocksize, clevel, cname, shuffle }) {
    return new Blosc(clevel, cname, shuffle, blocksize);
  }

  async encode(data) {
    if (!emscriptenModule) {
      emscriptenModule = initEmscriptenModule(blosc_codec);
    }
    const module = await emscriptenModule;
    const { _b_compress: compress, _malloc, _free, HEAP8 } = module;

    // One allocation holds the source bytes followed by the destination buffer
    const ptr = _malloc(data.byteLength + data.byteLength + BLOSC_MAX_OVERHEAD);
    const destPtr = ptr + data.byteLength;
    HEAP8.set(data, ptr);
    const cBytes = compress(
      ptr,
      destPtr,
      this.clevel,
      this.shuffle,
      this.blocksize,
      data.length,
      COMPRESSOR_MAP.get(this.cname),
    );

    // check compression was successful
    if (cBytes <= 0) {
      _free(ptr); // release wasm memory before reporting the failure
      throw new Error(`Error during blosc compression: ${cBytes}`);
    }

    // Copy the result out of wasm memory before freeing it
    const resultView = new Uint8Array(HEAP8.buffer, destPtr, cBytes);
    const result = new Uint8Array(resultView);
    _free(ptr);
    return result;
  }

  async decode(data, out) {
    if (!emscriptenModule) {
      emscriptenModule = initEmscriptenModule(blosc_codec);
    }
    const module = await emscriptenModule;
    const {
      _b_decompress: decompress,
      _get_nbytes: getNbytes,
      _malloc,
      _free,
      HEAP8,
    } = module;

    // Allocate memory to copy source array
    const sourcePtr = _malloc(data.byteLength);
    HEAP8.set(data, sourcePtr);

    // Determine size of uncompressed array and allocate
    const nBytes = getNbytes(sourcePtr);
    const destPtr = _malloc(nBytes);
    const ret = decompress(sourcePtr, destPtr);
    if (ret <= 0) {
      // release wasm memory before reporting the failure
      _free(sourcePtr);
      _free(destPtr);
      throw new Error(`Error during blosc decompression: ${ret}`);
    }

    // Copy the result out of wasm memory before freeing it
    const resultView = new Uint8Array(HEAP8.buffer, destPtr, nBytes);
    const result = new Uint8Array(resultView);
    _free(sourcePtr);
    _free(destPtr);

    if (out !== undefined) {
      out.set(result);
      return out;
    }
    return result;
  }
}

export default Blosc;
```
An alternate pattern I've seen:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <script type="module">
      import Blosc from './blosc.js';
      (async () => {
        const size = 100000;
        const arr = new Uint32Array(size);
        for (let i = 0; i < size; i++) {
          arr[i] = i;
        }
        console.log('Original:', arr);
        const bytes = new Uint8Array(arr.buffer);
        console.log('Bytes:', bytes);
        const codec = new Blosc();
        await codec.ready; // Promise that resolves once the wasm has been initialized
        // Other instances of Blosc can then call encode/decode immediately without awaiting
        const cBytes = codec.encode(bytes);
        console.log('Compressed:', cBytes);
        const codec2 = new Blosc();
        const dBytes = codec2.decode(cBytes);
        console.log('Decompressed:', new Uint32Array(dBytes.buffer));
      })();
    </script>
  </body>
</html>
```
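For comparison, a rough sketch of how a `ready`-style codec could be structured: one shared initialization promise, with synchronous `encode`/`decode` that fail loudly if called too early. Everything here is hypothetical, including `ReadyCodec` and `fakeModuleFactory`, which stands in for the Emscripten factory and simply copies bytes rather than compressing them.

```javascript
// Hypothetical stand-in for the Emscripten module factory. It resolves
// asynchronously with an identity "codec" instead of real wasm exports.
function fakeModuleFactory() {
  return new Promise((resolve) => {
    setTimeout(() => resolve({
      compress: (bytes) => Uint8Array.from(bytes),
      decompress: (bytes) => Uint8Array.from(bytes),
    }), 0);
  });
}

let wasmModule;   // set once initialization has finished
let readyPromise; // shared by every instance, so init happens only once

class ReadyCodec {
  constructor() {
    if (!readyPromise) {
      readyPromise = fakeModuleFactory().then((m) => {
        wasmModule = m;
      });
    }
    this.ready = readyPromise;
  }

  // Synchronous, but only valid after `ready` has resolved.
  encode(data) {
    if (!wasmModule) {
      throw new Error('Codec not initialized; await codec.ready first.');
    }
    return wasmModule.compress(data);
  }

  decode(data) {
    if (!wasmModule) {
      throw new Error('Codec not initialized; await codec.ready first.');
    }
    return wasmModule.decompress(data);
  }
}

const firstCodec = new ReadyCodec();
firstCodec.ready.then(() => {
  const encoded = firstCodec.encode(new Uint8Array([1, 2, 3]));
  // A second instance reuses the already-initialized module.
  console.log(new ReadyCodec().decode(encoded));
});
```

The trade-off is that callers must remember to await `ready` once, whereas async `encode`/`decode` hide initialization entirely.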
Re comments in #2: I agree that adding blosc to the main distribution for convenience would be the most developer-friendly option. In this regard, I think we should look into inlining the wasm as a base64-encoded string and writing some custom glue code. This way the main distribution of zarr could be a single, tree-shakeable module. Inlining the wasm typically increases its size by ~33%, but I think this is worth it in this context, until there is a better way to share emscripten WebAssembly as modules.
However, if we go this route, we should add a named submodule to zarr.js's package exports which does not include any codecs. This would be for power users who are concerned about bundle sizes for particular applications.
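The ~33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A quick sketch of the arithmetic, using a round 300 KiB purely as an illustrative input size (the helper names are made up for this example):

```javascript
// Length of the padded base64 encoding of `byteLength` bytes:
// every group of up to 3 bytes becomes 4 characters.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Relative size increase, e.g. 0.333 for roughly 33%.
function base64Overhead(byteLength) {
  return base64Length(byteLength) / byteLength - 1;
}

const illustrativeWasmBytes = 300 * 1024; // assumed size, for illustration only
console.log(base64Length(illustrativeWasmBytes)); // 409600
console.log(base64Overhead(illustrativeWasmBytes).toFixed(3)); // 0.333
```

This is the size before any transport compression such as gzip is applied to the bundle.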
I really don't have much useful to say about the C and wasm glue code. I have not played with it before, though I am looking forward to it! I can help you set up automated testing. Maybe we can find somebody more knowledgeable to review this code and give pointers?
For the build system, maybe we can consider switching to webpack. I always end up falling back to it for more complex build pipelines, since it allows you to do more custom (and hacky) things, which we may need here. What I propose is that instead of making numcodecs fit what Rollup wants, we use a webpack script that we can make fit what we need.
For distribution: I agree, let's put this in a string. We can compress it and decompress on load (convenient that we literally ship a bunch of ways to do that ^^), and base64/base85 it. I think in the end it's a win anyway because it will save a request, even if it's a bit larger in binary size.
Thanks for the comments! Before we have someone else look at it, I'm going to try to write the wasm source in C++ since there is better support for defining javascript bindings with embind, and I think there is a more elegant (and performant) way to encode/decode without just copying buffers and passing pointers. See here.
Once I'm done with that, I'm going to take a stab at writing the JavaScript glue code by hand. There is a lot there, but it's pretty straightforward. Some of the issues with bundling for multiple targets have been due to that code being designed to target a lot of different environments (which is hard for the bundler to make sense of). This way we will have better control over how the wasm is getting instantiated, and webpack can take care of creating different bundles. I don't have much experience with webpack so I'll really appreciate the help there!
I'm going to be busy this week with my term ending and finals, but I should have more to share (and more time on my hands) by the end of the week!
Good luck with your finals! :)
@gzuidhof OK, so I think I have rollup working. I refactored the C to C++, and the interface is much cleaner (and nicer to use). The hardest part remains trying to connect the JavaScript glue code from emscripten. I think there is a lot of unnecessary code there, but it would require a significant amount of time to write it in modern ES by hand (and leave it to the bundler to output for different targets). I feel good with where it is at right now though.
I added a step to `codecs/blosc/build.sh` that base64-encodes the wasm as a string in an ES module (see `blosc_codec_wasm.js`). Now, the final export is a single JavaScript file (~432kb unminified). I got rid of the `umd` build for the time being (since I think these make the most sense as modules), but we can use the `module` export in zarr.
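As an illustration of what such a build step might produce, here is a small Node sketch that turns wasm bytes into ES module source exporting a base64 string. This is not the contents of `build.sh`; `wasmToModuleSource` is a hypothetical helper, and the 8-byte input is just the wasm header (magic number plus version) rather than the real binary.

```javascript
// Build ES module source that exports the given bytes as a base64 string.
// Relies on `Buffer`, which is a global in Node (this sketch is Node-only).
function wasmToModuleSource(wasmBytes) {
  const b64 = Buffer.from(wasmBytes).toString('base64');
  return `export default "${b64}";\n`;
}

// Stand-in input: the 8-byte wasm header. A real build step would read the
// compiled .wasm file from disk and write the returned string to a .js file.
const headerBytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
console.log(wasmToModuleSource(headerBytes)); // export default "AGFzbQEAAAA=";
```

The generated file can then be imported like any other module, so the bundler never has to resolve a `.wasm` asset.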
Right now `numcodecs` is npm install-able (try running `npm build && npm pack` and then installing that tarball in another directory). The last thing to do is get the testing environment working. I decided to make `encode/decode` async rather than setting a `ready` prop on the codec.
I think I got tests working! We could ask others to review, but I'm feeling pretty good about this now. Probably should focus on incorporating with zarr.js somehow.
I opened a branch (`numcodecs`) in zarr.js which just takes the current release here and creates a registry with `getCodec`. I hope that's alright -- I won't open a PR until we release `v0.1.0` for numcodecs, which I plan to do once this PR is merged (and the repo is made public). Allowing for async `decodeChunk` and `encodeChunk` was really all it took, and there were no changes to zarr's public API :) As such, I didn't need to change any tests! I just added an end-to-end blosc example.
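A codec registry of this kind could look roughly like the sketch below. Only the names `getCodec` and `decodeChunk` come from the comment above; their signatures, the `addCodec` helper, the config shape, and `IdentityCodec` are assumptions made for illustration and may not match the zarr.js branch.

```javascript
// Maps a codec id to a function that asynchronously yields the codec class.
const codecRegistry = new Map();

// Hypothetical helper for registering a codec loader under an id.
function addCodec(id, loadCodec) {
  codecRegistry.set(id, loadCodec);
}

// Resolve a configured codec instance from a config like { id: 'blosc', ... }.
async function getCodec(config) {
  const loadCodec = codecRegistry.get(config.id);
  if (!loadCodec) {
    throw new Error(`Codec '${config.id}' is not registered.`);
  }
  const Codec = await loadCodec();
  return Codec.fromConfig(config);
}

// Illustrative pass-through codec with the same async encode/decode shape.
class IdentityCodec {
  static fromConfig() {
    return new IdentityCodec();
  }
  async encode(data) {
    return data;
  }
  async decode(data) {
    return data;
  }
}

addCodec('identity', async () => IdentityCodec);

// With async codecs, chunk decoding only needs to await the result.
async function decodeChunk(compressorConfig, bytes) {
  const codec = await getCodec(compressorConfig);
  return codec.decode(bytes);
}

decodeChunk({ id: 'identity' }, new Uint8Array([1, 2, 3])).then((out) => {
  console.log(out.length); // 3
});
```

Storing loader functions rather than classes leaves room for the more dynamic registry mentioned later, since a loader could defer importing a codec until it is first requested.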
I think I'd like to explore a more dynamic registry in zarr.js, but I think for V1 having a release with all codecs bundled is probably desirable.
I think this is really close to being complete. I compiled `c-blosc` to LLVM `blosc_static` with `emcmake` and then used `emcc` to generate the wasm and JavaScript glue code for `blosc.c`. I then wrap the glue code in the `Blosc` codec. I use the wasm linear memory (`HEAP8`) to copy buffers between WebAssembly land and JavaScript. I don't think there is a way to do this without copying, unfortunately.
Since all wasm-linked code has to be loaded at runtime, I have the codecs return a promise for encoding/decoding. I think this will fit nicely into zarr.js since all `get` and `set` methods are async, and once the module has been resolved once it will return immediately on being awaited in subsequent calls. I am roughly basing this implementation on what I've seen in `squoosh`, where I found this elegant method for instantiating the emscripten modules.
As a side note, I think allowing encoding/decoding to be async has the added benefit of allowing for off-main-thread decoding (using workers or a pool of workers) if someone wants to configure that for a particular use case. The real benefit here is that `zarr.js` is async in nature, so anything within `ZarrArray.get` and `ZarrArray.set` can be `awaited` :)
Unfortunately, I am having trouble with building for multiple targets as well as testing, and could really use your insight/help @gzuidhof (if you have the time!). Anyways, you should be able to check out the branch and open `codecs/blosc/example.html` to see a working example!
Total wasm size is ~300K with 16K of unminified glue code (100K/5K gzipped).
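The copy-in/copy-out pattern on linear memory can be illustrated without Emscripten at all. In this sketch a plain `WebAssembly.Memory` stands in for the module's heap, and `copyIn`, `copyOut`, and the fixed offset are hypothetical; real code would get pointers from `_malloc`.

```javascript
// A standalone linear memory standing in for the Emscripten heap (1 page = 64 KiB).
const linearMemory = new WebAssembly.Memory({ initial: 1 });
const heapBytes = new Uint8Array(linearMemory.buffer);

// Copy JavaScript-owned bytes into linear memory at `ptr`.
function copyIn(heap, ptr, data) {
  heap.set(data, ptr);
}

// Copy `length` bytes out of linear memory into a fresh, JavaScript-owned array.
function copyOut(heap, ptr, length) {
  return heap.slice(ptr, ptr + length);
}

const inputBytes = new Uint8Array([1, 2, 3, 4]);
const offset = 16; // arbitrary offset; a real module would return this from _malloc
copyIn(heapBytes, offset, inputBytes);

const aliasView = heapBytes.subarray(offset, offset + 4); // shares the heap's buffer
const ownedCopy = copyOut(heapBytes, offset, 4);          // independent copy

heapBytes[offset] = 99; // simulate wasm code overwriting the region later
console.log(aliasView[0], ownedCopy[0]); // 99 1
```

A view into the heap stays cheap but reflects later writes, and it becomes detached if the memory grows, which is one reason to copy results out before freeing or returning them.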