Thoughts towards a leaner zarr.js

manzt commented 3 years ago

TL;DR - I believe modularizing zarr.js further and stripping out unused/unnecessary features will make maintenance and adoption easier in the future. The key is deciding what are "core" features and prioritizing design around those as package exports.

Description

I love using zarr-python with zarr.js, but in its current state (even with the /core export) zarr.js is a bulky dependency (~150kb unminified). Although it is pure ESM and can be treeshaken, the primary issue I've come across is that you can't treeshake classes, and therefore in applications that I just need ZarrArray.getRawChunk, I still need to "pay the price" for get/getRaw/set/setRaw which contain all the chunk indexing/slicing logic.

To this end, I know that it's challenging to convince others to adopt zarr for web-applications (cc: @kylebarron). For example, it would be awesome to add a zarr tile loader to loaders.gl, but right now it makes more sense to duplicate parts of zarr.js to decrease the code footprint instead of importing exports from zarr.

Proposal 1: fully tree-shakeable submodules

I've been experimenting with an implementation of zarr v3 spec (zarrita.js), that is structured in a way to that you can progressively opt into more features (and code) if needed for your application. Right now the zarrita/core module only implements ZarrArray.get_chunk, and the top-level export extends the ZarrArray prototype to add indexing and slicing and writing to the array (e.g. ZarrArray.set, ZarrArray.get). Writing the v3 module in this way, I've been able to reuse the core/indexing logic in zarr-lite which is a minimal v2 implementation for reading chunks and optionally array selections.

Core zarr features

Specifically, I've been thinking about what are the core features of zarr? More motivation can be found on this twitter thread. To me, there are tiers of use that I'd like a zarr implementation to have on the web:

1.) Traverse hierarchy (open nodes of hierarchy, e.g. groups and arrays) + read chunks by key (most minimal set of features useful for the web – getRawChunk) 2.) (1) + read slices of array using array selection (typically how you think of using zarr in python –getRaw)) 3.) (1) + (2) + create (write) to hierarchy (generally unused on the web, but most feature "complete")

Right now, zarr.js is (3), but there is no way to just get the behavior or 1 or 2. In zarr-lite I've implemented 1 and 2 as separate exports:

This works by using Object.defineProperties and extending the ZarrArray prototype with additional methods in the submodule: https://github.com/manzt/zarr-lite/blob/main/src/indexing.js.

Benefits of isolating core exports

The main benefit of isolating modules is reduction in code side. You can import from the top-level export and get a "batteries included" zarr, but those writing other libraries or wanting to use core functionally can just import the submodules that make sense for their application. This means that the code for decoding chunks from an array won't be duplicated outside of zarr.js and instead folks will just add zarr as s dependency.

I also think this has the added benefit of standardizing how features like #16 could be implemented without creating a cost for "core" library users. Win/win.

Proposal 2: remove/move `RawArray` and `NestedArray` from `zarr.js`

In zarrita.js, I experimented with using a minimal object to represent an in-memory array: { data, shape, stride }. As a result, the arrays returned from ZarrArray.get can be composed into other JS ndarray libraries.

import ndarray from 'ndarray';

const selection = [slice(1,4), slice(2,7)];
const { data, shape, stride } = await z.get(selection);
const arr = ndarray(data, shape, stride);

/* perform some array operations */
await a.set(selection, arr); // set using ndarray since arr has properties { data, shape, stride }.

This removes the need for any array class implementation, and zarrita/indexing only needs to implement a set function to fill and out + out_selection from chunk + chunk_selection. https://github.com/manzt/zarrita.js/blob/main/src/ops.js

The big change is that zarr.js would serve as an interface to read slices of massive datasets into memory, but if you want to do something fancy with those arrays, you need to choose a numpy-like library. This would mean there is also much less code we need to maintain, and we can punt on creating something numpy-like for the web. It doesn't need to be a part of zarr.js at all, but a layer on top that we don't need to be responsible for.

Proposal 3: reduce code for metadata validation

I think that metadata validation should be very minimal. We control how the metadata is written (if writing) but also zarr-python takes care of writing metadata, so I don't think we need the overhead of something like a runtime typescript interface checker, but instead a small function that maybe checks a few fields for compatibility in javascript.

Proposal 4: Remove python-isms

We should think deeply about translating native python-isms to native JavaScript alternative. For example, to implement a valid store, currently we need to throw a custom KeyError from zarr if a chunk is missing. This makes any store implementation require zarr.js as a dependency, and adds the addtional (confusing requirement) and that you also can't bundle that store as a separate dependency (instanceof myCustomStore.KeyError !== instanceof zarr.KeyError) if I bundle myCustomStore independently. I'd like to have the ability to:

import { openArray } from 'https://cdn.skypack.dev/zarr';
import MyCustomStore from 'https://cdn.skypack.dev/@manzt/store'; // no dependency on `zarr`

const store = MyCustomStore();
const z = await openArray({ store });

A KeyError is a built in python error that's thrown when a key is missing from a MutableMapping. In JavaScript, the equivalent is to just return undefined. This would remove a lot of the try/catch blocks from the core implementation and instead we would just call:

// Store returns 'undefined' instead of throwing custom error if chunk is missing.
// Otherwise the store can throw any other error and it will propagate up.
const chunk = await this.store.get(ckey);
if (!chunk) {
  /* create the missing chunk */ 
} else {
  /* decode the chunk */
}

As an aside, I also feel that we should standardize store interfaces to be an extension of an ES6 Maps, but where are the functions are optionally async.

Final thoughts

My apologies for the length of this issue, but these are some thoughts that I've had after using zarr.js on the web for some time. I wanted to note these ideas and my experiments with zarrita.js and zarr-lite here so that we could maybe incorporate some of those lessons learned into a leaner, easier-to-maintain, zarr.js. This is something I might have the time to work on in the future, but it would mean substantial changes to this repo in it's current state so I don't want to proceed with PRs until we've had a longer discussion.

Excited to hear your thoughts.

manzt commented 3 years ago

For reference, here is the output of different zarr-lite exports (with rollup + terser):

$ npm run build 
src/index.js, src/indexing.js, src/httpStore.js → dist...
Created bundle index.js: 3.37 kB → 1.46 kB (gzip)
Created bundle indexing.js: 5.39 kB → 2.07 kB (gzip)
Created bundle httpStore.js: 656 B → 373 B (gzip)

My vision for this repo would be a "batteries included" top-level export dist/zarr.js that is (3) from above.

gzuidhof commented 3 years ago

Thank you for the detailed write up. I am completely in favour of reducing the bundle size of Zarr. I think it makes the most sense to cut fat where there is the most, which is probably in the dependencies.

All of the code in zarr.js itself is probably under 15KB gzipped. There are just not that many lines of code. If we drop something like NestedArray we can maybe shave off a few KB, but it's diminishing returns at that point. A website's logo is probably larger than that. The apps that zarr.js will power will probably be in the hundreds of KBs if not MBs anyway.

Bundlephobia shows the size over time well, but it's a bit misleading with the addition of new compressors.

Errors

I agree we can (and should) switch to errors that can be 'subclassed' or checked in user code. The current dependency on zarr for custom stores indeed doesn't have to be that way.

What I propose we do is introduce some sentinel value for errors. I think that falling back to undefined will make debugging very difficult and sometimes you do want to switch on the kind of error. What we could instead do is use something like this instead of subclassing:

isKeyError(err: Error) {
 return err.zarrError === "KEY_ERROR";
}
// or more generic
isSpecificZarrError(err: Error, s: string) {
 return err.zarrError === s;
}
isZarrError(err: Error) {
 return err.zarrError !== undefined;
}

Then we can export a file with all these string sentinel values for different types of errors.

Switching to a more modern target

This is an easy win, also for performance. I realized we are targeting ES6 at the moment, but I think that's way too old. It doesn't even have await and lots of stuff is being polyfilled. Every await becomes an awaiter polyfill mess.

I don't think people using zarr don't target very old browsers anyway, and if they do they can transpile using Babel themselves. By changing the target and module to ES2019 and ESNext respectively the size of core.mjs went to 123KB from 130KB unminified.

Dropping TS interface checker (`ts-interface-checker`)

Runtime type checking makes sense.. but also somewhat unnecessary if using Typescript. Dropping it can shave off 10-15KB I guesstimate before gzip. Let's drop it.

Shipping a minified unblde

I added a .min.mjs target, it makes sense for dynamic importing. We can play with terser's configuration to shave off another KB or two probably.

After removing the interface checker (saved 16KB minified!), I added a visualization tool to the build step, here's what it looks like (prior to gzip).

(Note I expanded the bundle on the right, so the huge batteries included bundle is not to scale (should be even larger).

minified+gzipped the core bundle has a size of 11841 bytes now which seems reasonable. If we want to squeeze this down more we can think about using dependency injection for certain functionalities to take them out of the class to allow for better treeshaking. Or we can write a smaller p-queue alternative, I think that could be 50 lines of code instead of 8KB minified.

I'll make a branch for these changes

gzuidhof commented 3 years ago

Here's the branch if you want to play with it: https://github.com/gzuidhof/zarr.js/tree/zarr-weight-loss

manzt commented 3 years ago

Thanks for the response! I should have time to review #70 this weekend.

What I propose we do is introduce some sentinel value for errors. I think that falling back to undefined will make debugging very difficult and sometimes you do want to switch on the kind of error.

I think this is a better idea than mine.

Then we can export a file with all these string sentinel values for different types of errors.

Personally, the only error I find myself reusing is KeyError when implementing a custom store, and I'm not really sure what other errors would be useful in user code. A section in the docs with: "Implementing a custom store" would probably suffice.

Switching to a more modern target. This is an easy win, also for performance.

Great idea.

After removing the interface checker (saved 16KB minified!), I added a visualization tool to the build step, here's what it looks like (prior to gzip).

Great, and thanks for adding the visualization! Looking forward to reviewing.

manzt commented 3 years ago

Thinking about this a bit more today wrt key errors. Part of me feels it's "correct" in the language context to mimic missing keys by returning undefined, but I see your point. It would be similarly typed to how an es6 map, except methods are async.

What are your thoughts on throwing a string literal "KEY_ERROR"? I just want something simple that makes store implementations very straightforward and without a dependency on zarr.

gzuidhof commented 3 years ago

"KEY_ERROR" would do the trick, but I would go for {zarrError: "KEY_ERROR"}. So the type for error would be something like:

type ErrorCode = "KEY_ERROR" | "SOME_OTHER_ERROR" | "ETC_ERROR";
type ZarrError = { zarrError: ErrorCode };

One day there will be a store or user that will want another field for something to bubble up information about the event. The first example I can come up with that isn't entirely realistic: There's a store with many keys, but you can only read from some with your user. Getting some keys may result in some sort of "permission denied" error, but with just the string literal it would be hard to let the user know.

gzuidhof / zarr.js