manzt / zarrita.js

A JavaScript toolkit for working with chunked, compressed, n-dimensional arrays
https://zarrita.dev
MIT License

`bigint` for `zarr.slice` (and maybe elsewhere) indexing #179

Open · ilan-gold opened this issue 4 months ago

ilan-gold commented 4 months ago

Hello!

Would it be possible to change the `Slice` type (and the `slice` helper) to accept `bigint`? I will make a PR to demo/check this out myself; I just wanted to open an issue first. Maybe I will discover this is not a problem, but at the moment:

`zarr.slice(10n, 15)`

gives me a type error:

Argument of type 'number | bigint' is not assignable to parameter of type 'number | null'.
  Type 'bigint' is not assignable to type 'number'.ts(2345)

manzt commented 4 months ago

Interesting question. I hadn't considered it. Are the offsets you're indexing beyond the largest you can specify with `Number`? Or is this just a convenience for coercing a `bigint` into a `number`?

I can imagine a case where you are indexing a subset of a massive array, and in that case I wonder if we should prefer bigint for slicing and indexing. You can't index into a TypedArray or Array with bigint, so somewhere it needs to be converted into a number (with the offset removed), but I'd rather avoid this complexity unless there is a use case that requires it.
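For context, here is a minimal sketch of the coercion step described above. The helper name `toSafeNumber` is hypothetical, not part of zarrita:

```typescript
// Hypothetical helper: coerce a bigint offset to a number before it is used
// to index a TypedArray, failing loudly if precision would be lost.
function toSafeNumber(value: number | bigint): number {
	if (typeof value === "bigint") {
		const max = BigInt(Number.MAX_SAFE_INTEGER);
		if (value > max || value < -max) {
			throw new RangeError(`${value} cannot be represented exactly as a number`);
		}
		return Number(value);
	}
	return value;
}

console.log(toSafeNumber(10n)); // 10
console.log(toSafeNumber(15)); // 15
```

The `RangeError` guard matters because `Number(value)` silently rounds bigints larger than 2^53 - 1.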

ilan-gold commented 4 months ago

So we do have a specific use-case here, which is indexing into sparse arrays on disk (see https://github.com/scverse/anndata/pull/1348 https://github.com/scverse/anndata/issues/1261 and https://github.com/scverse/anndata/pull/1493).

> You can't index into a TypedArray or Array with bigint

I was surprised to discover that `Array(10).fill(15)[5n]` actually works and returns 15, unless I am missing something, and the same holds for TypedArrays. Maybe I'm misinterpreting this. My understanding, though, is that there is an upper bound on the size of the array, which is 2^31 - 1 (although this can be fudged, which I am using to hopefully construct tests for this).

In any case, I don't really care about the in-memory indexing use-case. I am more interested in pulling ranges from the on-disk data, which I think is a general issue: constructing a slice for a zarr array longer than 2^31 - 1 elements seems like a good use-case. Would this require indexing into an in-memory array with a bigint? I think maybe not, but I'm just starting to peer into the internals here.

ilan-gold commented 4 months ago

Hmmm, there is quite a bit of arithmetic that gets much more challenging with bigint; this is quite complex. I will need to think about it a bit and spend more time with the code base. My use-case is simple, but the potential use-cases in this package are manifold, so implementing this is not straightforward.
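To illustrate the arithmetic problem (purely illustrative, not zarrita code): JavaScript refuses to mix bigint and number operands, and `Math` helpers have no bigint equivalents, so every expression along an indexing path must use one type consistently:

```typescript
// Mixing bigint and number operands throws a TypeError at runtime
// (TypeScript also rejects it statically, hence the cast):
let mixedThrew = false;
try {
	void ((10n as unknown as number) + 5);
} catch {
	mixedThrew = true;
}
console.log(mixedThrew); // true

// Math.ceil has no bigint equivalent, so e.g. the length of a slice
// needs manual ceiling division:
const start = 10n;
const stop = 15n;
const step = 1n;
const length = (stop - start + step - 1n) / step;
console.log(length); // 5n
```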

manzt commented 4 months ago

Right, I was assuming that would be a problem. I can see the value of being able to slice the absolute space of the array using bigint, but we need to turn any of that indexing into relative offsets internally, and there is quite a bit of arithmetic to handle in that case. I'm not sure I'll have time to work on this, but I'd be open to making all internal indexing logic bigint-based and then exposing a more flexible top-level zarr.slice that accepts either number or bigint (and converts to bigint).
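A rough sketch of what that flexible top-level shim could look like (the `BigSlice` shape and the normalization helper are assumptions for illustration, not zarrita's actual API):

```typescript
// Hypothetical bigint-based slice representation; zarrita's internals differ.
interface BigSlice {
	start: bigint | null;
	stop: bigint | null;
	step: bigint | null;
}

// Flexible top-level slice(): accept number or bigint, normalize to bigint.
function slice(
	start: number | bigint | null,
	stop?: number | bigint | null,
	step: number | bigint | null = null,
): BigSlice {
	// Python-style one-argument form: slice(stop)
	if (stop === undefined) {
		[start, stop] = [null, start];
	}
	const toBig = (v: number | bigint | null): bigint | null => {
		if (v === null) return null;
		if (typeof v === "number" && !Number.isSafeInteger(v)) {
			throw new RangeError(`${v} is not a safe integer index`);
		}
		return BigInt(v);
	};
	return { start: toBig(start), stop: toBig(stop), step: toBig(step) };
}

console.log(slice(10n, 15)); // { start: 10n, stop: 15n, step: null }
```

With this shape, the mixed-type call from the original report (`zarr.slice(10n, 15)`) would type-check, and all downstream arithmetic could stay in bigint.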

ilan-gold commented 4 months ago

Oh that'd be interesting and is definitely more approachable. If you're open to it, I can take a whack at it soon.