vitessce / vitessce-data

Utils for loading HuBMAP data formats
MIT License

change ordering to mz,y,x and chunk for better spatial access #71

Closed manzt closed 4 years ago

manzt commented 4 years ago

Fixes #70 and adds thoughtful chunking.

Under the hood, n-dimensional arrays are stored as contiguous buffers. The element at a given index is located by combining an offset with one stride per dimension:

import numpy as np
arr = np.arange(3 * 5 * 5).reshape(3, 5, 5)

# numpy
a = arr[1,2,4]

# manual with offsets + strides
offset = 0
# arr.strides is in bytes; divide by the item size to get strides in elements
strides = [s // arr.itemsize for s in arr.strides]
b = arr.ravel()[offset + strides[0] * 1 + strides[1] * 2 + strides[2] * 4]

a == b == 39 # True
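
To make the effect of axis order concrete, here is a minimal pure-Python sketch (no numpy) that computes C-order element strides and checks whether one mz plane occupies a contiguous run of the buffer. The helper names and the toy sizes are made up for illustration and are not part of this PR.

```python
from itertools import product

def c_strides(shape):
    """Element strides (not bytes) for a C-ordered array of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def plane_offsets(shape, axis, index):
    """Sorted flat offsets of every element where `axis` equals `index`."""
    strides = c_strides(shape)
    ranges = [[index] if d == axis else range(n) for d, n in enumerate(shape)]
    return sorted(
        sum(i * s for i, s in zip(idx, strides)) for idx in product(*ranges)
    )

def is_contiguous(offsets):
    """True if the (distinct, sorted) offsets form one unbroken run."""
    return offsets[-1] - offsets[0] + 1 == len(offsets)

# Toy sizes (hypothetical): 3 mz channels over a 4 x 5 image.
mz_first = plane_offsets((3, 4, 5), axis=0, index=1)  # layout (mz, y, x)
mz_last = plane_offsets((4, 5, 3), axis=2, index=1)   # layout (y, x, mz)

print(c_strides((3, 4, 5)))     # (20, 5, 1)
print(is_contiguous(mz_first))  # True
print(is_contiguous(mz_last))   # False
```

With mz as the leading axis, a channel is a single contiguous slab; with mz last, the same channel is scattered through the buffer with a gap between consecutive pixels.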

The ordering of dimensions therefore matters: it determines which dimensions can be read quickly. For rendering 2D layers (i.e. a particular mz channel), it makes sense to order the array as (mz, y, x) and chunk it so that each chunk holds one full (y, x) plane, which lets the client fetch an entire channel as a single chunk.

import { openArray } from 'zarr';

const config = {
    store: "http://vitessce-data.s3.amazonaws.com/<version>",
    path: "spraggins.ims.zarr",
};

const z = await openArray(config); // initialize connection with store
console.log(z.chunks); // [1, 602, 733]
const mz_channel = await z.getRaw([3, null, null]); // mz channel 3: single TypedArray of length 602 x 733
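
As a rough illustration of why this chunk shape helps, the sketch below lists the chunk keys needed to read one channel. It assumes Zarr v2-style keys (chunk indices joined by "."), a hypothetical count of 100 mz channels, and a hypothetical (1, 256, 256) chunking for comparison; the helper names are illustrative only.

```python
def chunk_key(index, chunks):
    """Zarr v2-style key of the chunk containing element `index`."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))

def keys_for_channel(channel, shape, chunks):
    """All chunk keys needed to read one full (y, x) plane at `channel`."""
    _, n_y, n_x = shape
    keys = set()
    # Visit the first element of every chunk along y and x.
    for y in range(0, n_y, chunks[1]):
        for x in range(0, n_x, chunks[2]):
            keys.add(chunk_key((channel, y, x), chunks))
    return sorted(keys)

shape = (100, 602, 733)  # 100 mz channels is a hypothetical count

print(keys_for_channel(3, shape, (1, 602, 733)))       # ['3.0.0']
print(len(keys_for_channel(3, shape, (1, 256, 256))))  # 9
```

Under these assumptions, rendering channel 3 with the chunking in this PR needs one request (a key like `3.0.0` under `spraggins.ims.zarr`), whereas smaller spatial chunks would need several.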

manzt commented 4 years ago

I've had some discussion with Heath at Vanderbilt about creating a second zarr store that holds the same data but orders it so that the mz axis is contiguous. The tradeoff is extra computation and storage, in exchange for better performance when making selections along the mz axis of the data.

Use case: we fetch "images" at certain mz channels from the x,y-optimized store, and then use an x,y selection made on screen to fetch the mz ratios for that selection and show a spectrum view. Heath noted that the biggest commercial software for IMS (scils.de) does this type of optimization.
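
A back-of-the-envelope sketch of the tradeoff: count how many chunks each access pattern touches under each layout. The channel count (100), the 20 x 20 pixel selection, and the (N_MZ, 32, 32) chunking for the mz-contiguous store are all hypothetical numbers picked for illustration, not something we have decided on.

```python
from math import prod

def chunks_touched(selection, chunks):
    """Number of chunks overlapping a selection of (start, stop) ranges."""
    return prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(selection, chunks)
    )

N_MZ = 100  # hypothetical channel count

image_chunks = (1, 602, 733)      # layout from this PR: one plane per chunk
spectrum_chunks = (N_MZ, 32, 32)  # hypothetical mz-contiguous layout

# One full image at mz channel 3.
image_sel = [(3, 4), (0, 602), (0, 733)]
# Full spectrum for a 20 x 20 pixel region selected on screen.
spectrum_sel = [(0, N_MZ), (100, 120), (200, 220)]

print(chunks_touched(image_sel, image_chunks))        # 1
print(chunks_touched(image_sel, spectrum_chunks))     # 437
print(chunks_touched(spectrum_sel, image_chunks))     # 100
print(chunks_touched(spectrum_sel, spectrum_chunks))  # 1
```

Each layout is cheap for one access pattern and expensive for the other, which is the argument for keeping two stores despite the extra storage.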