glue-viz / glue

Linked Data Visualizations Across Multiple Files
http://glueviz.org

random_views_for_dask_array is not so random #2413

Open jfoster17 opened 1 year ago

jfoster17 commented 1 year ago

Describe the bug
random_views_for_dask_array (called by calculate_statistic when we have a large dask array) only returns a representative random sample of the full dask array if there is no structure in the underlying array that is an integer multiple of the chunk size.

In some biology imaging data, there is structure in the image that aligns with dask chunk boundaries. Specifically, this is the case when the full image is made by combining multiple sub-images ("wells"): each well can be a single chunk in the resulting dask array, and each well has a typical illumination pattern in which the edges are dimmer and the middles are brighter.

This could arise in other applications as well -- wide-field images in astronomy might store sub-arrays as discrete chunks with significant structure within each chunk.

The problem is that random_views_for_dask_array takes a deterministic slice of each chunk. In the fairly typical case where n_per_chunk is smaller than the size of a chunk, this function returns a piece along the edge of each chunk. If, as above, there is structure such that the edge of each chunk is systematically lower or higher than the average, statistics computed from just the edges of the array are badly biased.
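The bias described above can be illustrated without glue or dask at all. The sketch below (plain NumPy; all names are hypothetical, not glue code) builds an image from identical "wells" that are bright in the middle and dim at the edges, then samples the same edge strip from every well, mimicking a deterministic per-chunk view:

```python
import numpy as np

# Each "chunk" mimics a per-well illumination pattern: bright center, dim edges.
chunk = np.outer(np.hanning(64), np.hanning(64))
image = np.tile(chunk, (4, 4))  # 16 identical chunks

# A deterministic per-chunk view always grabs the same thin edge strip...
edge_samples = [image[i * 64:(i + 1) * 64, 0:1] for i in range(4)]
edge_mean = np.mean(edge_samples)

# ...which badly underestimates the true mean of the data.
true_mean = image.mean()
print(edge_mean, true_mean)  # edge-only "sample" is far dimmer than the data
```

Here the edge strips are all zero (the Hann window vanishes at the boundary), so any statistic computed from them is unrepresentative of the array as a whole.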

To Reproduce
Here we set up sub-arrays with a lot of internal structure (using arange) and combine them into a single larger array.

from glue.utils import random_views_for_dask_array
import dask.array as da
import numpy as np

sub_arrays = [da.arange(4096).reshape(64, 64).T for _ in range(32)]
c = da.concatenate(sub_arrays, axis=0)
# problem happens if random_subset // n_chunks << chunk_size
random_subset = 1000
n_chunks = 8
random_subset_indices_dask = (c.size, random_views_for_dask_array(c, random_subset, n_chunks))

(131072, [(slice(256, 320, None), slice(0, 1, None)), (slice(576, 640, None), slice(0, 1, None)), (slice(384, 448, None), slice(0, 1, None)), (slice(1792, 1856, None), slice(0, 1, None)), (slice(256, 320, None), slice(0, 1, None)), (slice(576, 640, None), slice(0, 1, None)), (slice(1344, 1408, None), slice(0, 1, None)), (slice(1344, 1408, None), slice(0, 1, None))])

All the slices are one thin strip along the edge of the sub-array, so our "random" sample is exactly the same piece of each sub-array.

# This is what compute_statistic does with the resulting indices
data = da.hstack([c[slices].ravel() for slices in random_subset_indices_dask[1]])
data.max().compute()
>>> 63
# ...while the true maximum of the full array is much larger:
c.max().compute()
>>> 4095

Expected behavior
random_views_for_dask_array should return a more representative random subset; otherwise min/max values for image scaling and (especially) histograms may be badly set.
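One possible direction for a fix (a sketch only, not the glue implementation, and the function name is hypothetical) is to draw each view from a random offset inside its chunk instead of always starting at the chunk edge, so structure aligned with chunk boundaries cannot bias every sample the same way:

```python
import numpy as np

def random_views_with_offsets(chunk_extents, n_per_chunk, rng=None):
    """Sketch of a fix: for each 1-D chunk extent (start, stop), return a
    slice of length n_per_chunk placed at a random offset within the chunk,
    rather than a deterministic slice at the chunk edge."""
    rng = np.random.default_rng() if rng is None else rng
    views = []
    for start, stop in chunk_extents:
        size = stop - start
        n = min(n_per_chunk, size)
        offset = rng.integers(0, size - n + 1)  # random, not always 0
        views.append(slice(start + offset, start + offset + n))
    return views

# Example: four chunks of length 64; the strips no longer all hug the edge.
views = random_views_with_offsets(
    [(0, 64), (64, 128), (128, 192), (192, 256)], 8,
    rng=np.random.default_rng(0))
```

Randomizing the within-chunk offset (and, for n-dimensional chunks, the offset along each axis independently) keeps the cheap one-view-per-chunk access pattern while removing the alignment between samples and any chunk-periodic structure.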

Screenshots
Here is the (too-large-to-share) file I was using when I discovered this problem, with a typical illumination pattern within each chunk.
[screenshot: Chunk-dask-array]
