holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Quick out-of-core downsampling to provide an overview of large data #560

Open jbednar opened 6 years ago

jbednar commented 6 years ago

As suggested in https://github.com/bokeh/datashader/issues/553, we need to improve our raster support so that it can access large rasters chunk by chunk. This will let us work with raster datasets larger than system memory, just as we can currently work with larger-than-memory columnar Dask dataframes, and should make it very practical to work with a small, zoomed-in region of a very large raster dataset.
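For concreteness, a minimal sketch of the intended access pattern, assuming the raster lives in a NetCDF file with a 2D variable and x/y coordinates (the filename, variable name, and chunk sizes below are placeholders):

```python
import xarray as xr
import datashader as ds

# open lazily with dask-backed chunks so that no data is read yet
raster = xr.open_dataset("large_raster.nc", chunks={"x": 2048, "y": 2048})
elevation = raster["elevation"]          # placeholder 2D variable

# aggregate onto a small canvas; ideally only the chunks that intersect the
# requested x_range/y_range would ever be read from disk
cvs = ds.Canvas(plot_width=800, plot_height=600)
agg = cvs.raster(elevation)
```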

However, even if full resolution is needed only on deep zooms, very often it is not clear which region of the large dataset should be viewed at full resolution, until a zoomed-out overview of the data has been provided to the user. And unfortunately, constructing such an overview using datashader will currently require making a complete pass through the entire dataset, reading all of the data into Python in order to aggregate it into a coarse grid.

Even if the Python code were to explicitly downsample the data, accessing only a strided subset of the data from disk, with most file formats and storage technologies the entire dataset will still end up being accessed.
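For example, explicitly strided access against a chunked HDF5 file looks like this (filename and dataset name are placeholders), yet HDF5 still has to read and decode every chunk touched by the slice, which for a coarse stride is essentially all of them:

```python
import h5py

step = 32                                   # keep every 32nd value in each dimension
with h5py.File("large_data.h5", "r") as f:  # placeholder file and dataset names
    overview = f["values"][::step, ::step]  # strided read: most chunks are still decoded
```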

After the above issue is addressed, it would be great if we could work on some specific combinations of aggregations and file formats where constructing such an overview would be relatively cheap. We could also consider making examples of having some pre-processed overview of the data stored with it, and then using HoloViews to switch dynamically between the overview and the full data depending on zoom level.
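A rough sketch of that last idea, using HoloViews to pick between a precomputed coarse overview and the full-resolution data based on the current x range (the arrays, the coarsening factor, and the 10x zoom threshold below are all placeholders):

```python
import numpy as np
import xarray as xr
import holoviews as hv
from holoviews.streams import RangeXY

hv.extension("bokeh")

# placeholder data: in practice full_res would be a dask-backed array on disk
# and overview a precomputed coarse aggregate stored alongside it
x = np.linspace(0.0, 1.0, 4000)
y = np.linspace(0.0, 1.0, 4000)
full_res = xr.DataArray(np.random.random((4000, 4000)),
                        coords={"y": y, "x": x}, dims=("y", "x"), name="value")
overview = full_res.coarsen(x=10, y=10).mean()

full_width = float(full_res.x.max() - full_res.x.min())

def view(x_range, y_range):
    # zoomed out (or no range yet): serve the cheap precomputed overview;
    # zoomed in past an arbitrary 10x threshold: slice into the full data
    if x_range is None or (x_range[1] - x_range[0]) > full_width / 10:
        return hv.Image(overview)
    sel = full_res.sel(x=slice(*x_range), y=slice(*y_range))  # assumes ascending coords
    return hv.Image(sel)

dmap = hv.DynamicMap(view, streams=[RangeXY()])
```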

jbednar commented 6 years ago

Note that I'm discussing raster data above, but the same issues apply to timeseries data, where a very long timeseries needs a quick overview that can be zoomed in to show the full data.
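The timeseries case would look something like this (a sketch with synthetic data; in practice the frame would be a larger-than-memory dask DataFrame read from disk):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds

# synthetic stand-in for a very long timeseries
n = 1_000_000
frame = dd.from_pandas(
    pd.DataFrame({"time": np.arange(n, dtype="float64"),
                  "value": np.cumsum(np.random.randn(n))}),
    npartitions=16)

# coarse overview of the whole series on a small canvas
cvs = ds.Canvas(plot_width=1000, plot_height=300)
overview = cvs.line(frame, x="time", y="value")
```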

LEMUEGGE commented 6 years ago

I am currently implementing something similar for our data. I came across #553 and then started working on this myself. So far I haven't found a way to wrap xarray's apply_ufunc or dask's map_blocks, even though the latter would support ghosting, which, as mentioned by @philippjfr, will probably be important for the sampling. The problem is that one needs to know the size of the processed data chunks, and these will be chunk-dependent for non-symmetrically chunked datasets (e.g. an array of size 2055 along one axis, chunked into 1000, 1000, 55).

However, for the sampling methods I am currently interested in, nearest-neighbor upsampling and area-mean downsampling, there seems to be a satisfactory solution using dask.delayed and manual creation of chunks. I addressed the issue of variable chunk sizes in asymmetric datasets by instead dividing the plot canvas into equal tiles and mapping (as closely as possible) equal data chunks onto those tiles. My algorithm looks like this:

1. data = dask.array.from_array(hdf5_file, chunks=(1000, 1000))
2. Preprocess xrange and yrange to match the data extent and resolution.
3. Estimate the number of plot tiles:
   - xtiles = max(xrange // 2000, 1)  # something that fits comfortably in memory
   - ytiles = max(yrange // 2000, 1)
4. dask_delayed = [dask.delayed(resampling_function)(data[ith chunk in dim 1, jth chunk in dim 2], plotwidth, plotheight, an indicator of which tile (its position in the canvas and its extents) this data should be mapped to, as a function of (i, j)) for i, j in meshgrid(range(xtiles), range(ytiles))]
5. agg = dask.delayed(numpy.sum)(dask_delayed)
6. agg.compute(num_workers=4)

Inside the resampling_function, I initialize an array of zeros the size of the plot canvas and fill one tile of it with the sampled array from gridtools.resample_2d. When I sum over dask_delayed at the end, I end up with the complete agg (see the self-contained sketch below). I am still testing the equivalence, but so far it seems to work very well: on my laptop I can downsample more than 7 GB in a few seconds.
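A minimal, self-contained version of that scheme might look like the following (a sketch only: the in-memory random array stands in for the HDF5-backed data, a plain block mean stands in for gridtools.resample_2d, and all sizes are illustrative):

```python
import numpy as np
import dask
import dask.array as da

# stand-in for dask.array.from_array(hdf5_file, chunks=(1000, 1000))
data = da.random.random((8000, 8000), chunks=(2000, 2000))

plot_height, plot_width = 400, 400              # size of the plot canvas
ytiles, xtiles = data.numblocks                 # here: one plot tile per data chunk
tile_h, tile_w = plot_height // ytiles, plot_width // xtiles

def resample_tile(block, i, j):
    """Downsample one data chunk and place it in its tile of a full-size canvas."""
    canvas = np.zeros((plot_height, plot_width))
    bh, bw = block.shape
    # crude area-mean downsampling over (roughly) equal sub-blocks; handles
    # ragged edge chunks such as the 1000, 1000, 55 case mentioned above
    ys = np.linspace(0, bh, tile_h + 1, dtype=int)
    xs = np.linspace(0, bw, tile_w + 1, dtype=int)
    tile = np.array([[block[ys[r]:ys[r + 1], xs[c]:xs[c + 1]].mean()
                      for c in range(tile_w)] for r in range(tile_h)])
    canvas[i * tile_h:(i + 1) * tile_h, j * tile_w:(j + 1) * tile_w] = tile
    return canvas

# one delayed task per (data chunk, plot tile) pair; dask materializes each
# chunk only inside its own task
delayed_tiles = [dask.delayed(resample_tile)(data.blocks[i, j], i, j)
                 for i in range(ytiles) for j in range(xtiles)]

# elementwise sum of all per-tile canvases gives the full overview
agg = dask.delayed(np.sum)(delayed_tiles, axis=0)
result = agg.compute(num_workers=4)
```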

In addition, I added a cache of the fully downsampled dataset to my plot class, so that resetting to the full data extent (by far the biggest computation) is only performed once.
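The caching itself can be as simple as the following (a hypothetical snippet; the downsample callable stands in for whatever computes the full-extent overview, e.g. the dask.delayed pipeline above):

```python
class CachedOverviewPlot:
    """Hypothetical plot class that caches the full-extent overview."""

    def __init__(self, data, downsample):
        self.data = data
        self.downsample = downsample      # function computing the full overview
        self._full_overview = None        # filled lazily on first reset

    def full_overview(self):
        if self._full_overview is None:   # the expensive full-extent pass runs once
            self._full_overview = self.downsample(self.data)
        return self._full_overview
```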

LEMUEGGE commented 6 years ago

Thinking a little more about it, the above works well for me without any chunk overlap because of the nature of the data (a matrix of probabilities): the values are of similar magnitude within their respective neighborhoods and will converge to an area mean regardless of where exactly the split occurs. For more heterogeneous data I would definitely expect problems in the downsampling process, although one could alter the above algorithm to use data[(ith + 1/2) chunk : (jth + 1/2) chunk] to get some overlap. In the upsampling case we are only dealing with one data chunk anyway, on a canvas of reasonable size (w, h < 1000 pixels), so no overlap is needed.
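If overlap does turn out to be necessary, dask's built-in overlap machinery (the ghosting mentioned above) may be an alternative to shifting the chunks by hand; the depth values here are only illustrative:

```python
import dask.array as da
from dask.array.overlap import overlap

data = da.random.random((8000, 8000), chunks=(2000, 2000))   # stand-in array

# add 500 elements of overlap on every side of each chunk, reflecting at the
# array boundary, so each downsampling task also sees its neighborhood
overlapped = overlap(data, depth={0: 500, 1: 500}, boundary="reflect")
```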

youngsaj commented 5 years ago

Now that #553 is done, and I've been able to use it successfully, this issue comes to light again. When I downsample and then zoom, I end up with the wrong range during the zoom. Are there any tricks to using resample_2d that would help? It seems like the approach above was a mechanism for solving #553; what I want to solve is not having to read all of the data in the end. I just want a very partial read of the data for a fast overview, while still having the zoom ranges work correctly. I've done it by brute force, constantly rereading, but would prefer some sort of mixed datashader mechanism, or recommendations on how to get there.
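For concreteness, the brute-force pattern meant here is roughly the following (a hypothetical sketch, not the actual code: the HDF5 layout, names, and coordinate mapping are all made up):

```python
import h5py
import numpy as np
import xarray as xr
import datashader as ds

def window_overview(path, x_range, y_range, x_extent, y_extent, shape,
                    width=800, height=600):
    """Reread only the window covered by the current zoom ranges and aggregate it."""
    ny, nx = shape
    # map data-coordinate ranges to array indices (assumes ascending, regular coords)
    x0 = int((x_range[0] - x_extent[0]) / (x_extent[1] - x_extent[0]) * nx)
    x1 = int((x_range[1] - x_extent[0]) / (x_extent[1] - x_extent[0]) * nx)
    y0 = int((y_range[0] - y_extent[0]) / (y_extent[1] - y_extent[0]) * ny)
    y1 = int((y_range[1] - y_extent[0]) / (y_extent[1] - y_extent[0]) * ny)
    with h5py.File(path, "r") as f:
        window = f["values"][y0:y1, x0:x1]       # partial read of just this window
    window_da = xr.DataArray(
        window, dims=("y", "x"),
        coords={"y": np.linspace(y_range[0], y_range[1], window.shape[0]),
                "x": np.linspace(x_range[0], x_range[1], window.shape[1])})
    cvs = ds.Canvas(plot_width=width, plot_height=height,
                    x_range=x_range, y_range=y_range)
    return cvs.raster(window_da)
```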