meggart / DiskArrays.jl

Batch getindex #59

Closed meggart closed 2 years ago

meggart commented 2 years ago

This is an attempt to solve the use case brought up here: https://discourse.julialang.org/t/is-it-possible-to-index-into-a-set-of-columns-of-a-3d-array-in-a-single-line/75695 , where one wants to access a random batch of indices from a DiskArray. A simple loop over the indices won't help here, because every single access pays the full read latency, so it is best to first find all affected chunks and then read chunk by chunk.
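To illustrate the "find all affected chunks first" step, here is a minimal sketch in plain Julia. It is not the code from this PR: `group_by_chunk` is a hypothetical helper, and the chunk size `(90, 90)` is only an assumed example value. It buckets the requested indices by the chunk that contains them, so each chunk would need to be read only once.

```julia
# Hypothetical helper, not part of DiskArrays.jl: group CartesianIndex
# requests by the (1-based) coordinates of the chunk that contains them.
function group_by_chunk(inds::AbstractVector{CartesianIndex{N}},
                        chunksize::NTuple{N,Int}) where {N}
    groups = Dict{NTuple{N,Int},Vector{CartesianIndex{N}}}()
    for I in inds
        # Chunk coordinates of index I along every dimension
        c = ntuple(d -> (I[d] - 1) ÷ chunksize[d] + 1, N)
        push!(get!(groups, c, CartesianIndex{N}[]), I)
    end
    return groups
end

# Assumed example: four requested points, chunks of 90x90 elements
inds = [CartesianIndex(1, 1), CartesianIndex(90, 90),
        CartesianIndex(91, 1), CartesianIndex(95, 3)]
groups = group_by_chunk(inds, (90, 90))

# Only two chunks are touched, so two reads would cover all four points
length(groups)  # 2
```

With a grouping like this, the number of reads scales with the number of touched chunks rather than with the number of requested indices, which is what matters when each read is a remote request.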

Here I implemented the function disk_getindex_batch, which supports this workflow, and I extended the normal getindex to work on (partial) vectors of CartesianIndex and Boolean masks. So the following are possible and fast even though the data are remote:

using Zarr
# Open the remote Zarr store and pick one variable
a = zopen("https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v2.1.1/esdc-8d-0.25deg-184x90x90-2.1.1.zarr")
ar = a["air_temperature_2m"]
size(ar)
# Index with a vector of CartesianIndex covering the first two dimensions
indstoread = [CartesianIndex(rand(1000:1100),rand(300:400)) for _ in 1:1000]
ar[indstoread,:]
# Index with a mask that has fewer dimensions than the array
mask = falses(1440,720)
mask[200:202,500:502] .= true
mask[300:305,400:405] .= true
ar[mask,:]

Still missing are unit tests and maybe documentation.

meggart commented 2 years ago

This is starting to take shape. In particular, it will help for many use cases in #61. For example, the following code

using Zarr
a = zopen("https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v2.1.1/esdc-8d-0.25deg-1x720x1440-2.1.1.zarr")["air_temperature_2m"]
av = view(a,:,:,1:200:1840)
av[:,:,:]

runs pretty fast now. The data is chunked with chunk size 1 along time, and when reading from the view only the affected chunks are transferred from the remote source.
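A rough way to see why this is cheap is to count the chunks a strided range touches along the time axis. The sketch below is plain Julia with a hypothetical helper `touched_chunks`; it assumes the time axis has 1840 steps, as the range in the view suggests, and is not how DiskArrays computes this internally.

```julia
# Hypothetical helper: chunk numbers touched by range `r` along one
# dimension, for a regular chunk size.
touched_chunks(r, chunksize) = unique((i - 1) ÷ chunksize + 1 for i in r)

r = 1:200:1840                               # time steps selected by the view
n_touched = length(touched_chunks(r, 1))     # chunk size 1 along time
n_total = cld(1840, 1)                       # all chunks along time (assumed length 1840)

(n_touched, n_total)  # (10, 1840)
```

Under these assumptions only 10 of the 1840 chunks along time need to be fetched for the view.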