meggart / DiskArrays.jl

What is the most efficient way to access discrete columns of a chunked array? #49

Closed alex-s-gardner closed 2 years ago

alex-s-gardner commented 2 years ago

What is the most efficient way to access discrete columns of a Zarr array? This follows on from an earlier conversation (linked in the original issue).

Let's say I have a Zarr array that looks like this:

z = rand(100,100,1000)

and let's pretend that z is a chunked Zarr array that lives on S3

Now if I want to get a series of discrete columns from z:

z[3,5,:]
z[5,9,:]
z[10,6,:]

Is there a way I can do this that is efficient and does not read in the same chunk of data more than once?
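To make the concern concrete, here is a minimal sketch of why three separate reads can hit the same chunk repeatedly. The chunk shape `(10, 10)` over the first two dimensions is purely an assumption for illustration (a real Zarr array reports its own chunking), and `chunk_of` and `by_chunk` are hypothetical names, not part of any library.

```julia
# Assumed chunk shape over the first two dimensions (hypothetical).
chunk_size = (10, 10)

# The (row, column) positions requested in the question.
columns = [(3, 5), (5, 9), (10, 6)]

# Map a 1-based position to the 1-based index of the chunk that contains it.
chunk_of(pos) = cld.(pos, chunk_size)

# Group the requested positions by chunk, so each chunk needs only one visit.
by_chunk = Dict{NTuple{2,Int},Vector{NTuple{2,Int}}}()
for pos in columns
    push!(get!(by_chunk, chunk_of(pos), NTuple{2,Int}[]), pos)
end

for (chunk, positions) in by_chunk
    println("chunk ", chunk, " serves positions ", positions)
end
```

Under this assumed chunking all three columns live in chunk `(1, 1)`, so reading them one by one would fetch that chunk three times, while a grouped read would fetch it once.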

alex-s-gardner commented 2 years ago

Digging a bit more I've come up with the following:

mask = falses(size(z))
mask[[CartesianIndex((3,5)),CartesianIndex((5,9)),CartesianIndex((10,6))],:] .= true
z[mask]

Is this the most efficient approach?

alex-s-gardner commented 2 years ago

In my specific case, indexing with a logical array is extremely inefficient, which leaves me without a practical solution:

mask = falses(size(foo["var"]))
mask[1, 1, :] .= true 
foo["var"][mask]

takes 30 seconds to read, whereas

foo["var"][1,1,:]

takes 0.5 seconds to read.

meggart commented 2 years ago

Hi @alex-s-gardner, I would really like to dig into this right now, but due to private issues and some project meetings this week I have to postpone it until next week. If you don't hear anything back, feel free to ping me as a reminder.

meggart commented 2 years ago

This issue has finally been resolved by merging #59. Starting from your example:

z = rand(100,100,1000)
z[3,5,:]
z[5,9,:]
z[10,6,:]

You can now write this as

z[[CartesianIndex((3,5)), CartesianIndex((5,9)),CartesianIndex((10,6))]]

and DiskArrays will make sure each affected chunk is accessed only once, so you get optimal read performance from remote sources. Alternatively, you can do:

mask = falses(100,100)
mask[3,5] = true
mask[5,9] = true
mask[10,6] = true
z[mask,:]

which ultimately uses the same machinery as the example above.
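As a rough check of the two indexing styles, the sketch below runs both on a plain in-memory `Array` using only Base Julia. It assumes nothing about DiskArrays internals, and whether a chunked array returns rows in the same order is not confirmed in this thread. Note that the index vector is followed by `, :` so that the full third dimension is selected.

```julia
z = rand(100, 100, 1000)  # in-memory stand-in for the chunked array

inds = [CartesianIndex(3, 5), CartesianIndex(5, 9), CartesianIndex(10, 6)]
a = z[inds, :]            # rows follow the order of `inds`

mask = falses(100, 100)
for I in inds
    mask[I] = true
end
b = z[mask, :]            # rows follow the column-major order of the `true` entries

# Column-major order of the selected positions: (3,5), (10,6), (5,9).
order = findall(mask)

println(size(a), " ", size(b))
println(b == z[order, :])
```

Both results have size `(3, 1000)`, but on a Base array the mask version returns the columns in column-major order of the mask, which here differs from the order given in `inds`.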

alex-s-gardner commented 1 year ago

Just a note that the above should have been written as:

z[[CartesianIndex((3,5)), CartesianIndex((5,9)), CartesianIndex((10,6))], :]

alex-s-gardner commented 1 year ago

@meggart I noticed that the dimensions get moved around in unintuitive ways:

dc = Zarr.zopen(path2zarr)

size(dc["v"])  # (834, 834, 84396)

r = [1 4 6 1]
c = [7 8 9 3]

cartind = CartesianIndex.(r,c)

size(dc["v"][cartind, :])  # (1, 84396, 4)

cartind = CartesianIndex.(r',c')

size(dc["v"][cartind, :])  # (4, 84396, 1)

This is different behavior from that of a regular (non-DiskArrays) array:

z = rand(100,100,1000)

cartind = CartesianIndex.(r,c)

size(z[cartind,:])  # (1, 4, 1000)

cartind = CartesianIndex.(r',c')

size(z[cartind,:])  # (4, 1, 1000)
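For reference, Base Julia builds the result shape by concatenating the shapes of the indices, which is why a 1×4 index matrix yields a leading `(1, 4)`. The sketch below shows this on a plain `Array`, and also shows that a one-dimensional index vector avoids the singleton dimension altogether. It uses only Base and makes no claim about what DiskArrays returns for the vector case.

```julia
z = rand(100, 100, 1000)

# 1×4 matrices, as in the comment above.
r = [1 4 6 1]
c = [7 8 9 3]

s_row = size(z[CartesianIndex.(r, c), :])            # index shape 1×4
s_col = size(z[CartesianIndex.(r', c'), :])          # index shape 4×1
s_vec = size(z[CartesianIndex.(vec(r), vec(c)), :])  # index shape 4

println(s_row)  # (1, 4, 1000)
println(s_col)  # (4, 1, 1000)
println(s_vec)  # (4, 1000)
```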