Closed alex-s-gardner closed 2 years ago
Digging a bit more I've come up with the following:
mask = falses(size(z))
mask[[CartesianIndex((1,5)),CartesianIndex((5,9)),CartesianIndex((10,6))],:] .= true
z[mask]
Is this the most efficient approach?
In my specific case, indexing with a logical array is extremely inefficient, which leaves me without a practical solution:
mask = falses(size(foo["var"]))
mask[1, 1, :] .= true
foo["var"][mask]
takes 30 seconds to read in and:
foo["var"][1,1,:]
takes 0.5 seconds to read
Hi @alex-s-gardner, I would really like to dig into this right now, but due to private issues and some project meetings this week I have to postpone it until next week. If you don't hear back, feel free to ping me as a reminder.
Finally, the issue got solved by merging #59. Starting from your example:
z = rand(100,100,1000)
z[3,5,:]
z[5,9,:]
z[10,6,:]
You can now write this as
z[[CartesianIndex((3,5)), CartesianIndex((5,9)),CartesianIndex((10,6))]]
and DiskArrays will make sure every affected chunk is accessed only once, so you get optimal read performance from remote sources. Alternatively, you can do:
mask = falses(100,100)
mask[3,5] = true
mask[5,9] = true
mask[10,6] = true
z[mask,:]
which ends up using the same machinery as the example above.
Just a note that the above should have been written as: z[[CartesianIndex((3,5)), CartesianIndex((5,9)),CartesianIndex((10,6))], :]
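For anyone comparing the two forms: a minimal sketch on a plain in-memory array (not a DiskArray, so it only shows the indexing semantics of Base Julia, not the chunk access pattern). One thing worth noticing is that the mask form returns the selected columns in column-major order of the mask, which need not match the order of the CartesianIndex list. Whether a DiskArray follows exactly the same ordering is an assumption I have not verified here.

```julia
# In-memory stand-in for the chunked array from the example above.
z = rand(100, 100, 1000)

# Form 1: a vector of CartesianIndex over the first two dimensions, colon for the third.
inds = [CartesianIndex(3, 5), CartesianIndex(5, 9), CartesianIndex(10, 6)]
a = z[inds, :]

# Form 2: a 2-D boolean mask over the first two dimensions.
mask = falses(100, 100)
mask[3, 5] = true
mask[5, 9] = true
mask[10, 6] = true
b = z[mask, :]

# Both results are 3×1000 matrices, one row per selected column.
# Rows of `a` follow the order of `inds`; rows of `b` follow the
# column-major order of the true entries in `mask`: (3,5), (10,6), (5,9).
println(size(a), " ", size(b))
println(a[[1, 3, 2], :] == b)
```

So if the order of the returned columns matters, the CartesianIndex form is the easier one to reason about.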
@meggart I noticed that the dimensions get moved around in unintuitive ways:
dc = Zarr.zopen(path2zarr)
size(dc["v"])
(834, 834, 84396)
r = [1 4 6 1]
c = [7 8 9 3]
cartind = CartesianIndex.(r,c)
size(dc["v"][cartind, :])
(1, 84396, 4)
cartind = CartesianIndex.(r',c')
size(dc["v"][cartind, :])
(4, 84396, 1)
This differs from the behavior of regular (non-DiskArrays) arrays:
z = rand(100,100,1000)
cartind = CartesianIndex.(r,c)
size(z[cartind,:])
(1, 4, 1000)
cartind = CartesianIndex.(r',c')
size(z[cartind,:])
(4, 1, 1000)
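A side note on the shapes in the in-memory case: `[1 4 6 1]` (space-separated) is a 1×4 matrix rather than a vector, so `CartesianIndex.(r, c)` is a 1×4 matrix of indices and contributes two dimensions to the result. A small sketch with a plain `Array`, using comma-separated vectors instead (the names `r_mat`, `c_mat` and `cols` are only for illustration). What a DiskArray returns for the vector case is not checked here.

```julia
z = rand(100, 100, 1000)

# Space-separated literals build 1×4 matrices, so the index array is 2-D.
r_mat = [1 4 6 1]
c_mat = [7 8 9 3]
println(size(z[CartesianIndex.(r_mat, c_mat), :]))   # (1, 4, 1000)

# Comma-separated literals build vectors, so the index array is 1-D.
r = [1, 4, 6, 1]
c = [7, 8, 9, 3]
cols = z[CartesianIndex.(r, c), :]
println(size(cols))                                   # (4, 1000)

# Row i of `cols` is the series at (r[i], c[i]).
println(cols[2, :] == z[4, 8, :])                     # true
```

With an existing matrix of indices, `vec(cartind)` gives the same 1-D form without rebuilding `r` and `c`.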
What is the most efficient way to access discrete columns of a zarr array? This follows on from the conversation: here
Let's say I have a Zarr array that looks like this:
z = rand(100,100,1000)
and let's pretend that z is a chunked Zarr array that lives on S3
Now if I want to get a series of discrete columns from z:
Is there a way I can do this that is efficient and does not read in the same chunk of data more than once?