Closed: bjarthur closed this 7 months ago
Probably not a DD problem: here we just resolve the dims and selectors to regular indices and pass them to the parent array.
YAX/NetCDF/DiskArrays.jl do the actual reading; @meggart may know more about that part.
(Also, just try it with Int/colon on the parent array so DD is not involved. `DD.dims2indices(A, I)` will get you the resolved indices.)
can you elaborate on `DD.dim2indices`? i'm not following...
also, you see above how i showed it's fast without `At` and slow with it? wouldn't that indicate it is a DD problem?
Sorry, typo: `dims2indices`.
DimensionalData is mostly a layer over AbstractArray indexing. It calls `dims2indices` on your dimensions and selectors, then passes the results to the parent object.
Using `inds = DD.dims2indices(NC, (At... etc))` manually to get the resolved indices will show you what's actually happening. Pass the selectors as a tuple in the second argument.
Then passing the resulting indices to `parent(NC)[inds...]` will give you a benchmark where DD is not involved at all, and hopefully help us assign blame ;)
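As a minimal sketch of that workflow (the toy array and lookup values below are made up for illustration):

```julia
using DimensionalData

# Toy DimArray: X lookup is 10:10:30, Y lookup is 1:4.
A = DimArray(rand(3, 4), (X(10:10:30), Y(1:4)))

# Resolve selectors to plain indices; unselected dims become Colon().
inds = DimensionalData.dims2indices(A, (X(At(20)),))  # (2, Colon())

# Index the parent array directly, with DimensionalData out of the loop.
parent(A)[inds...]
```

Benchmarking the two steps separately shows whether the cost is in selector resolution or in the actual read.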
indeed, the problem lies elsewhere:
```julia
julia> @time I = DimensionalData.dims2indices(ZARR, (X(At(-1)),))
  0.000007 seconds (1 allocation: 16 bytes)
(12, Colon(), Colon())

julia> @time getindex(ZARR, I...);
  0.006218 seconds (3.53 k allocations: 1.637 MiB)

julia> @time I = DimensionalData.dims2indices(ZARR, (X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))))
  0.000006 seconds (3 allocations: 8.906 KiB)
(12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  991, 992, 993, 994, 995, 996, 997, 998, 999, 1000])

julia> @time getindex(ZARR, I...);
  0.011937 seconds (108.19 k allocations: 19.329 MiB)

julia> @time I = DimensionalData.dims2indices(NC, (X(At(-1)),))
  0.000002 seconds (1 allocation: 16 bytes)
(12, Colon(), Colon())

julia> @time getindex(NC, I...);
  0.001656 seconds (171 allocations: 793.469 KiB)

julia> @time I = DimensionalData.dims2indices(NC, (X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))))
  0.000008 seconds (3 allocations: 8.906 KiB)
(12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  991, 992, 993, 994, 995, 996, 997, 998, 999, 1000])

julia> @time getindex(NC, I...);
 77.239059 seconds (15.93 M allocations: 1.158 GiB, 0.08% gc time)
```
thanks! will keep digging...
Probably NetCDF (DiskArrays.jl) chunk loading of that vector. Likely already fixed on DiskArrays.jl main.
But either way, using `At` like that is not great. Why not use `..` for this and get a range back? A vector throws away the structural information, and `At` will do hundreds of lookups where `..` does 2.
(DiskArrays.jl on main is just smart enough to put the range back together!!!)
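A small sketch of the difference (toy lookups; only the shape of the resolved indices matters here):

```julia
using DimensionalData

A = DimArray(rand(5), (X(10:10:50),))

# At with a collection does one lookup per element and resolves to a Vector,
# throwing away the information that the selection is contiguous:
DimensionalData.dims2indices(A, (X(At(10:10:30)),))

# .. needs only the two endpoint lookups and resolves to a range,
# which a disk-backed array can serve as a single contiguous read:
DimensionalData.dims2indices(A, (X(10 .. 30),))
```

Both select the same three elements, but only the second form keeps the range structure that chunked readers can exploit.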
I am wondering what the comparison looks like if you use `Between` for the ranges instead of `At`. Could we be running into multiple scalar indexing?
`Between` is deprecated in favor of `..`
tried to debug with the master branch of all related packages and there are unsatisfiable requirements in the versions:
```
(TXYZTranscriptomeGUI) pkg> activate --temp
  Activating new project at `/var/folders/s5/8d629n5d7nsf37f60_91wzr40000gq/T/jl_hzHUzN`

(jl_hzHUzN) pkg> dev YAXArrays YAXArrayBase DimensionalData Zarr NetCDF DiskArrays
     Cloning git-repo `https://github.com/JuliaDataCubes/YAXArrayBase.jl.git`
     Cloning git-repo `https://github.com/JuliaGeo/NetCDF.jl.git`
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package DiskArrays [3c3547ce]:
 DiskArrays [3c3547ce] log:
 ├─possible versions are: 0.4.0 or uninstalled
 ├─restricted to versions 0.3 by NetCDF [30363a11] — no versions left
 │ └─NetCDF [30363a11] log:
 │   ├─possible versions are: 0.11.7 or uninstalled
 │   └─NetCDF [30363a11] is fixed to version 0.11.7
 └─DiskArrays [3c3547ce] is fixed to version 0.4.0
```
You may need to edit the Project.toml of NetCDF.jl to live on the edge like that...
And be warned, we are talking about changes merged in the last few days; here be dragons.
And the question remains: why use `At` like that at all, even on the old versions? Why not `..`?
i use `At` in the production code with a vector that might have missing indices in the middle. just simplified it here for debugging purposes to use a contiguous range.
the netcdf file seems to be reopened for each chunk as it iterates over them! specifically, this line here:
https://github.com/meggart/DiskArrays.jl/blob/v0.3.23/src/batchgetindex.jl#L91
calls this line:
https://github.com/JuliaDataCubes/YAXArrayBase.jl/blob/master/src/datasets/netcdf.jl#L29
not sure why the netcdf code can't mimic the zarr code, where `eachchunk()` could be defined as `DiskArrays.eachchunk(a::NcVar) = DiskArrays.GridChunks(a, a.chunksize)`. the chunk size is stored in both structs:
```julia
julia> zarr.metadata.chunks
(1, 10, 100)

julia> nc.chunksize
(100, 10, 1)
```
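For intuition, a minimal Base-Julia sketch of what an `eachchunk`-style chunk grid looks like when built from an array size and a chunk size (the function names here are hypothetical, not the actual DiskArrays.jl internals):

```julia
# Partition each dimension into index ranges of at most the chunk length,
# then take the Cartesian product to get the full grid of chunks.
chunk_ranges(sz::Int, cs::Int) = [i:min(i + cs - 1, sz) for i in 1:cs:sz]
grid_chunks(sz::Tuple, cs::Tuple) =
    collect(Iterators.product(map(chunk_ranges, sz, cs)...))

# A 4×6 array in 2×3 chunks gives a 2×2 grid of (row, column) range pairs.
grid_chunks((4, 6), (2, 3))
```

Since both `NcVar` and the zarr metadata already carry the chunk size tuple, building such a grid once up front avoids reopening the file per chunk.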
Maybe try a PR changing it?
Also closing, as this is really way outside of DD's sphere of influence ;)
There should be branches on all relevant packages for the DiskArrays break. Let's get them merged and revisit next week; we can reopen this on DiskArrays or YAXArrays as needed.
but only when using `At()` for most dimensions: