rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
274 stars 39 forks source link

NetCDF is 1000x slower when indexing compared to Zarr #642

Closed bjarthur closed 7 months ago

bjarthur commented 7 months ago

but only when using At() for most dimensions:

julia> using YAXArrays, YAXArrayBase, DimensionalData, Zarr, NetCDF, DiskArrays

       # create a random DimArray

julia> A = DimArray(rand(12, 100, 1000),
                    (X(-12:-1), Y(-100:-1), Z(-1000:-1)),
                    metadata = Dict{String, Any}());

       # save it as Zarr and NetCDF, using YAXArray (is there another way?)

julia> YAX = yaxconvert(YAXArray, A);

julia> YAX_chunked = setchunks(YAX, (1,10,100));

julia> savecube(YAX_chunked, "A.zarr", driver=:zarr, overwrite=true);

julia> savecube(YAX_chunked, "A.nc", driver=:netcdf, overwrite=true, compress=true);

       # read it back in as a DimArray

julia> _ZARR = Cube("A.zarr")
12×100×1000 YAXArray{Float64,3} with dimensions: 
  X Sampled{Int64} -12:1:-1 ForwardOrdered Regular Points,
  Y Sampled{Int64} -100:1:-1 ForwardOrdered Regular Points,
  Z Sampled{Int64} -1000:1:-1 ForwardOrdered Regular Points
name: layer
Total size: 9.16 MB

julia> ZARR = DimArray(_ZARR.data, dims(_ZARR));

julia> _NC = Cube("A.nc")
12×100×1000 YAXArray{Float64,3} with dimensions: 
  X Sampled{Int64} -12:1:-1 ForwardOrdered Regular Points,
  Y Sampled{Int64} -100:1:-1 ForwardOrdered Regular Points,
  Z Sampled{Int64} -1000:1:-1 ForwardOrdered Regular Points
name: layer
Total size: 9.16 MB

julia> NC = DimArray(_NC.data, dims(_NC));

       # and direct with backend

julia> zarr = zopen("A.zarr").arrays["layer"];

julia> nc = NetCDF.open("A.nc", "layer");

       # JIT warmup

julia> zarr[1, :, :] == nc[1, :, :]
true

julia> zarr[2, 1:10, 1:100] == nc[2, 1:10, 1:100]
true

julia> ZARR[X(3)] == NC[X(3)]
true

julia> ZARR[X(4), Y(1:2), Z(1:2)] == NC[X(4), Y(1:2), Z(1:2)]
true

julia> ZARR[X(At(-8))] == NC[X(At(-8))]
true

julia> ZARR[X(At(-7)), Y(At(-100:-99)), Z(At(-1000:-99))] ==
               NC[X(At(-7)), Y(At(-100:-99)), Z(At(-1000:-99))]

       # now time it; a different chunk each time so shouldn't be cached?
true

julia> @time zarr[7, :, :];
  0.004708 seconds (3.54 k allocations: 1.637 MiB)

julia> @time nc[7, :, :];
  0.003594 seconds (32 allocations: 782.344 KiB)

julia> @time zarr[8, 1:10, 1:100];
  0.000084 seconds (61 allocations: 26.539 KiB)

julia> @time nc[8, 1:10, 1:100];
  0.000072 seconds (14 allocations: 8.625 KiB)                         ### fast

julia> @time ZARR[X(9)];
  0.004649 seconds (3.53 k allocations: 1.637 MiB)

julia> @time NC[X(9)];
  0.004894 seconds (169 allocations: 793.188 KiB)                     ### fast

julia> @time ZARR[X(10), Y(1:100), Z(1:1000)];
  0.005280 seconds (3.53 k allocations: 1.637 MiB)

julia> @time NC[X(10), Y(1:100), Z(1:1000)];  
  0.004639 seconds (161 allocations: 793.031 KiB)                           ### fast

julia> @time ZARR[X(At(-2))];
  0.004922 seconds (3.53 k allocations: 1.637 MiB)

julia> @time NC[X(At(-2))]; 
  0.004810 seconds (169 allocations: 793.188 KiB)                       ### fast

julia> @time ZARR[X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))];
  0.011309 seconds (108.19 k allocations: 19.337 MiB)

julia> @time NC[X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))];
 71.847494 seconds (15.93 M allocations: 1.158 GiB, 0.09% gc time)         ### slow :(

julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_PROJECT = @.
  JULIA_EDITOR = vi

(jl_mNZQc0) pkg> st
Status `/private/var/folders/s5/8d629n5d7nsf37f60_91wzr40000gq/T/jl_mNZQc0/Project.toml`
  [0703355e] DimensionalData v0.25.8
  [3c3547ce] DiskArrays v0.3.23
  [30363a11] NetCDF v0.11.7
  [90b8fcef] YAXArrayBase v0.6.1
  [c21b50f5] YAXArrays v0.5.3
  [0a941bbe] Zarr v0.9.2
rafaqz commented 7 months ago

Probably not a DD problem, here we just resolve the dims and selectors to regular indices and pass them to the parent array.

YAX/NetCDF/DiskArrays.jl do the actual reading

@meggart may know more about that part

(also just try it with Int/colon on the parent array so DD is not involved. DD.dims2indices(A, I) will get you the resolved infices)

bjarthur commented 7 months ago

can you elaborate on DD.dim2indices? i'm not following...

bjarthur commented 7 months ago

also, you see above how i showed it's fast without At and slow with it? wouldn't that indicate it is a DD problem?

rafaqz commented 7 months ago

Sorry typo dims2indices.

DimensionalData is mostly a layer over AbstractArray indexing. It calls dims2indices on your dimensions and selectors then passes the results to the parent object.

Using inds = DD.dims2indices(NC, (At... etc))) manually to get the resolved indices will show you whats actually happening. Pass the inds as a tuple in the second argument.

Then passing the resulting indices to parent(NC)[inds...] will give you a benchmark wher DD is not involved at all, and hopefully help us assign blame ;)

bjarthur commented 7 months ago

indeed, the problem lies elsewhere:

julia> @time I = DimensionalData.dims2indices(ZARR, (X(At(-1)),))
  0.000007 seconds (1 allocation: 16 bytes)
(12, Colon(), Colon())

julia> @time getindex(ZARR, I...);
  0.006218 seconds (3.53 k allocations: 1.637 MiB)

julia> @time I = DimensionalData.dims2indices(ZARR, (X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))))
  0.000006 seconds (3 allocations: 8.906 KiB)
(12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  991, 992, 993, 994, 995, 996, 997, 998, 999, 1000])

julia> @time getindex(ZARR, I...);
  0.011937 seconds (108.19 k allocations: 19.329 MiB)

julia> @time I = DimensionalData.dims2indices(NC, (X(At(-1)),))
  0.000002 seconds (1 allocation: 16 bytes)
(12, Colon(), Colon())

julia> @time getindex(NC, I...);
  0.001656 seconds (171 allocations: 793.469 KiB)

julia> @time I = DimensionalData.dims2indices(NC, (X(At(-1)), Y(At(-100:-1)), Z(At(-1000:-1))))
  0.000008 seconds (3 allocations: 8.906 KiB)
(12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  991, 992, 993, 994, 995, 996, 997, 998, 999, 1000])

julia> @time getindex(NC, I...);
 77.239059 seconds (15.93 M allocations: 1.158 GiB, 0.08% gc time)

thanks! will keep digging...

rafaqz commented 7 months ago

Probably NetCDF (DiskArrays.jl) chunk loading of that vector. Likely fixed on DiskArrays.jl main already.

But either way, using At like that is not great. Why not use .. for this and get a range back? A vector throws away the structural infirmation, and At will do hundreds of lookups where .. does 2.

(DiskArrays.jl on main is just smart enough to put the range back together!!!)

felixcremer commented 7 months ago

I am wondering what is the comparison if you use Between for the ranges instead of At. Could we run into multiple scalar indexing?

rafaqz commented 7 months ago

Between is deprecated for ..

bjarthur commented 7 months ago

tried to debug with the master branch of all related packages and there are unsatisfiable requirements in the versions:

(TXYZTranscriptomeGUI) pkg> activate --temp
  Activating new project at `/var/folders/s5/8d629n5d7nsf37f60_91wzr40000gq/T/jl_hzHUzN`

(jl_hzHUzN) pkg> dev YAXArrays YAXArrayBase DimensionalData Zarr NetCDF DiskArrays
     Cloning git-repo `https://github.com/JuliaDataCubes/YAXArrayBase.jl.git`
     Cloning git-repo `https://github.com/JuliaGeo/NetCDF.jl.git`
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package DiskArrays [3c3547ce]:
 DiskArrays [3c3547ce] log:
 ├─possible versions are: 0.4.0 or uninstalled
 ├─restricted to versions 0.3 by NetCDF [30363a11] — no versions left
 │ └─NetCDF [30363a11] log:
 │   ├─possible versions are: 0.11.7 or uninstalled
 │   └─NetCDF [30363a11] is fixed to version 0.11.7
 └─DiskArrays [3c3547ce] is fixed to version 0.4.0
rafaqz commented 7 months ago

You may need to edit the Project.toml of NetCDF.jl to live on the edge like that...

And be warned, we are talking about changes merged in the last few days, here be dragons.

And the question remains: why use At like that at all, even on the old versions? Why not .. ?

bjarthur commented 7 months ago

i use At in the production code with a vector, that might have missing indices in the middle. just simplified it here for debugging purposes to use a contiguous range.

bjarthur commented 7 months ago

the netcdf file seems to be reopened for each chunk as it iterates over them! specifically, this line here:

https://github.com/meggart/DiskArrays.jl/blob/v0.3.23/src/batchgetindex.jl#L91

calls this line

https://github.com/JuliaDataCubes/YAXArrayBase.jl/blob/master/src/datasets/netcdf.jl#L29

bjarthur commented 7 months ago

not sure why the netcdf code can't mimic the zarr code, where eachchunk() could be defined as DiskArrays.eachchunk(a::NcVar) = DiskArrays.GridChunks(a,a.chunksize). the chunk size is stored in both structs:

julia> zarr.metadata.chunks
(1, 10, 100)

julia> nc.chunksize
(100, 10, 1)
rafaqz commented 7 months ago

Maybe try a PR changing it?

Also closing as this is really way outside of DDs sphere if influence ;)

felixcremer commented 7 months ago

There should be branches on all relevant packages for the DiskArrays break. Lets get them merged and revisit this next week and we can reopen this on DiskArrays or YAXArrays as needed.