meggart / DiskArrays.jl


Don't load whole chunk when you don't need to, e.g. with NetCDF #136

Closed rafaqz closed 6 months ago

rafaqz commented 7 months ago

We can likely get around this problem by skipping batchgetindex for small indexing operations. We may also need to facilitate easy backend customisation of this behaviour.

Originally posted by @Alexander-Barth in https://github.com/meggart/DiskArrays.jl/issues/131#issuecomment-1777553274

In the worst-case scenario (reading a time series from spatially chunked data), one would read a single value from each chunk:

Test with NCDatasets 0.12.17:

using BenchmarkTools
using NCDatasets

sz = (1024,1024,64)
ds = NCDataset("/tmp/tmp.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])

defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)

ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);

@btime ds["data"][1,1,1];
@btime ds["data"][:,:,1];

@btime ds["datac"][1,1,1];
@btime ds["datac"][:,:,1];

output:

julia> @btime ds["data"][1,1,1];
  15.069 μs (34 allocations: 1.62 KiB)

julia> @btime ds["data"][:,:,1];
  1.014 ms (35 allocations: 8.00 MiB)

julia> @btime ds["datac"][1,1,1];
  13.518 μs (34 allocations: 1.62 KiB)

julia> @btime ds["datac"][:,:,1];
  470.025 μs (35 allocations: 8.00 MiB)

So reading an entire chunk (when you do not need it) is about 67 times slower for uncompressed data and about 36 times slower for compressed data. But @btime reads the same data multiple times, so the netCDF library is most likely caching it. Some applications might indeed read the same data multiple times, but this is less typical. If the data is read for the first time, the differences are not as large:

sz = (1024,1024,64)
isfile("/tmp/tmp3.nc") && rm("/tmp/tmp3.nc")
ds = NCDataset("/tmp/tmp3.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])

defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)

ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);
close(ds)

ds = NCDataset("/tmp/tmp3.nc")
# warm-up
ds["data"][1,1,1];
ds["data"][:,:,2];

@time ds["data"][1,1,3];
@time ds["data"][:,:,4];

@time ds["datac"][1,1,3];
@time ds["datac"][:,:,4];

output:

  0.001231 seconds (34 allocations: 1.625 KiB)
  0.001682 seconds (35 allocations: 8.002 MiB)
  0.031734 seconds (34 allocations: 1.625 KiB)
  0.038516 seconds (35 allocations: 8.002 MiB)

So the difference is about 20 % in this case.

For:

@time ds["data"][512,512,3];
@time ds["data"][:,:,4];

@time ds["datac"][512,512,3];
@time ds["datac"][:,:,4];

I get (loading the data for the first time):

  0.026237 seconds (36 allocations: 1.656 KiB)
  0.045077 seconds (35 allocations: 8.002 MiB)
  0.033074 seconds (36 allocations: 1.656 KiB)
  0.038558 seconds (35 allocations: 8.002 MiB)

If the data is in the libnetcdf chunk cache (using @btime), I get:

julia> @btime ds["data"][512,512,3];
  15.549 μs (36 allocations: 1.66 KiB)

julia> @btime ds["data"][:,:,4];
  421.606 μs (35 allocations: 8.00 MiB)

julia> @btime ds["datac"][512,512,3];
  15.361 μs (36 allocations: 1.66 KiB)

julia> @btime ds["datac"][:,:,4];
  417.527 μs (35 allocations: 8.00 MiB)

Feel free to run other benchmarks with other indices. Another use case is loading data over the network via OPeNDAP or HTTP byte-range requests; there you certainly do not want to transfer more data than necessary.
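
As a rough, self-contained sketch of the "skip batchgetindex for small indexing operations" idea quoted at the top of this issue (the threshold, helper names, and dispatch below are all hypothetical, not DiskArrays API):

# Hypothetical sketch only: skip the chunk-batching machinery when the request
# is small and let the backend resolve the exact indices itself.
const MAX_DIRECT_ELEMENTS = 10_000   # arbitrary cut-off, for illustration

# stand-ins for "ask the backend directly" vs. "read whole chunks, then subset";
# both just index a plain Array here so the sketch runs on its own
read_direct(a, I...)  = a[I...]
read_batched(a, I...) = a[I...]

function getindex_smallaware(a, I...)
    nrequested = prod(i -> length(i), I)
    if nrequested <= MAX_DIRECT_ELEMENTS
        read_direct(a, I...)    # e.g. one exact libnetcdf read, no full chunk
    else
        read_batched(a, I...)   # current behaviour: whole chunks, subset in memory
    end
end

getindex_smallaware(rand(64, 64, 8), 1, 1, 1)   # tiny request takes the direct path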

rafaqz commented 7 months ago

@Alexander-Barth I think a solution to this is simply adding a trait for the different backend behaviours.

Like DiskArrays.hasinternalcache(x) = false, which NCDatasets.jl can then set to true.
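
For concreteness, a minimal sketch of that trait idea (the default and the opt-in below are illustrative only, not actual DiskArrays or NCDatasets code):

# Sketch: the default says "no internal cache", so DiskArrays keeps batching
# reads by chunk; a backend that caches decompressed chunks itself can opt out.
hasinternalcache(::Any) = false            # proposed DiskArrays default

struct FakeCachedVariable end              # stand-in for e.g. an NCDatasets variable
hasinternalcache(::FakeCachedVariable) = true

hasinternalcache(FakeCachedVariable())     # true
hasinternalcache(rand(3))                  # false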

Alexander-Barth commented 7 months ago

I am not sure this issue is really separate from https://github.com/meggart/DiskArrays.jl/issues/135. In both cases, we load the whole chunk when only a subset is asked for. It is my understanding that for HDF5, the netCDF library caches decompressed chunks, but I do not think this is the case for NetCDF 3 or OPeNDAP.

If, when a user asks for ds["datac"][:,:,4], DiskArrays would just call DiskArrays.readblock!(a,aout,:,:,4), we would be able to solve both issues and make the interface clearer, in my opinion.
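
A toy illustration of that suggestion, with made-up names and a plain Array standing in for data on disk (the real DiskArrays.readblock! works on unit ranges, so this only shows the shape of the idea, not the actual interface):

struct ToyDiskVariable{T,N}
    data::Array{T,N}        # pretend this lives on disk or behind OPeNDAP
end

# the backend reads exactly the requested indices, nothing more
function readblock_direct!(v::ToyDiskVariable, aout, I...)
    aout .= view(v.data, I...)
    return aout
end

# getindex allocates the output and forwards the indices untouched,
# instead of expanding them to whole chunks first
function getindex_direct(v::ToyDiskVariable, I...)
    aout = similar(v.data, map(length, I))
    return readblock_direct!(v, aout, I...)
end

v = ToyDiskVariable(rand(16, 16, 8))
getindex_direct(v, 1:16, 1:16, 4:4);   # reads one slice only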

rafaqz commented 7 months ago

They may not be different for you or NCDatasets.jl... but they are in general, because Zarr needs a sparse-indexing optimisation for indices with large gaps between them, yet must also load whole blocks at a time.

So from your netCDF perspective they are solved in the same way, but for Zarr or GDAL they need different, actually opposite, strategies.

This is why we need a trait to split up these two modes of disk array.

Alexander-Barth commented 7 months ago

I guess it is just me not understanding what will be different when setting DiskArrays.hasinternalcache(x) to true or false. When it is false, read operations will always load complete chunks, and when it is true we will only load what is actually requested?

rafaqz commented 7 months ago

That, and that we would manually split sparse indices into the specific required chunks, vs. e.g. just letting netCDF handle that.
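
Roughly, the split between the two modes could look like this sketch (all names are invented here for illustration; the actual mechanism landed later as the batch strategies in #146):

abstract type CacheTrait end
struct HasInternalCache <: CacheTrait end    # backend caches decompressed chunks itself
struct NoInternalCache  <: CacheTrait end    # DiskArrays must avoid re-reading chunks

cachetrait(::Any) = NoInternalCache()

# backend with its own cache: hand the (possibly sparse) indices straight through
read_indices(::HasInternalCache, a, I...) = a[I...]

# no cache: group the indices by the chunks they touch and read chunk-wise, so each
# chunk is decompressed only once (that grouping is what batchgetindex does; here it
# is just a plain read to keep the sketch runnable)
read_indices(::NoInternalCache, a, I...) = a[I...]

read_indices(a, I...) = read_indices(cachetrait(a), a, I...)

read_indices(rand(10, 10), 1:3, 5)   # takes the NoInternalCache path by default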

Alexander-Barth commented 7 months ago

Reading a variable with a non-unit stride (e.g. 1:2:10) was solved quite elegantly at the time by @meggart using multiple dispatch. See the discussion here:

https://github.com/Alexander-Barth/NCDatasets.jl/issues/79#issuecomment-588870602

Maybe we would need a similar approach for indices given as AbstractVectors? If a client library does not provide a way to load an array with vectors as indices, then the current approach of loading whole chunks is used. But if a client library provides such a function (DiskArrays.readblock!), then use that.

It was not necessary to implement a function like strided_getindex. So maybe batchgetindex can remain an internal function?
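
A bare-bones sketch of that dispatch idea (illustrative only; these are not functions that DiskArrays or NCDatasets define):

# Pick a read path per index type via dispatch: backends that can read strided or
# vector indices natively add the specialised methods, otherwise a generic fallback
# (e.g. reading whole chunks) is used.
load_index(a, i::AbstractUnitRange{<:Integer}) = a[i]                # contiguous read
load_index(a, i::StepRange{<:Integer})         = a[i]                # backend supports strides
load_index(a, i::AbstractVector{<:Integer})    = [a[j] for j in i]   # generic fallback

load_index(rand(100), 1:10)        # contiguous method
load_index(rand(100), 1:2:9)       # strided method
load_index(rand(100), [1, 5, 42])  # vector-of-indices fallback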

meggart commented 6 months ago

I think this is solved now through #146. When you set the batch strategy to SubRanges with a density threshold of 1.0, you get exactly the behaviour that NCDatasets implements right now, i.e. no more data is ever read than necessary. This example from the unit tests shows exactly which readblock! calls will be executed for the given indices:

a_inner = rand(100)
inds_sorted = [1,1,1,3,5,6,6,7,10,13,16,16,19,20]
inds_unsorted = [7, 5, 1, 16, 1, 10, 20, 6, 19, 1, 13, 6, 3, 16]
a = AccessCountDiskArray(a_inner,chunksize=(5,),batchstrategy=DiskArrays.SubRanges(CanStepRange(),0.8));
b1 = a[inds_sorted];
@test b1 == a_inner[inds_sorted]
@test sort(getindex_log(a)) == [(1:2:5,), (6:7,), (10:3:19,), (20:20,)]
empty!(a.getindex_log)
b2 = a[inds_unsorted]
@test b2 == a_inner[inds_unsorted]
@test sort(getindex_log(a)) == [(1:2:5,), (6:7,), (10:3:19,), (20:20,)]

I have made an NCDatasets branch where this trait is implemented for a Variable: https://github.com/meggart/NCDatasets.jl/tree/fg/diskarrays_04. @Alexander-Barth, is this OK, or do you have other test cases in mind that we might not yet have covered?