meggart / DiskArrays.jl

Other
66 stars 12 forks source link

unlimited dimensions in NetCDF #53

Open Alexander-Barth opened 2 years ago

Alexander-Barth commented 2 years ago

As a follow up of the discussion (https://github.com/Alexander-Barth/NCDatasets.jl/issues/79), I would like to know if unlimited dimensions (i.e. growable) could be supported by DiskArrays. I hope you it is OK for you that I make a separate issue here. It is easier for me to track what is already done and what still needs to be discussed.

The last proposition by Fabian has been:

Maybe it would help if I added a keyword argument checkbounds=false to the DiskArrays setindex implementation. So when you call setindex! in your DimensionallArrays code you can explicitly enforce bounds checking. In addition we can still discuss what the default behavior for any DiskArray should be...

I think this would be a good idea.

One could also annotate the type DiskArray with an additional parameter whether its size is fixed or not:

using Base: setindex!, size

struct FixedSize; end
struct MyArray{T,N,TFS} <: AbstractArray{T,N}; end

Base.size(A::MyArray) = (10,)

function fixedsize(::MyArray{T,N,TFS}) where {T,N,TFS}
    return TFS == FixedSize
end

function Base.setindex!(A::MyArray,value,i::Int)
    if fixedsize(A) && (i > 10)
        error("you can't do that: the array has a fixed size")
    end
    println("ok ",i)
end

# growable array
A = MyArray{Int,1,Nothing}()
A[5] = 1
A[15] = 1

# fixed-size array
Af = MyArray{Int,1,FixedSize}()
Af[5] = 1
Af[15] = 1 # error

It is decided during creation of a DiskArray if the array can grow or not and all subsequent calls can use A[index] = value rather than setindex(A,value,index,checkbounds=true/false). The compile should be able to specialize for fixed and growable arrays (there could even be two definitions of setindex! for the two cases).

CC @rafaqz, @Balinus, @visr

rafaqz commented 2 years ago

In Julia generally only vectors can grow. Multidimensional arrays can't. And with a vector, setindex! cant do this either.

I would argue that setindex! shouldn't change the array size like this for any <: AbstractArray. It breaks the ability to wrap the array in a sane way, e.g. with a Raster wrapper, as setindex! can no longer be forwarded to the parent with predictable results.

If we want to grow AbstractArrays we should define another method that can be used by users and overridden by other packages so the growing can be cleanly handled.

meggart commented 2 years ago

In Julia generally only vectors can grow. Multidimensional arrays can't. And with a vector, setindex! cant do this either.

I would argue that setindex! shouldn't change the array size like this for any <: AbstractArray. It breaks the ability to wrap the array in a sane way, e.g. with a Raster wrapper, as setindex! can no longer be forwarded to the parent with predictable results.

If we want to grow AbstractArrays we should define another method that can be used by users and overridden by other packages so the growing can be cleanly handled.

I tend to agree in general, except for the first point. For example in Zarr.jl one can resize every array along any dimension, but this does not happen through setindex! but via a separate resize! function. However, I don't think DiskArrays should prescribe the behavior of setindex! for other packages and if there is a non-breaking way to implement this, possibly through adding a trait similar to what @Alexander-Barth describes that queries which dimensions should be excluded from bounds checking during setindex!, I guess this would be in the scope of this package.

For backwards compatibility I would make the fixed size behavior the default.

Alexander-Barth commented 2 years ago

Indeed, for "plain" DiskArray, fixedsize(::DiskArray) would return always true. But for NCDatasets.AbstractVariable (my base type for variables on disk) such a function would then return false and allow the array stored on disk to be resized.

It seems to me then that all packages would be able to retain backward compatibility. I think also that the compiler should be able to specialize for both cases.

rafaqz commented 2 years ago

@meggart I didnt mean that we should never grow arrays, but thats its not the the standard API for AbstractArray. And that means we should use a method like your resize! to be clear that its different, and have that in DiskArrays.jl so other packages can extend it, or not if they dont want to.

The problem with using setfield! is the is no way to tell it it grows the array at compile time because the arguments are the same as any other setfield! call. So we would need to check return values to see if a parent array has maybe grown larger. Or we will need to define internal setfield methods in disk arrays like setfield_mayberesize!.

Adding runtime conditionals to check for arbitrary behavious is unusual in Julia, and something I would rather avoid.

Alexander-Barth commented 2 years ago

So we would need to check return values to see if a parent array has maybe grown larger.

You already need to do that with current NetCDF/DiskArrays:

using NetCDF
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NcVar("obs",[lon_dim,time_dim]);
NetCDF.create("newfile.nc",v) do nc
   @show size(v)
   nc["obs"][:,1:2] = rand(3,2)
   @show size(v)
   nc["obs"][:,1:3] = rand(3,3)
   @show size(v)
   @show supertype(typeof(v))
end;

output:

size(v) = (3, 0)
size(v) = (3, 2)
size(v) = (3, 3)
supertype(typeof(v)) = DiskArrays.AbstractDiskArray{Float64, 2}
rafaqz commented 2 years ago

I dont use NetCDF.jl and didn't realise that was possible. But its also pretty bad.

The average julia user can't tell from reading your code that the size changes - you actually need to show the size like you have to make whats happening clear.

Im arguing something like resize! should be the standard way to do this, rather than setindex!. Then we would have a explicit method for growing arrays that we can use everywhere, and whats happing is completely transparent to both users and other packages.

rafaqz commented 1 year ago

@Alexander-Barth (after a productive discussion) I just want to point you to the chunking code in DiskArrays.jl:

https://github.com/meggart/DiskArrays.jl/blob/8521669f3c7bd48acd0a0ae7817a404cfaa7faee/src/chunks.jl#L19-L23

This is an immutatble struct that has the size of the axis build into it (s). You would have to update this object on every setindex and your array struct would need to be mutable to facilitate this, or hold the chunks object in a Ref field.

If it was in resize! we could just always update this, and it could never happen with setindex! so not implementing resize! would mean a wrapper package can ignore this potential completely.

Alexander-Barth commented 1 year ago

Hey, @meggart thanks for the nice presentation today :-)

One use case that I have, is that the NetCDF file is growing while a numerically model is running. The runs can take hours (or days) and it is good to check the results while the model is running.

If we cache the size of an array, on the julia side it might get outdated. Even if there is a special API to detect a growing file on the julia side, there will still be the issue that the file is growing because a different process writes to it.

In NCDatasets, I overload size(A) which interrogates the C API. So size should always be correct (at the time is was called). If a user of NCDatasets, caches the size of an array, then the risk is an operation is done on a subset of the data (but this is ok for me). Probably NetCDF.jl works similarly.

Alexander-Barth commented 1 year ago

@rafaqz The other day, you asked me how to query if a dimension is unlimited in NCDatasets. Here is basically how you can do that:

using NCDatasets
ds = NCDataset("/tmp/test3.nc","c")
defDim(ds,"lon",10)
defDim(ds,"lat",10)
defDim(ds,"time",Inf)

unlimited(ds.dim)
# returns ["time"]

The function unlimited returns a list of strings with the names of the dimensions.

rafaqz commented 1 year ago

Thanks.

Although I think it's not usefull for Rasters.jl in the end because of the chunking problem.

We need a syntax like this that rebuilds everything (which has no real performance cost):

newraster = resize!(raster, i, j, k)

Then there is no problem with any of these packages and very little code has to change. We just handle the rebuild in that method.

Alexander-Barth commented 1 year ago

As I see the current state, DiskArrays.jl has some support for growable arrays. For an array data of dimensions lon x lat x time (where lon and lat are 10 elements long and time would be unlimited and initially zero in length) this would work data[:,:,1:10] = zeros(10,10,10) but not this data[:,:,:] = zeros(10,10,10).

I would like to have @meggart opinion, whether the currently implemented support of unlimited dimension is something that is intended to stay in DiskArrays (or do we don't know yet).

I really would like to have this cached array functionality of DiskArray in NCDatasets (even if I would need to make a breaking change for this, but still have the DiskArray's current level of support of unlimited dimensions) and reuse the code developed in DiskArray for this.

rafaqz commented 10 months ago

As I see the current state, DiskArrays.jl has some support for growable arrays

This was never followed up, but to clarify DiskArrays has never had support for growing arrays, it just didn't actively stop you (and still doesn't)

But the chunking has always been broken after the array grows.

rafaqz commented 10 months ago

A summary of the problems in setindex! working and returning correct arrays here is:

  1. Changing the array size with setindex! breaks disk array chunking
  2. Fixing it requires making the chunks field mutable and updating the chunks object in every setindex! call.
  3. setindex! is generalised here for wrappers, so for it not to break e.g. a SubDiskArray and we call setindex!, we need to update the chunks in all wrapping arrays.
  4. Otherwise we need wrappers to check all setindex! calls are inbounds at every stage to disallow the behavior.
  5. This package is already hard to understand because lazy chunking is a difficult thing, adding mutability everywhere will make maintenance harder.
  6. Its not following the Base array interface, so we would both be doing more work and breaking widely used Julia standards

But we really need to resolve this problem for the ecosystem to move forwards, there is a lot of work hanging on this now.

So, I've been thinking about other ways to instead make NCDatasets.jl setindex! still grow the array without having to switch everything here to being mutable. And it is possible:

  1. The netcdf Variable and CFVariable must be mutable and probably wrap another diskarray.
  2. we need a resize! interface here that can updates chunks when the array size changes, and returns the whole array with updated chunks rather than the region that was set.
  3. Otherwise a resize method that just resizes the chunks, but then NCDatasets.jl would have to handle updating.
  4. NCDatasets.jl defines its own setindex!(::Variable, args...) methods that has a check for out-of-bounds access.
  5. When the indices are out of bounds, NCDatasets.setindex! calls DiskArrays.resize! instead of setindex! to write.
  6. the return value is used as the new internal disk array, which has correctly updated chunks.

This will break if there are outer wrappers (like ReshapedArray) around the Variable because their chunks wont update at the same time. But We can just document the limitations.

@meggart can you see any problems with that? and @Alexander-Barth would that work for you?

I'm happy to implement resize/resize! here to get this moving.

Alexander-Barth commented 10 months ago

I don't know, this all sounds quite complicated. There is currently no special case for growing a NetCDF variable in NCDatasets (and presumable also not in NetCDF.jl) as all is handled transparently by the C library. I am not familiar with DiskArrays, but your proposal sound like NCDatasets would need to modify internal (non-documented) fields of DiskArrays

Do you need other functionality besides the iterator eachchunk and haschunks (at least initially)?

Maybe a different approach would be to have an abstract type with these methods associated to it defined in a minimal module.

For eachchunk , it would be nice to have also the method eachchunk(array,maximum_memory_for_each_chunk_in_bytes) (to avoid using a global variable).

I am also wondering if eachchunk should return the indices of the chunk or the chunk itself (as a view to the parent array).

I have already some prototype code for this for NCDatasets.

The advantage I can see, is that I am the only one who has to deal with resizable arrays (and I am not forcing this complexity to anybody else).

rafaqz commented 10 months ago

I don't know, this all sounds quite complicated.

This is maybe an order of magnitude less complicated than supporting this generically in DiskArrays! Hopefully you can understand that from our perspective too. If you are happy to have broken chunks after setindex! you don't need to do it at all, but, it feels to me like sooner or later that will cause issues for users who expect it to work properly.

To help we can write a method in DiskArrays.jl to rebuild the chunks with the new indices, so you will just have to update the chunks field of your variable with something like:

v.chunks = DiskArrays.resize_chunks(v.chunks, I...)

Which could return a new GridChunks object. Would that help?

This method is mostly used as a fallback as the chunks are precomputed:

eachchunk(array,maximum_memory_for_each_chunk_in_bytes)

But it that case its a good idea, maybe make an issue for tracking it?

Do you need other functionality besides the iterator eachchunk and haschunks (at least initially)?

I'm not totally sure oin what context you mean here. But I think all that needs to be updated in your setindex! is what you will return from eachchunk, which means a new GridChunks object which our resize_chunks or whatever method can supply for you, or you can handle that manually. You could also calculate it lazily whenever eachchunk is called, but I'm not sure what the overheads of that will be.

I am also wondering if eachchunk should return the indices of the chunk or the chunk itself (as a view to the parent array).

eachchunk returns a GridChunks object which is an iterator over the chunk indices, rather than over the data. Its a widely used API now so we can't change it without breaking all the other packages.