Open Alexander-Barth opened 2 years ago
In Julia generally only vectors can grow. Multidimensional arrays can't. And with a vector, setindex! can't do this either.
I would argue that setindex! shouldn't change the array size like this for any <: AbstractArray. It breaks the ability to wrap the array in a sane way, e.g. with a Raster wrapper, as setindex! can no longer be forwarded to the parent with predictable results.
If we want to grow AbstractArrays we should define another method that can be used by users and overridden by other packages so the growing can be cleanly handled.
I tend to agree in general, except for the first point. For example in Zarr.jl one can resize every array along any dimension, but this does not happen through setindex! but via a separate resize! function. However, I don't think DiskArrays should prescribe the behavior of setindex! for other packages, and if there is a non-breaking way to implement this, possibly through adding a trait similar to what @Alexander-Barth describes that queries which dimensions should be excluded from bounds checking during setindex!, I guess this would be in the scope of this package.
For backwards compatibility I would make the fixed size behavior the default.
Indeed, for a "plain" DiskArray, fixedsize(::DiskArray) would always return true. But for NCDatasets.AbstractVariable (my base type for variables on disk) such a function would return false and allow the array stored on disk to be resized.
It seems to me then that all packages would be able to retain backward compatibility. I think also that the compiler should be able to specialize for both cases.
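A minimal sketch of what such a trait could look like. All the names here (FakeDiskArray, fixedsize, maywrite, the two concrete types) are hypothetical illustrations, not the DiskArrays.jl or NCDatasets.jl API:

```julia
# Sketch of a fixed-size trait: arrays are fixed-size unless they opt out.
abstract type FakeDiskArray{T,N} end

fixedsize(::FakeDiskArray) = true          # default: size cannot change

# A "plain" disk array keeps the default.
struct PlainArray{T,N} <: FakeDiskArray{T,N}
    dims::NTuple{N,Int}
end

# An NCDatasets-style variable with an unlimited dimension opts out.
struct UnlimitedVariable{T,N} <: FakeDiskArray{T,N}
    dims::NTuple{N,Int}
end
fixedsize(::UnlimitedVariable) = false

# Writes past the end are only legal when the array is growable.
# Since fixedsize is known from the type, the compiler can specialize this.
maywrite(A::FakeDiskArray, i::Int) = !fixedsize(A) || 1 <= i <= prod(A.dims)
```

Because fixedsize dispatches on the type, both branches of maywrite constant-fold away, which is the specialization Alexander-Barth mentions above.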
@meggart I didn't mean that we should never grow arrays, but that it's not the standard API for AbstractArray. And that means we should use a method like your resize! to be clear that it's different, and have that in DiskArrays.jl so other packages can extend it, or not if they don't want to.
The problem with using setindex! is there is no way to tell that it grows the array at compile time, because the arguments are the same as any other setindex! call. So we would need to check return values to see if a parent array has maybe grown larger. Or we would need to define internal setindex methods in DiskArrays like setindex_mayberesize!.
Adding runtime conditionals to check for arbitrary behaviours is unusual in Julia, and something I would rather avoid.
So we would need to check return values to see if a parent array has maybe grown larger.
You already need to do that with current NetCDF/DiskArrays:
using NetCDF
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NcVar("obs",[lon_dim,time_dim]);
NetCDF.create("newfile.nc",v) do nc
@show size(v)
nc["obs"][:,1:2] = rand(3,2)
@show size(v)
nc["obs"][:,1:3] = rand(3,3)
@show size(v)
@show supertype(typeof(v))
end;
output:
size(v) = (3, 0)
size(v) = (3, 2)
size(v) = (3, 3)
supertype(typeof(v)) = DiskArrays.AbstractDiskArray{Float64, 2}
I don't use NetCDF.jl and didn't realise that was possible. But it's also pretty bad.
The average Julia user can't tell from reading your code that the size changes - you actually need to show the size like you have to make what's happening clear.
I'm arguing something like resize! should be the standard way to do this, rather than setindex!. Then we would have an explicit method for growing arrays that we can use everywhere, and what's happening is completely transparent to both users and other packages.
@Alexander-Barth (after a productive discussion) I just want to point you to the chunking code in DiskArrays.jl:
This is an immutable struct that has the size of the axis built into it (s). You would have to update this object on every setindex! and your array struct would need to be mutable to facilitate this, or hold the chunks object in a Ref field.
If it was in resize! we could just always update this, and it could never happen with setindex!, so not implementing resize! would mean a wrapper package can ignore this possibility completely.
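A rough sketch of that idea: the chunk layout is an immutable value cached in a Ref, and only an explicit grow step rebuilds it, so setindex!-style writes never touch it. ChunkGrid, GrowableAxis, and grow! are made-up names standing in for the proposed resize! mechanics:

```julia
# Immutable chunk layout with the axis size baked into the value,
# mimicking how DiskArrays' chunk objects carry the axis size.
struct ChunkGrid
    axissize::Int      # size of the (growable) axis
    chunksize::Int
end
nchunks(c::ChunkGrid) = cld(c.axissize, c.chunksize)

# The array-like struct holds the chunks in a Ref so the struct itself
# can stay immutable while the cached layout can still be swapped.
struct GrowableAxis
    chunks::Base.RefValue{ChunkGrid}
end
GrowableAxis(len::Int, chunk::Int) = GrowableAxis(Ref(ChunkGrid(len, chunk)))

# Only the explicit grow step rebuilds the chunk layout; plain writes
# never do, so wrappers know exactly when the layout can change.
function grow!(a::GrowableAxis, newlen::Int)
    a.chunks[] = ChunkGrid(newlen, a.chunks[].chunksize)
    return a
end
```

With this split, a wrapper that never calls grow! can safely assume the cached chunk layout is stable.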
Hey, @meggart thanks for the nice presentation today :-)
One use case that I have is that the NetCDF file is growing while a numerical model is running. The runs can take hours (or days) and it is good to check the results while the model is running.
If we cache the size of an array, on the julia side it might get outdated. Even if there is a special API to detect a growing file on the julia side, there will still be the issue that the file is growing because a different process writes to it.
In NCDatasets, I overload size(A), which interrogates the C API. So size should always be correct (at the time it was called). If a user of NCDatasets caches the size of an array, then the risk is that an operation is done on a subset of the data (but this is ok for me). Probably NetCDF.jl works similarly.
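The pattern described above can be sketched like this, with a Ref standing in for the netCDF C API query that NCDatasets actually performs (OnDiskVariable and backend_dims are invented names for illustration):

```julia
# Sketch: size is looked up from the backend on every call instead of
# being cached in the Julia wrapper. The Ref mocks the on-disk state that
# another process (or the C library) may change under our feet.
struct OnDiskVariable{T,N} <: AbstractArray{T,N}
    backend_dims::Base.RefValue{NTuple{N,Int}}  # mock "file on disk"
end

Base.size(v::OnDiskVariable) = v.backend_dims[]          # always fresh
Base.getindex(v::OnDiskVariable{T}, i::Int) where {T} = zero(T)  # stub read

v = OnDiskVariable{Float64,2}(Ref((3, 0)))
size(v)                      # (3, 0): empty unlimited dimension
v.backend_dims[] = (3, 2)    # another process grows the file
size(v)                      # (3, 2): reflected without cache invalidation
```

Because nothing in the wrapper caches the dimensions, there is no stale state to invalidate when a different process appends to the file.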
@rafaqz The other day, you asked me how to query if a dimension is unlimited in NCDatasets. Here is basically how you can do that:
using NCDatasets
ds = NCDataset("/tmp/test3.nc","c")
defDim(ds,"lon",10)
defDim(ds,"lat",10)
defDim(ds,"time",Inf)
unlimited(ds.dim)
# returns ["time"]
The function unlimited returns a list of strings with the names of the unlimited dimensions.
Thanks.
Although I think it's not useful for Rasters.jl in the end because of the chunking problem.
We need a syntax like this that rebuilds everything (which has no real performance cost):
newraster = resize!(raster, i, j, k)
Then there is no problem with any of these packages and very little code has to change. We just handle the rebuild in that method.
As I see the current state, DiskArrays.jl has some support for growable arrays. For an array data of dimensions lon x lat x time (where lon and lat are 10 elements long and time would be unlimited and initially zero in length), this would work: data[:,:,1:10] = zeros(10,10,10), but not this: data[:,:,:] = zeros(10,10,10).
I would like to have @meggart's opinion on whether the currently implemented support for unlimited dimensions is something that is intended to stay in DiskArrays (or whether we don't know yet).
I really would like to have this cached array functionality of DiskArrays in NCDatasets (even if I would need to make a breaking change for this, while still keeping DiskArrays' current level of support for unlimited dimensions) and reuse the code developed in DiskArrays for this.
As I see the current state, DiskArrays.jl has some support for growable arrays
This was never followed up, but to clarify: DiskArrays has never had support for growing arrays, it just didn't actively stop you (and still doesn't).
But the chunking has always been broken after the array grows.
A summary of the problems with setindex! working and returning correct arrays here is:
- setindex! breaks disk array chunking.
- Fixing that means making the array structs mutable and updating the chunks object in every setindex! call.
- setindex! is generalised here for wrappers, so for it not to break when e.g. a SubDiskArray wraps the array and we call setindex!, we need to update the chunks in all wrapping arrays.
- We would also need to check that setindex! calls are inbounds at every stage to disallow the behavior.
But we really need to resolve this problem for the ecosystem to move forwards; there is a lot of work hanging on this now.
So, I've been thinking about other ways to instead make NCDatasets.jl setindex! still grow the array without having to switch everything here to being mutable. And it is possible:
- Variable and CFVariable must be mutable and probably wrap another disk array.
- We add a resize! interface here that updates chunks when the array size changes, and returns the whole array with updated chunks rather than the region that was set. Or just a resize method that resizes the chunks, but then NCDatasets.jl would have to handle updating.
- setindex!(::Variable, args...) methods have a check for out-of-bounds access.
- NCDatasets.setindex! calls DiskArrays.resize! instead of setindex! to write.
This will break if there are outer wrappers (like ReshapedArray) around the Variable, because their chunks won't update at the same time. But we can just document the limitations.
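The flow proposed here could look roughly like the following. Everything is a stand-in: MockVariable mocks a disk-backed variable, a Vector mocks the file, and grow! mocks the proposed DiskArrays.resize!; none of this is real NCDatasets or DiskArrays code:

```julia
# A mutable wrapper whose setindex! grows the parent through an explicit
# grow step, instead of growing implicitly inside the write itself.
mutable struct MockVariable
    parent::Vector{Float64}   # stands in for the array on disk
    len::Int                  # current logical length
end

# Stand-in for the proposed resize! interface: this is the ONE place
# where the size (and, in the real design, the chunk layout) changes.
function grow!(v::MockVariable, newlen::Int)
    append!(v.parent, zeros(newlen - v.len))
    v.len = newlen
    return v
end

# setindex! checks for out-of-bounds access and delegates growth to grow!,
# then writes to the parent as usual.
function Base.setindex!(v::MockVariable, x, i::Int)
    i > v.len && grow!(v, i)
    v.parent[i] = x
end

v = MockVariable(zeros(2), 2)
v[5] = 1.0   # out-of-bounds write triggers the explicit grow step
```

The point of the indirection is that wrapper types only need to care about one well-defined growth entry point instead of auditing every write.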
@meggart can you see any problems with that? And @Alexander-Barth, would that work for you?
I'm happy to implement resize/resize! here to get this moving.
I don't know, this all sounds quite complicated. There is currently no special case for growing a NetCDF variable in NCDatasets (and presumably also not in NetCDF.jl), as all of this is handled transparently by the C library. I am not familiar with DiskArrays, but your proposal sounds like NCDatasets would need to modify internal (non-documented) fields of DiskArrays.
Do you need other functionality besides the iterator eachchunk and haschunks (at least initially)?
Maybe a different approach would be to have an abstract type, with these methods associated with it, defined in a minimal module.
For eachchunk, it would also be nice to have the method eachchunk(array, maximum_memory_for_each_chunk_in_bytes) (to avoid using a global variable).
I am also wondering if eachchunk should return the indices of the chunk or the chunk itself (as a view to the parent array).
I have already some prototype code for this for NCDatasets.
The advantage I can see, is that I am the only one who has to deal with resizable arrays (and I am not forcing this complexity to anybody else).
I don't know, this all sounds quite complicated.
This is maybe an order of magnitude less complicated than supporting this generically in DiskArrays! Hopefully you can understand that from our perspective too. If you are happy to have broken chunks after setindex! you don't need to do it at all, but it feels to me like sooner or later that will cause issues for users who expect it to work properly.
To help we can write a method in DiskArrays.jl to rebuild the chunks with the new indices, so you will just have to update the chunks field of your variable with something like:
v.chunks = DiskArrays.resize_chunks(v.chunks, I...)
This could return a new GridChunks object. Would that help?
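The resize_chunks method suggested above does not exist yet; a sketch of what it could compute is below. The helper names and the representation of chunks as per-axis vectors of ranges are assumptions for illustration, not the real GridChunks internals:

```julia
# Split one axis of length `len` into ranges of at most `chunk` elements.
chunk_ranges(len::Int, chunk::Int) =
    [i:min(i + chunk - 1, len) for i in 1:chunk:len]

# Rebuild per-axis chunk ranges for new axis lengths, keeping the same
# chunk sizes - roughly what a resize_chunks(v.chunks, I...) would do.
resize_chunks(chunksizes::NTuple{N,Int}, newdims::NTuple{N,Int}) where {N} =
    map(chunk_ranges, newdims, chunksizes)

# e.g. growing an axis chunked by 4 from length 8 to 10:
resize_chunks((4,), (10,))   # ([1:4, 5:8, 9:10],)
```

The last chunk is allowed to be ragged, matching how chunked storage formats handle axis lengths that are not multiples of the chunk size.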
This method is mostly used as a fallback, as the chunks are precomputed:
eachchunk(array,maximum_memory_for_each_chunk_in_bytes)
But in that case it's a good idea; maybe make an issue to track it?
Do you need other functionality besides the iterator eachchunk and haschunks (at least initially)?
I'm not totally sure in what context you mean here. But I think all that needs to be updated in your setindex! is what you will return from eachchunk, which means a new GridChunks object, which our resize_chunks or whatever method can supply for you, or you can handle that manually. You could also calculate it lazily whenever eachchunk is called, but I'm not sure what the overheads of that will be.
I am also wondering if eachchunk should return the indices of the chunk or the chunk itself (as a view to the parent array).
eachchunk returns a GridChunks object which is an iterator over the chunk indices, rather than over the data. It's a widely used API now, so we can't change it without breaking all the other packages.
As a follow up of the discussion (https://github.com/Alexander-Barth/NCDatasets.jl/issues/79), I would like to know if unlimited dimensions (i.e. growable) could be supported by DiskArrays. I hope it is OK for you that I make a separate issue here. It is easier for me to track what is already done and what still needs to be discussed.
The last proposition by Fabian has been:
I think this would be a good idea. One could also annotate the type DiskArray with an additional parameter whether its size is fixed or not.
It is decided during creation of a DiskArray if the array can grow or not, and all subsequent calls can use A[index] = value rather than setindex(A,value,index,checkbounds=true/false). The compiler should be able to specialize for fixed and growable arrays (there could even be two definitions of setindex! for the two cases).
CC @rafaqz, @Balinus, @visr
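Encoding growability as a type parameter, as proposed above, could be sketched like this (ParamArray and needs_boundscheck are hypothetical names; the real DiskArray type takes different parameters):

```julia
# Growability is part of the type, so methods can be written once per case
# and the compiler picks (and fully specializes) the right one.
struct ParamArray{T,N,Growable}
    dims::NTuple{N,Int}
end

# Fixed-size arrays need the usual bounds check on writes...
needs_boundscheck(::ParamArray{T,N,false}) where {T,N} = true
# ...growable arrays skip it so out-of-bounds writes can grow the array.
needs_boundscheck(::ParamArray{T,N,true}) where {T,N} = false
```

This is the "two definitions of setindex!" idea: instead of one method branching at runtime, each variant gets its own method selected by dispatch at compile time.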