meggart / DiskArrays.jl

DiskArrays.jl vs BlockArrays.jl #23

Open Luapulu opened 3 years ago

Luapulu commented 3 years ago

Hey,

This is probably old news for you, but I just discovered BlockArrays.jl. If I understand correctly, BlockArrays.jl implements many, if not all, of the features we need for DiskArrays.jl.

Does this make DiskArrays.jl obsolete?

Sorry if I'm going through the same thought process you already covered long ago, but maybe this can help.

meggart commented 3 years ago

I think DiskArrays and BlockArrays started with very different goals in mind, although in the end they arrived at partly similar designs.

Originally, DiskArrays' mission was only to get indexing right for a variety of packages that implement AbstractArray behavior for arrays that are mapped to disk (see this thread https://discourse.julialang.org/t/taking-the-array-indexing-interface-seriously/32035). So we started out by unifying the handling of trailing/missing indices, so that not every package had to implement its own complicated getindex/setindex. In principle this should not be a problem, because the AbstractArray interface should take care of it, but unfortunately the interface assumes low-latency random access to the array, which is not true for these arrays.
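To illustrate the indexing problem described above, here is a minimal sketch in plain Julia. It is not code from DiskArrays: `ToyDiskArray`, `readblock`, `torange` and `readindex` are hypothetical names made up for this example, and an in-memory `Array` stands in for the file. The idea is that every supported index pattern (integers, colons, ranges, extra trailing `1`s, omitted trailing singleton dimensions) is normalized to one rectangular range request, so a backend would only have to implement a single block read instead of element-wise access.

```julia
# Hypothetical stand-in for an array stored on disk; `data` plays the role of the file.
struct ToyDiskArray{T,N}
    data::Array{T,N}
    nreads::Base.RefValue{Int}   # counts how often the "disk" is accessed
end
ToyDiskArray(data::Array{T,N}) where {T,N} = ToyDiskArray{T,N}(data, Ref(0))

# The only method a backend would need to provide: read one rectangular block.
function readblock(a::ToyDiskArray, ranges::Tuple)
    a.nreads[] += 1
    return a.data[ranges...]
end

# Turn a single index into a unit range along the axis `ax`.
torange(i::Integer, ax) = i:i
torange(::Colon, ax) = ax
torange(r::AbstractUnitRange, ax) = r

# Normalize the user's indices to exactly N ranges, read once, then drop
# the dimensions that were indexed with an integer or omitted.
function readindex(a::ToyDiskArray{T,N}, inds...) where {T,N}
    sz = size(a.data)
    n = length(inds)
    # Extra trailing indices are only allowed if they are the integer 1.
    for i in N+1:n
        (inds[i] isa Integer && inds[i] == 1) ||
            throw(ArgumentError("trailing index $i must be 1"))
    end
    # Omitted trailing dimensions are only allowed if they have length 1.
    for d in n+1:N
        sz[d] == 1 || throw(ArgumentError("dimension $d is missing an index"))
    end
    ranges = ntuple(N) do d
        d <= n ? torange(inds[d], Base.OneTo(sz[d])) : Base.OneTo(sz[d])
    end
    block = readblock(a, ranges)
    # All-integer indexing returns a single element, as for ordinary arrays.
    all(i -> i isa Integer, inds) && return first(block)
    dropped = Tuple(d for d in 1:N if d > n || inds[d] isa Integer)
    return dropdims(block; dims=dropped)
end
```

With `a = ToyDiskArray(reshape(collect(1:6), 2, 3, 1))`, the calls `readindex(a, 2, 3)`, `readindex(a, :, 2)` and `readindex(a, 1:2, 2:3, 1, 1)` each trigger exactly one `readblock` call, whatever the shape of the index tuple. Linear indexing and vector-of-indices lookups are deliberately left out of this sketch.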

Note that up to this point the package had no notion of chunks. Only when we wanted to go a bit further and implement some mapreduce/broadcast behavior did it become necessary to think about chunks. Here I would have loved to use some kind of interface that an array could implement (e.g. what I proposed in ChunkedArrayBase), but I did not find anything that was general enough and had out-of-core data in mind. I experimented with BlockArrays, but found that its scope was quite different from what we do in DiskArrays.
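A rough sketch of why chunks matter for reductions, again in plain Julia and not taken from DiskArrays: `chunkranges` and `chunked_mapreduce` are hypothetical helpers, the chunk size is passed in by hand rather than queried from the array, and `init` is assumed to be a neutral element of `op`. The array is read one chunk-sized block at a time, and each block is reduced in memory before the next one is fetched.

```julia
# Split an axis of length `len` into consecutive ranges of at most `c` elements.
chunkranges(len::Int, c::Int) = [i:min(i + c - 1, len) for i in 1:c:len]

# Reduce `data` chunk by chunk; `init` must be a neutral element of `op`.
function chunked_mapreduce(f, op, data::AbstractArray{T,N},
                           chunksize::NTuple{N,Int}; init) where {T,N}
    # One vector of ranges per dimension.
    perdim = map(chunkranges, size(data), chunksize)
    acc = init
    # Visit every combination of per-dimension ranges, i.e. every chunk.
    for block in Iterators.product(perdim...)
        chunk = data[block...]   # for a disk-backed array: one contiguous read
        acc = op(acc, mapreduce(f, op, chunk; init=init))
    end
    return acc
end
```

For example, `chunked_mapreduce(abs2, +, reshape(collect(1:24), 4, 6), (3, 4); init=0)` should agree with `sum(abs2, 1:24)`, while only ever holding a block of at most 3×4 elements in memory at once.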

Note that the last time I looked at BlockArrays, it could not really deal with out-of-core data and was more or less still assuming fast random access, either through caching or by holding arrays in memory. As I understood it, its main purpose was to support very sparse arrays by defining only the blocks that actually hold some data (e.g. banded matrices). It would be good to know whether it has since been extended to support our use cases, but I would still be sceptical.

So I would be very surprised if we could simply make an HDF5Dataset an AbstractBlockArray and get fast indexing, broadcasting, etc., but maybe it would work. However, if you want to try to reuse some of the types, like BlockIndex, and some of the iteration over Blocks, I think that would be a good idea.