meggart / DiskArrays.jl


Implement chunked `foreach`/non-allocating chunked iteration. #37

Open rafaqz opened 3 years ago

rafaqz commented 3 years ago

As pointed out in #15, broadcast can do pretty much everything. But one issue is that you have to call `collect`, which allocates an array?

In GeoData.jl I'm using broadcast to iterate over arrays, e.g. to calculate the index ranges of non-nodata values for a `trim` method. I use broadcast so that the operation is chunked, but I still need to call `collect` on the broadcast object, which allocates.

It would be good if there was a way to iterate over the object with chunking that didn't allocate. Maybe that's possible already?
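A minimal sketch of what such a chunked `foreach` could look like, written against plain Base Julia. `chunked_foreach` and `data_bounds` are hypothetical names, not DiskArrays.jl API, the chunk size is passed in by hand, and an in-memory matrix stands in for the array on disk. The point is only that one reusable chunk-sized buffer is enough for the trim use case:

```julia
# Hypothetical helper (not part of DiskArrays.jl): call `f(block, indices)` once per
# chunk, reusing a single chunk-sized buffer instead of collecting the whole array.
function chunked_foreach(f, A::AbstractMatrix, chunksize::NTuple{2,Int})
    buffer = Matrix{eltype(A)}(undef, chunksize)      # allocated once, reused
    for j0 in 1:chunksize[2]:size(A, 2), i0 in 1:chunksize[1]:size(A, 1)
        is = i0:min(i0 + chunksize[1] - 1, size(A, 1))
        js = j0:min(j0 + chunksize[2] - 1, size(A, 2))
        block = view(buffer, 1:length(is), 1:length(js))
        block .= view(A, is, js)                      # stands in for one disk read
        f(block, (is, js))
    end
    return nothing
end

# Hypothetical trim-style use: index ranges that enclose all non-nodata values.
function data_bounds(A::AbstractMatrix, nodata, chunksize::NTuple{2,Int})
    lo = [size(A, 1) + 1, size(A, 2) + 1]             # running minimum indices
    hi = [0, 0]                                       # running maximum indices
    chunked_foreach(A, chunksize) do block, idx
        is, js = idx
        for J in axes(block, 2), I in axes(block, 1)
            if block[I, J] != nodata
                lo[1] = min(lo[1], is[I]); hi[1] = max(hi[1], is[I])
                lo[2] = min(lo[2], js[J]); hi[2] = max(hi[2], js[J])
            end
        end
    end
    return lo[1]:hi[1], lo[2]:hi[2]
end
```

The callback receives the global index ranges along with the block, because a trim needs positions, not just values. Chunks are visited in whatever order is cheapest to read, so this only suits operations that do not depend on element order.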

rafaqz commented 3 years ago

Maybe we can achieve this in a more general way by ~using Iterators.Stateful to iterate over the chunks,~ caching the chunks along the column as we load them and swapping them out as required. An iterator would have to visit elements in the correct order, unlike broadcast.

On the first iteration we can allocate the required memory for the number of chunks along the column, and copy to it from disk when the iterator gets to the next chunk.

This may also resolve issues with methods like replace, which currently index linearly and hang.
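One way to read the caching idea, as a rough sketch in Base Julia only. `ordered_foreach` is a hypothetical name, the chunk size is supplied by hand, and an in-memory matrix stands in for the disk array. The buffer holds every chunk along the first dimension for the current band of columns, so elements can be visited in column-major order while each chunk is read once:

```julia
# Hypothetical sketch (not DiskArrays.jl API): visit elements in column-major order
# while reading each chunk only once.
function ordered_foreach(f, A::AbstractMatrix, chunksize::NTuple{2,Int})
    nrows, ncols = size(A)
    # Cache for one band of columns: all chunks along the first dimension.
    band = Matrix{eltype(A)}(undef, nrows, chunksize[2])
    for j0 in 1:chunksize[2]:ncols
        js = j0:min(j0 + chunksize[2] - 1, ncols)
        # Swap in the chunks of the next band; each assignment stands in for one disk read.
        for i0 in 1:chunksize[1]:nrows
            is = i0:min(i0 + chunksize[1] - 1, nrows)
            band[is, 1:length(js)] .= view(A, is, js)
        end
        # Walk the cached band in column-major order.
        for J in 1:length(js), i in 1:nrows
            f(band[i, J])
        end
    end
    return nothing
end
```

The cost is memory proportional to the number of rows times the chunk width, allocated once and reused for every band.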

meggart commented 2 years ago

I am still not sure I completely understand the issue here. What exactly is the use case for iteration that cannot be accomplished using reduce or reducedim? You can also do reductions over broadcasted objects, so it should not be necessary to allocate the full array.
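For reference, a small sketch of a reduction over a lazy broadcast object using only Base Julia. The array and the nodata value are made up for illustration, and whether the equivalent reduction on a DiskArray runs chunk by chunk is an assumption not shown here:

```julia
using Base.Broadcast: broadcasted, instantiate

# Illustrative in-memory data; -9999.0 is an assumed nodata value.
A = [1.0 -9999.0; 3.0 4.0]
nodata = -9999.0

# Build the broadcast lazily: nothing is materialized at this point.
lazy = instantiate(broadcasted(x -> x == nodata ? 0.0 : x, A))

# Reducing the lazy object applies the function element by element,
# without first collecting the broadcast result into a new array.
total = sum(lazy)
```

This covers order-independent reductions. It does not by itself give the ordered, element-by-element iteration discussed earlier in the thread.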