meggart / DiskArrays.jl

map is broken, including missing collect_similar for DiskGenerator #144

Open felixcremer opened 7 months ago

felixcremer commented 7 months ago

The DiskGenerator is missing an implementation of collect_similar, which prevents the filter function from working properly. See https://github.com/yeesian/ArchGDAL.jl/issues/409 for the details.

rafaqz commented 7 months ago

map is actually broken in a lot of ways, not just collect_similar

I'm keen to try my very old idea of caching a whole column of chunks for iterate instead of a single chunk. Then iteration can follow the normal order and all of these problems go away. We can just delete DiskGenerator and DiskZip completely.

(As in your full band iteration - it works fine if we have the whole column)
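To make the ordering problem concrete, here is a minimal sketch using a plain in-memory array as a stand-in for a chunked disk array. It does not call DiskArrays.jl at all; `chunk_order` and the chunk sizes are hypothetical names and values chosen for illustration. It shows that visiting elements one chunk at a time yields a different order than normal column-major iteration, and that the two agree once a "chunk" spans the whole first dimension.

```julia
# Plain in-memory stand-in for a chunked on-disk array (no DiskArrays.jl API used).
A = reshape(1:16, 4, 4)   # column-major values 1..16

# Hypothetical helper: visit elements one chunk at a time,
# with the chunks themselves traversed in column-major order.
function chunk_order(A, chunk)
    out = eltype(A)[]
    for cj in 1:chunk[2]:size(A, 2), ci in 1:chunk[1]:size(A, 1)
        rows = ci:min(ci + chunk[1] - 1, size(A, 1))
        cols = cj:min(cj + chunk[2] - 1, size(A, 2))
        append!(out, vec(A[rows, cols]))   # elements of this chunk, column-major
    end
    return out
end

chunk_order(A, (2, 2))[1:4]                   # → [1, 2, 5, 6]
chunk_order(A, (2, 2)) == vec(A)              # → false: chunk order differs from normal order
chunk_order(A, (size(A, 1), 2)) == vec(A)     # → true: a whole column of chunks restores it
```

Any code that pairs chunk-ordered values with normally ordered indices would place results in the wrong positions without raising an error, which is the kind of silent mismatch the whole-column cache is meant to avoid.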

meggart commented 6 months ago

> I'm keen to try my very old idea of caching a whole column of chunks for iterate instead of a single chunk. Then iteration can follow the normal order and all of these problems go away. We can just delete DiskGenerator and DiskZip completely.

Won't this very easily lead to out-of-memory errors for large multi-dimensional arrays, where you simply cannot keep all chunks along the first dimension in memory? It is very easy to construct examples where this fails.
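As a rough back-of-the-envelope sketch of such a case (the array and chunk sizes below are hypothetical, picked only to illustrate the scaling, and are not taken from any real dataset):

```julia
# Hypothetical sizes, chosen only for illustration.
T = Float64
arraysize = (1_000_000, 1_000, 10)   # hypothetical array dimensions
chunksize = (1_000, 1_000, 10)       # hypothetical chunk dimensions

# Memory for one chunk versus all chunks along the first dimension.
single_chunk_bytes = prod(chunksize) * sizeof(T)
column_bytes = arraysize[1] * prod(chunksize[2:end]) * sizeof(T)

single_chunk_bytes / 1e9   # → 0.08 (GB)
column_bytes / 1e9         # → 80.0 (GB)
```

With these assumed numbers, a single chunk fits comfortably in memory, while the full column of chunks along the first dimension is a thousand times larger.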

Another option might just be to completely deprecate map, broadcast and reduce from DiskArrays and refer to DiskArrayEngine. Alternatively we could of course try to come up with more principled implementations, but I don't know if it would be worth the effort...

rafaqz commented 6 months ago

Does map work in DiskArrayEngine?

It just seems to me that we are silently returning the wrong answer for map currently, which is worse than OOM errors. I am really keen to not silently return the wrong result ever.

We could always fall back to loading partial chunks if the whole column is too large for memory.