Another thing to consider here, which @almaeder brought up the other day:
[x] Since we opted for block-wise contiguous storage, we can cache the requested block's data slice instead of naively recomputing the mask on every access. (From playing around a little bit, I estimate the total cache size should be on the order of kB, and the speedup on a cache hit should be very noticeable :sweat_smile: )
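
To make the idea concrete, here is a rough sketch of what such a cache could look like. Everything in it is made up for illustration (`BlockStore`, `block_ids`, `block_slice`, and the pure-Python mask are assumptions, not our actual data structure); it only shows that with contiguous blocks a cached `slice` per block is enough to skip the mask computation on repeated access.

```python
class BlockStore:
    """Hypothetical container with block-wise contiguous storage."""

    def __init__(self, data, block_ids):
        # data[i] belongs to block block_ids[i]; all entries of one
        # block are assumed to sit next to each other in `data`.
        self.data = list(data)
        self.block_ids = list(block_ids)
        self._slice_cache = {}      # block index -> slice into `data`
        self.mask_computations = 0  # counter, only here for the demo

    def _compute_slice(self, block):
        """Naive path: build a mask over all entries to locate the block."""
        self.mask_computations += 1
        mask = [b == block for b in self.block_ids]
        if not any(mask):
            raise KeyError(block)
        start = mask.index(True)
        stop = len(mask) - mask[::-1].index(True)
        return slice(start, stop)

    def block_slice(self, block):
        """Return the cached slice for `block`, computing it on a miss."""
        cached = self._slice_cache.get(block)
        if cached is None:
            cached = self._compute_slice(block)
            self._slice_cache[block] = cached
        return cached

    def get_block(self, block):
        return self.data[self.block_slice(block)]


store = BlockStore(
    data=[1.0, 2.0, 3.0, 4.0, 5.0],
    block_ids=[(0, 0), (0, 0), (0, 1), (1, 1), (1, 1)],
)
first = store.get_block((1, 1))   # cache miss, mask is computed
second = store.get_block((1, 1))  # cache hit, mask is skipped
```

The cache only stores one small `slice` object per requested block, which is why it should stay tiny. One caveat with this approach: if the sparsity pattern ever changes, the cache has to be cleared.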