Integrate transpose inside iterators

After some discussion, we think it would be nice to integrate the array transpose functionality into iarray_eval_iterblosc(). This would allow:

Transposing operands (e.g. views) in expressions on the fly.
Using prefilters to allow a parallel transpose operation.

We can also consider whether a similar concept could be implemented for read iterators, so that a transposed view can be materialized just by using the iterator. For that we would need postfilter support in Blosc2, so as to avoid copies.

A limitation of this proposal is that, for maximum performance (i.e. a parallel prefilter and no extra copies), the out_chunkshape must be equal to the chunkshape. That case should be quite common in practice, though.
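To illustrate why matching chunk shapes help, here is a minimal sketch in plain C, not ironArray or Blosc2 code. It assumes a 2-D row-major matrix split into square chunks, so that each output chunk of the transpose reads from exactly one input chunk and can be filled independently (which is what would let a prefilter run per chunk in parallel). The names transpose_chunk() and transpose_by_chunks() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill one output chunk of the transposed matrix.
 * src is a row-major matrix of rows x cols elements.
 * dst is a row-major matrix of cols x rows elements.
 * Output chunk (ci, cj) only reads from input chunk (cj, ci),
 * so different output chunks never depend on each other. */
void transpose_chunk(const double *src, double *dst,
                     size_t rows, size_t cols, size_t chunk,
                     size_t ci, size_t cj)
{
    assert(chunk > 0);
    size_t r0 = ci * chunk;  /* first dst row of this chunk */
    size_t c0 = cj * chunk;  /* first dst column of this chunk */
    for (size_t r = r0; r < r0 + chunk && r < cols; r++) {
        for (size_t c = c0; c < c0 + chunk && c < rows; c++) {
            dst[r * rows + c] = src[c * cols + r];
        }
    }
}

/* Serial driver: each call to transpose_chunk() is independent,
 * so this double loop is the part that could be distributed over threads. */
void transpose_by_chunks(const double *src, double *dst,
                         size_t rows, size_t cols, size_t chunk)
{
    assert(chunk > 0);
    size_t nci = (cols + chunk - 1) / chunk;  /* chunk rows in dst */
    size_t ncj = (rows + chunk - 1) / chunk;  /* chunk columns in dst */
    for (size_t ci = 0; ci < nci; ci++) {
        for (size_t cj = 0; cj < ncj; cj++) {
            transpose_chunk(src, dst, rows, cols, chunk, ci, cj);
        }
    }
}
```

If the output chunks had a different shape, one output chunk would straddle several input chunks, and each worker would have to decompress and copy from more than one of them.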

After raising the priority of transposition (as matmul + transpose can be useful for some simple deep learning use cases), we have had another discussion and arrived at the following considerations:

1) Implementing a true transposed view and reintroducing the existing code for computing the transpose on the fly via iarray_get_slice_buffer() should be low-hanging fruit. That means we could immediately use transposed views in both expressions and matmul. The downside is that we cannot use parallelism when computing iarray_get_slice_buffer(); however, this is not too bad because, e.g. for matmul, the bottleneck is in the MKL evaluation, not in the iarray_get_slice_buffer() operations.

2) We could implement the parallel evaluation of a physical transpose (as opposed to a view) by using the existing iarray_eval_iterblosc() in combination with a Blosc prefilter and an enhanced version of iarray_get_slice_buffer(). However, this would be mostly useful for benchmarking and demo purposes. In general, people should prefer to use transposes as views and use them in expressions or matmul operations.

3) We could implement a full-fledged parallel version of iarray_get_slice_buffer() that would let iarray_eval_iterblosc() and iarray_iter_read_*() compute transpositions in parallel. However, this would require postfilter support in C-Blosc2 (not yet implemented) and also quite a big improvement of iarray_get_slice_buffer(). Also, there is no guarantee that a parallel iarray_get_slice_buffer() would significantly accelerate expression or matmul evaluations.

Given this, we would like to implement 1) (the low-hanging fruit) for ironArray 1.0 and consider implementing 2) and 3) for 2.0. I'll file a new ticket for implementing just 1).
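To make option 1) a bit more concrete, here is a rough sketch of the idea in plain C, under the assumption that a transposed view is just the same buffer with shape and strides swapped, and that reading a slice of the view into a contiguous buffer does the transposition on the fly. The type view2d_t and the functions view_from_rowmajor(), view_transpose() and view_get_slice() are hypothetical names for illustration; this is not the actual iarray_get_slice_buffer() code, and it ignores chunking and compression.

```c
#include <assert.h>
#include <stddef.h>

/* A 2-D view over a buffer it does not own. Strides are in elements. */
typedef struct {
    const double *data;
    size_t shape[2];
    size_t strides[2];
} view2d_t;

/* View over a contiguous row-major matrix of rows x cols elements. */
view2d_t view_from_rowmajor(const double *data, size_t rows, size_t cols)
{
    view2d_t v = { data, { rows, cols }, { cols, 1 } };
    return v;
}

/* Transposed view: no data is moved, only shape and strides are swapped. */
view2d_t view_transpose(view2d_t v)
{
    view2d_t t = { v.data,
                   { v.shape[1], v.shape[0] },
                   { v.strides[1], v.strides[0] } };
    return t;
}

/* Copy the slice [start, stop) of the view into `out` as a contiguous
 * row-major block. For a transposed view this is where the transposition
 * actually happens, one slice at a time. */
void view_get_slice(view2d_t v, const size_t start[2], const size_t stop[2],
                    double *out)
{
    assert(start[0] <= stop[0] && stop[0] <= v.shape[0]);
    assert(start[1] <= stop[1] && stop[1] <= v.shape[1]);
    size_t k = 0;
    for (size_t i = start[0]; i < stop[0]; i++) {
        for (size_t j = start[1]; j < stop[1]; j++) {
            out[k++] = v.data[i * v.strides[0] + j * v.strides[1]];
        }
    }
}
```

With something along these lines, a consumer such as an expression evaluator or a matmul block loop would only ever ask for slices, and would not need to know whether the operand is a transposed view.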
