NOT A PRIORITY. Currently, a general reordering is split into three simpler ones. Each simple reordering copies the data from the tensor to scratch and back, so that each one is complete and leaves its result in the tensor data array. However, the individual simple reorderings do not need to be complete in this sense; only the end result of the three (or sometimes fewer) does. Some of the intermediate copies can therefore be skipped, which I estimate could speed up the reordering by up to 20-30%.
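The idea can be sketched roughly as follows. This is only an illustration under assumptions, not the actual implementation: the function names (`apply_step`, `reorder_complete_each_step`, `reorder_ping_pong`) are hypothetical, each simple reordering is modelled as a plain index map over a flat list, and the returned value counts element moves, not run time. It assumes a simple reordering can read from either buffer and write to the other.

```python
def apply_step(src, dst, index_map):
    """One simple reordering: element i of src goes to position index_map[i] of dst."""
    for i, j in enumerate(index_map):
        dst[j] = src[i]


def reorder_complete_each_step(data, scratch, steps):
    """Current scheme (as described above): every step goes to scratch and back."""
    moves = 0
    for index_map in steps:
        apply_step(data, scratch, index_map)  # tensor -> scratch, reordered
        data[:] = scratch                     # scratch -> tensor, plain copy back
        moves += 2 * len(data)
    return moves


def reorder_ping_pong(data, scratch, steps):
    """Hypothetical variant: alternate buffer roles, copy back at most once."""
    src, dst = data, scratch
    moves = 0
    for index_map in steps:
        apply_step(src, dst, index_map)
        src, dst = dst, src                   # the result now lives in src
        moves += len(data)
    if src is not data:                       # odd number of steps: result is in scratch
        data[:] = src
        moves += len(data)
    return moves


# Three arbitrary permutations of six elements, standing in for the simple reorderings.
steps = [
    [1, 0, 3, 2, 5, 4],
    [2, 3, 4, 5, 0, 1],
    [5, 4, 3, 2, 1, 0],
]
a, b = list(range(6)), list(range(6))
moves_now = reorder_complete_each_step(a, [None] * 6, steps)
moves_sketch = reorder_ping_pong(b, [None] * 6, steps)
print(a == b, moves_now, moves_sketch)
```

In this toy model, three steps need 36 element moves in the current scheme and 24 when the buffers alternate, and with two steps no final copy back is needed at all. How much of that carries over to real run time depends on the cost of the reordering passes relative to the plain copies, so the 20-30% figure above remains an estimate.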