ParadaCarleton opened 3 years ago
Right, `@tensor` will often win, on cases they both handle, on large arrays. But not always -- sometimes it needs to call `permutedims`, which is quite expensive, while `@tullio` can fuse that into the multiplication.

To automatically choose the best, you'd need some cost model, and I don't know how to write that. Or, like FFTW, you could try both on the first run of a given size, on a given machine. These both seem pretty complicated, though!
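To make the fusion point concrete, here is a small sketch of a contraction where the left index of `A` is the summed one, so a BLAS-backed path may first need a transposed copy. Whether `@tensor` actually calls `permutedims` here depends on its internals and version, so treat this as an illustration, not a claim about the current implementation:

```julia
using Tullio, TensorOperations

A = rand(100, 100)
B = rand(100, 100)

# Both compute C[i,j] = sum_k A[k,i] * B[k,j].
@tullio C[i, j] := A[k, i] * B[k, j]   # loops can fuse the index reversal directly
@tensor D[i, j] := A[k, i] * B[k, j]   # a gemm path may permute A into a copy first

C ≈ D   # same result either way; only the intermediate copies differ
```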
Is there not a way to check ahead of time whether the operation would require `permutedims`? If that's the case, I think you can use some reasonable heuristics. For instance, use `@tensor` if the operation would require `permutedims` and the tensor has more than 2,000 elements; otherwise, use `@tullio`.
Could check array sizes vs number of cores * L2 cache size per core.
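A minimal sketch of that cache check, assuming a fixed L2 size per core since Base Julia does not expose cache geometry (a package like CpuId.jl or Hwloc.jl could supply the real value); the constant and function names here are hypothetical:

```julia
# Assumed L2 size per core, in bytes -- tune or query for the actual machine.
const L2_PER_CORE = 512 * 1024

# True when all operands together fit in the combined L2 of the cores in use.
fits_in_cache(arrays...) = sum(sizeof, arrays) <= Sys.CPU_THREADS * L2_PER_CORE

# Dispatch idea: small, cache-resident problems to the loop-based kernel,
# larger ones to the BLAS-backed path.
choose(arrays...) = fits_in_cache(arrays...) ? :tullio : :tensor
```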
@mcabbott any thoughts on the above?
Not really, I'm sorry!
I guess I don't think very simple heuristics would work all that well, and that anything more complicated would need a lot of tuning, which doesn't sound like much fun. And is one more layer of complexity.
For example, there's a folder full of benchmarks from last year. TensorOperations does some clever stuff for `permutedims`, which results in it being 5x faster at some intermediate sizes (for some permutations):
https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/permute4-0.2.6-1.5.2.png
I don't have a model for that, and I'm sure it would have to depend on a lot of details about the machine. Let alone a model which could extrapolate to all combinations of weird operations.
I think the FFTW-style test-and-remember would actually be much easier to write. At the cost of some startup (or upgrade) latency. I wonder if anyone has made a package to do that? Like Memoize but for algorithms.
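A rough sketch of that test-and-remember idea, keyed on problem size; every name here is hypothetical, not an existing package API. The first call for a given key pays for timing both implementations, and later calls reuse the recorded winner:

```julia
# Cache of which backend won, per size key (e.g. a tuple of array sizes).
const WINNER = Dict{Any, Symbol}()

function pick(key, f_tullio, f_tensor)
    # On the first call for this key, time both once and remember the winner.
    # Note this runs the work an extra time on that first call.
    choice = get!(WINNER, key) do
        t1 = @elapsed f_tullio()
        t2 = @elapsed f_tensor()
        t1 <= t2 ? :tullio : :tensor
    end
    choice === :tullio ? f_tullio() : f_tensor()
end
```

A real version would want warm-up runs and per-machine persistence, much like FFTW's plans, which is where most of the effort would go.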
Extending LV to do smart Octavian-style blocking also wouldn't be all that hard, so I think effort would be better spent on the right solution than more hacks to better pick between mediocre existing solutions.
Correct me if I'm wrong, but I believe that LoopVectorization does better than TensorOps for small matrices, while `@tensor` outperforms for large matrices because it calls BLAS. Would it maybe be possible to have a macro that switches between `@tullio` and `@tensor` depending on which is more efficient?