mcabbott / Tullio.jl

MIT License

Way to Switch Between LoopVectorization+TensorOperations Dynamically #126

Open ParadaCarleton opened 3 years ago

ParadaCarleton commented 3 years ago

Correct me if I'm wrong, but I believe that LoopVectorization does better than TensorOps for small matrices, while @tensor outperforms for large matrices because it calls BLAS. Would it maybe be possible to have a macro that switches between @tullio and @tensor depending on which is more efficient?

mcabbott commented 3 years ago

Right, @tensor will often win, on cases they both handle, on large arrays. But not always -- sometimes it needs to call permutedims which is quite expensive, while @tullio can fuse that into the multiplication.

To automatically choose the best, you'd need some cost model, and I don't know how to write that. Or, like FFTW, you could try both on the first run of a given size, on a given machine. These both seem pretty complicated, though!

ParadaCarleton commented 3 years ago

> Right, @tensor will often win, on cases they both handle, on large arrays. But not always -- sometimes it needs to call permutedims which is quite expensive, while @tullio can fuse that into the multiplication.
>
> To automatically choose the best, you'd need some cost model, and I don't know how to write that. Or, like FFTW, you could try both on the first run of a given size, on a given machine. These both seem pretty complicated, though!

Is there not a way to check ahead of time whether the operation would require permutedims? If that's the case, I think you can use some reasonable heuristics. For instance, use @tensor if the operation would require permutedims and the tensor has more than 2,000 elements; otherwise, use @tullio.
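A minimal sketch of that heuristic, assuming the macro could detect the permutation at expansion time (the `needs_permute` flag, the function name, and the 2,000-element threshold are all illustrative, not part of either package's API):

```julia
# Hypothetical dispatch helper: pick a backend symbol using the
# heuristic above. Whether a permutedims would be needed is assumed
# to be known from inspecting the index pattern.
function choose_backend(a::AbstractArray; needs_permute::Bool=false)
    if needs_permute && length(a) > 2_000
        return :tensor   # large + permutation: let BLAS-backed @tensor take it
    else
        return :tullio   # small, or the permutation fuses into the loop nest
    end
end
```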

chriselrod commented 3 years ago

Could check array sizes vs number of cores * L2 cache size per core.
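One way to read that suggestion as code (the 1 MiB L2-per-core figure is a hard-coded guess; real code would query the topology, e.g. via Hwloc.jl):

```julia
# Illustrative only: compare the total working-set size of the
# operands against the aggregate L2 cache across all cores.
const L2_PER_CORE = 1 << 20  # bytes per core, assumed

fits_in_l2(arrays...) =
    sum(sizeof, arrays) <= Sys.CPU_THREADS * L2_PER_CORE
```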

ParadaCarleton commented 3 years ago

@mcabbott any thoughts on the above?

mcabbott commented 3 years ago

Not really, I'm sorry!

I guess I don't think very simple heuristics would work all that well, and that anything more complicated would need a lot of tuning, which doesn't sound like much fun. And is one more layer of complexity.

For example, there's a folder full of benchmarks from last year. TensorOperations does some clever stuff for permutedims, which results in it being 5x faster at some intermediate sizes (for some permutations): https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/permute4-0.2.6-1.5.2.png I don't have a model for that, and I'm sure it would have to depend on a lot of details about the machine. Let alone a model which could extrapolate to all combinations of weird operations.

I think the FFTW-style test-and-remember would actually be much easier to write. At the cost of some startup (or upgrade) latency. I wonder if anyone has made a package to do that? Like Memoize but for algorithms.
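A rough sketch of what that FFTW-style test-and-remember could look like (the names and the timing strategy here are assumptions; nothing in Tullio does this today):

```julia
# Cache the winning implementation per input size, FFTW-plan style.
# `candidates` maps a backend name to a callable; the first call at a
# given size times each one, later calls reuse the recorded winner.
const PLAN = Dict{Any,Symbol}()

function fastest(candidates::Dict{Symbol,<:Function}, args...)
    key = map(size, args)
    backend = get!(PLAN, key) do
        times = [name => @elapsed(f(args...)) for (name, f) in candidates]
        first(argmin(last, times))   # name of the fastest candidate
    end
    candidates[backend](args...)
end
```

A real version would also want to amortize noise (run each candidate a few times) and persist the plan between sessions, which is where the startup/upgrade latency mentioned above comes in.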

chriselrod commented 3 years ago

Extending LV to do smart Octavian-style blocking also wouldn't be all that hard, so I think effort would be better spent on the right solution than more hacks to better pick between mediocre existing solutions.