Open ChrisRackauckas opened 2 years ago
I did profile some approaches, but hadn't settled on a final design. It turned out that the preallocating version was slower than the current version, and at that point I decided to postpone this (see e.g. https://github.com/JuliaLang/julia/issues/39566). I'm still not entirely sure whether it would be worth the effort.
OpenBLAS gives odd results on Ryzen. I would benchmark with MKL to get a better view of the real performance, and let people use libblastrampoline on Ryzen.
MKL shows the same behaviour for that example.
This came up in discussions with @frankschae. Many of these core functions allocate. A version that lets you pre-build the cache and then reuse it can help with performance, particularly in multithreaded contexts.
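To make the request concrete, here is a minimal sketch of the cache-reuse pattern. It is not this package's API: `WorkCache`, `alloc_cache`, `apply_twice`, and `apply_twice!` are hypothetical names, and the kernel (two matrix-vector products) is only a stand-in for a core function that allocates temporaries. The only library call assumed is `LinearAlgebra.mul!`.

```julia
using LinearAlgebra

# Hypothetical cache type: holds the scratch buffer the kernel needs.
struct WorkCache{T}
    tmp::Vector{T}
end

# Build the cache once, sized for the problem.
alloc_cache(::Type{T}, n::Integer) where {T} = WorkCache{T}(Vector{T}(undef, n))

# Allocating version: every call creates two temporary vectors.
apply_twice(A, x) = A * (A * x)

# In-place version: reuses cache.tmp and writes the result into `out`.
function apply_twice!(out, cache::WorkCache, A, x)
    mul!(cache.tmp, A, x)      # tmp = A * x
    mul!(out, A, cache.tmp)    # out = A * tmp
    return out
end

n = 50
A = rand(n, n)
x = rand(n)
cache = alloc_cache(Float64, n)
out = similar(x)
apply_twice!(out, cache, A, x)

# Multithreaded use: one cache per work item, so tasks never share scratch space.
xs = [rand(n) for _ in 1:8]
outs = [similar(x) for _ in 1:8]
caches = [alloc_cache(Float64, n) for _ in 1:8]
Threads.@threads for i in eachindex(xs)
    apply_twice!(outs[i], caches[i], A, xs[i])
end
```

Keeping a separate cache per work item avoids both repeated allocation and data races on the scratch buffer. As the profiling mentioned in this thread suggests, a preallocating version is not automatically faster, so any real design would need to be benchmarked against the allocating one.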