hunse closed this 3 years ago
When testing this on my RTX 3060 for my specific model, I'm finding that the FFT implementation is faster than the raw conv implementation. So which one is fastest does seem to depend on the specific hardware/CUDA/TensorFlow combination. I'm hoping to test across more hardware soon, but I think for the foreseeable future, we're looking at keeping both implementations around. The best would be to autotune it, but that's probably a good chunk more work.
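For reference, the kind of autotuning I have in mind could be as simple as timing both implementations on a representative input and picking the winner. This is just an illustrative NumPy sketch (the function names `conv_direct`, `conv_fft`, and `autotune` are hypothetical, not anything in this PR), not the actual TensorFlow code:

```python
import timeit

import numpy as np


def conv_direct(x, h):
    # Direct (raw) linear convolution.
    return np.convolve(x, h)


def conv_fft(x, h):
    # FFT-based linear convolution: pad to the next power of two,
    # multiply spectra, and truncate to the full-convolution length.
    n = len(x) + len(h) - 1
    size = 1 << (n - 1).bit_length()
    return np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(h, size), size)[:n]


def autotune(x, h, repeats=5):
    # Time each implementation on the given shapes and return the
    # name of the faster one. `min` over repeats reduces timer noise.
    impls = {"direct": conv_direct, "fft": conv_fft}
    times = {
        name: min(timeit.repeat(lambda f=f: f(x, h), number=1, repeat=repeats))
        for name, f in impls.items()
    }
    return min(times, key=times.get)
```

In practice the tuning result would be cached per input shape, since rerunning the timing on every call would defeat the purpose.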
I think this is ready to go. In the end, I had to add two ways of doing the raw convolution, since the one that's faster on GPUs (using NCHW format) doesn't work on CPU.
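To make the layout issue concrete: the two code paths differ only in whether channels come last (NHWC/NWC, the CPU-friendly layout) or first (NCHW/NCW, the GPU-friendly layout), and one can always be expressed via the other by transposing. A minimal NumPy sketch of that equivalence for a per-channel 1D filter (the function names here are hypothetical, and this stands in for the TensorFlow ops, which restrict which `data_format` values each device supports):

```python
import numpy as np


def conv_channels_last(x, h):
    # x: (time, channels), h: (kernel, channels).
    # Convolve each channel with its own filter (channels-last layout).
    return np.stack(
        [np.convolve(x[:, c], h[:, c], mode="full") for c in range(x.shape[1])],
        axis=-1,
    )


def conv_channels_first(x, h):
    # x: (channels, time) -- channels-first layout.
    # Transpose into channels-last, convolve, and transpose back.
    return conv_channels_last(x.T, h).T
```

The real trade-off is that on GPU the channels-first path avoids the transposes inside the op, while on CPU TensorFlow simply doesn't implement that layout for these convolutions, hence keeping both paths.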
Fixups look good to me. When you've got all the tests passing, feel free to merge.
New commits lgtm :+1:
Add the ability to run the impulse response convolution as a raw convolution, rather than using the FFT. In practice, I've found that this can speed things up, though it also appears to require more CPU memory (which is surprising).
I also added a profiling test.
Based on #40.
TODO: