hunse closed this 3 years ago
When testing this on my RTX 3060 for my specific model, I'm finding that the FFT implementation is faster than the raw conv implementation. So which one is fastest does seem to depend on the specific hardware/CUDA/TensorFlow combination. I'm hoping to test across more hardware soon, but I think for the foreseeable future, we're looking at keeping both implementations around. The best would be to autotune it, but that's probably a good chunk more work.
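For reference, the kind of autotuning I have in mind could be as simple as timing both implementations on a representative input and picking the winner. This is just an illustrative NumPy sketch (the function names `conv_direct`, `conv_fft`, and `autotune` are hypothetical, not anything in this PR), not the actual TensorFlow code:

```python
import timeit

import numpy as np


def conv_direct(x, h):
    # Direct (raw) linear convolution.
    return np.convolve(x, h)


def conv_fft(x, h):
    # FFT-based linear convolution: pad to the next power of two,
    # multiply spectra, and truncate to the full-convolution length.
    n = len(x) + len(h) - 1
    size = 1 << (n - 1).bit_length()
    return np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(h, size), size)[:n]


def autotune(x, h, repeats=5):
    # Time each implementation on the given shapes and return the
    # name of the faster one. `min` over repeats reduces timer noise.
    impls = {"direct": conv_direct, "fft": conv_fft}
    times = {
        name: min(timeit.repeat(lambda f=f: f(x, h), number=1, repeat=repeats))
        for name, f in impls.items()
    }
    return min(times, key=times.get)
```

In practice the tuning result would be cached per input shape, since rerunning the timing on every call would defeat the purpose.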
I think this is ready to go. In the end, I had to add two ways of doing the raw convolution, since the one that's faster on GPUs (using NCHW format) doesn't work on CPU.
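To make the layout issue concrete: the two code paths differ only in whether channels come last (NHWC/NWC, the CPU-friendly layout) or first (NCHW/NCW, the GPU-friendly layout), and one can always be expressed via the other by transposing. A minimal NumPy sketch of that equivalence for a per-channel 1D filter (the function names here are hypothetical, and this stands in for the TensorFlow ops, which restrict which `data_format` values each device supports):

```python
import numpy as np


def conv_channels_last(x, h):
    # x: (time, channels), h: (kernel, channels).
    # Convolve each channel with its own filter (channels-last layout).
    return np.stack(
        [np.convolve(x[:, c], h[:, c], mode="full") for c in range(x.shape[1])],
        axis=-1,
    )


def conv_channels_first(x, h):
    # x: (channels, time) -- channels-first layout.
    # Transpose into channels-last, convolve, and transpose back.
    return conv_channels_last(x.T, h).T
```

The real trade-off is that on GPU the channels-first path avoids the transposes inside the op, while on CPU TensorFlow simply doesn't implement that layout for these convolutions, hence keeping both paths.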
Fixups look good to me. When you've got all the tests passing, feel free to merge.
New commits lgtm :+1:
Add the ability to run the impulse response convolution as a raw convolution, rather than using the FFT. In practice, I've found that this can speed things up, though it also appears to require more CPU memory (which is surprising).
I also added a profiling test.
Based on #40.
TODO: