toinsson / pysdtw

Torch implementation of Soft-DTW, supports CUDA.
MIT License

Separate Framework-Agnostic CUDA Implementation #4

Open · stellarpower opened 2 weeks ago

stellarpower commented 2 weeks ago

Hi,

I see there are a few SDTW implementations around. I'm mainly using Keras right now, and there are a few implementations there too. The one I have used so far is very slow to run: I reckon it is jumping back onto the CPU to interface with portions written in Cython, and the rest is not very accessible to the graph compiler, so the loops aren't unrolled on the GPU and it has to be executed eagerly.

I don't have any experience so far interfacing directly with CUDA, but I was wondering how feasible it would be to separate the CUDA implementation of the algorithm from the particular framework using it, and then just interface in, be it from Torch or TensorFlow - or potentially some other use case. As they say: Don't Repeat Yourself.

This would probably apply to CPU use too. I have worked with Pybind11 a few times, and I don't know how well it would interface with either libTorch or TensorFlow tensors, but I presume that, on the CPU side, whatever the graph compilers can come up with is probably no better than a hand-rolled C++ version, if such a thing is around.

I'm guessing the differentiability would be the main thing - do you know if something like this would be possible? I'm not exactly experienced with neural networks, but I guess that if I can use cuDNN there must be some way to perform the backpropagation within the same kernel. In Ceres we can use expression templates to differentiate automatically, so I guess something similar must exist.
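(Having skimmed the original Cuturi & Blondel soft-DTW paper, my understanding is that autodiff may not even be needed here: the forward pass replaces the hard min of classic DTW with a soft-min,

$$\operatorname{softmin}_\gamma(a, b, c) = -\gamma \log\left(e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\right)$$

and the paper derives the gradient as a second dynamic program run backwards over the same lattice - so the backward pass can be a hand-written kernel rather than something a framework has to derive.)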

Any thoughts? I don't have time for side projects right now - but if this ends up making my network learn better, then it may well be worth the time put in.

Thanks

toinsson commented 2 weeks ago

hello again @stellarpower,

the meat of pysdtw really is the two CUDA kernels: the functions compute_softdtw_cuda and compute_softdtw_backward_cuda. If you check the code, they only depend on numba's cuda module and math. The rest of the library is just convenience code: PyTorch integration as an nn.Module, support for packed sequences, and availability on PyPI.
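To give an idea of their shape, here is a stripped-down sketch of the forward recurrence as a numba kernel. It is illustrative only: the real compute_softdtw_cuda parallelises over the anti-diagonals of each matrix, whereas this simplified version just assigns one thread per batch element:

```python
import math
from numba import cuda

@cuda.jit
def softdtw_naive_kernel(D, gamma, R):
    # D: (B, N, M) pairwise distances.
    # R: (B, N+1, M+1) accumulated costs, pre-filled with +inf
    #    except R[:, 0, 0] = 0.
    b = cuda.grid(1)  # one thread per batch element (simplification)
    if b >= D.shape[0]:
        return
    for i in range(1, D.shape[1] + 1):
        for j in range(1, D.shape[2] + 1):
            # soft-min over the three predecessors, computed stably
            r0 = -R[b, i - 1, j - 1] / gamma
            r1 = -R[b, i - 1, j] / gamma
            r2 = -R[b, i, j - 1] / gamma
            rmax = max(max(r0, r1), r2)
            rsum = (math.exp(r0 - rmax)
                    + math.exp(r1 - rmax)
                    + math.exp(r2 - rmax))
            R[b, i, j] = D[b, i - 1, j - 1] - gamma * (math.log(rsum) + rmax)
```

compute_softdtw_backward_cuda is the same idea run in reverse over the lattice to produce the gradient matrix.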

After some quick googling, it appears that PyTorch has good integration with CUDA, especially from Python - and that is what pysdtw leverages. On the other hand, I could not find any examples of calling Python CUDA kernels from TensorFlow - only C++ (https://www.tensorflow.org/guide/create_op#use_the_op_in_python).
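Concretely, the glue on the PyTorch side is roughly a torch.autograd.Function. This is only a sketch - _launch_forward and _launch_backward below are hypothetical stand-ins for the code that launches compute_softdtw_cuda and compute_softdtw_backward_cuda, and the exact array layout in pysdtw differs:

```python
import torch
from numba import cuda

class _SoftDTW(torch.autograd.Function):
    @staticmethod
    def forward(ctx, D, gamma):
        # torch CUDA tensors implement __cuda_array_interface__, so numba
        # kernels can read and write them in place, without copies.
        # Assume the hypothetical wrappers allocate and return torch
        # tensors on the same device.
        R = _launch_forward(cuda.as_cuda_array(D.detach()), gamma)
        ctx.save_for_backward(D, R)
        ctx.gamma = gamma
        return R[:, -2, -2]  # per-sequence soft-DTW values (layout assumed)

    @staticmethod
    def backward(ctx, grad_output):
        D, R = ctx.saved_tensors
        E = _launch_backward(cuda.as_cuda_array(D), cuda.as_cuda_array(R), ctx.gamma)
        return grad_output.view(-1, 1, 1) * E, None  # no gradient for gamma
```

It is this part that is cheap to write in PyTorch and that I could not find an equivalent of for TensorFlow.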

So, a framework-agnostic Python CUDA implementation already exists (compute_softdtw_cuda and compute_softdtw_backward_cuda), but wrapping those kernels might not be easy for frameworks other than PyTorch.

stellarpower commented 2 weeks ago

Hi,

Right, yes, that's more or less what I mean: there are several versions out there and, be it in Python or something else, I think it would be nice to separate the kernels into their own package, for CPU and GPU, that implements the algorithm; then Torch or TensorFlow or other bindings can live in their own package or repository and call in.

I have been looking, and it seems there may be some ways to integrate the numba JIT code into TensorFlow, but it's not looking that likely; the standard way does seem to be compiling an op from C++ rather than taking the PTX output and pulling it in at runtime.
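i.e. following that guide, the Python side ends up loading a compiled shared object rather than JIT-ing a kernel - something along these lines, where the op and library names are made up for illustration:

```python
import tensorflow as tf

# Hypothetical: a soft-DTW op written and compiled as a custom TensorFlow op
# per https://www.tensorflow.org/guide/create_op, built into a shared library.
soft_dtw_module = tf.load_op_library('./soft_dtw_op.so')
loss = soft_dtw_module.soft_dtw(distances, gamma=0.1)
```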

Currently I'm seeing whether implementing the backward pass explicitly in the Keras version improves performance (sketch below), as I expect the autodifferentiation is going to be the problem. If not, I may implement the algorithm in C++ and write a kernel for it too, with the aim of allowing calls in from other languages/frameworks. Will keep you posted.
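The shape of what I'm testing is roughly this; _sdtw_forward and _sdtw_grad are placeholders for host-side (NumPy/numba) implementations of the forward recurrence and the analytic backward pass:

```python
import tensorflow as tf

@tf.custom_gradient
def soft_dtw_loss(D):
    # D: (B, N, M) batch of pairwise distance matrices.
    # _sdtw_forward / _sdtw_grad are hypothetical host-side implementations
    # of the forward recurrence and the analytic backward pass.
    R = tf.numpy_function(_sdtw_forward, [D], tf.float32)

    def grad(dy):
        E = tf.numpy_function(_sdtw_grad, [D, R], tf.float32)
        return tf.reshape(dy, [-1, 1, 1]) * E  # broadcast dy over each matrix

    return R[:, -1, -1], grad
```

It still hops out to the host, but it swaps autodifferentiation through the whole recurrence for the analytic backward pass, which should tell me whether the autodiff really is the bottleneck.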

Thanks