This is error prone, as the user can easily implement a feature in C++, export to python, and forget about the kernel. To make this more convenient, we should try to write some thin wrappers so that the user has to explicitly instantiate a class, maybe one that is used by both CPU and GPU code paths (with some typedefs), and rather than one (or more) kernels. I don't think there's a way around needing these explicit templates due to the separation of the code, but this might make it easier to remember to do.
Also, look into why the CI can fail to raise a link error when the explicit instantiation is missing. This seems strange, but maybe is related to symbol visibility?
This is error prone, as the user can easily implement a feature in C++, export to python, and forget about the kernel. To make this more convenient, we should try to write some thin wrappers so that the user has to explicitly instantiate a class, maybe one that is used by both CPU and GPU code paths (with some typedefs), and rather than one (or more) kernels. I don't think there's a way around needing these explicit templates due to the separation of the code, but this might make it easier to remember to do.
Also, look into why the CI can fail to raise a link error when the explicit instantiation is missing. This seems strange, but maybe is related to symbol visibility?