huggingface / candle

Minimalist ML framework for Rust

How to Implement New Operators Using CUDA Host Functions Along with Thrust and CUB Libraries #2258

Open chenwanqq opened 1 month ago

chenwanqq commented 1 month ago

As stated, the CUDA code in the candle-kernels crate seems to contain only kernel functions. When I want to implement a new operator (such as nonzero), it seems I can only do the higher-level work in Rust, which means I cannot use Thrust's device_vector or CUB's Flagged APIs. This poses a significant challenge for implementing my algorithms: to implement nonzero with the current approach, for example, it seems I would have to reimplement primitives like exclusive_scan and scatter myself.

I am hoping for a better way to utilize the CUDA ecosystem!

Specifically, I'm interested in how to:

  1. Incorporate host functions into the CUDA code so that libraries like Thrust and CUB can be used (see the sketch below).
  2. Effectively leverage these libraries to implement algorithms and operators that are not natively supported in the current codebase.

Any guidance or best practices for achieving this would be greatly appreciated. (Translated from Chinese with an LLM, so it might read a little bit... formal ^_^)
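For concreteness, here is a minimal sketch of what such a host function could look like. It assumes the code lives in its own .cu translation unit compiled by nvcc, separately from candle-kernels' kernel-only build; the file name, function names, and f32-only signatures are hypothetical, chosen only to illustrate driving Thrust from host code over raw device pointers:

```cuda
// nonzero_thrust.cu -- a hypothetical sketch, not part of candle-kernels.
// Host functions that drive Thrust over raw device pointers supplied by the
// caller; the names and f32-only signatures are made up for illustration.
#include <thrust/device_ptr.h>
#include <thrust/count.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdint>

struct IsNonZero {
    __host__ __device__ bool operator()(float x) const { return x != 0.0f; }
};

// Counts the nonzero elements of a device buffer of length n.
extern "C" uint32_t count_nonzero_f32(const float *d_in, uint32_t n) {
    thrust::device_ptr<const float> in(d_in);
    return static_cast<uint32_t>(thrust::count_if(in, in + n, IsNonZero()));
}

// Writes the flat indices of the nonzero elements into d_out, which must be
// large enough to hold count_nonzero_f32(d_in, n) entries.
extern "C" void nonzero_indices_f32(const float *d_in, uint32_t n,
                                    uint32_t *d_out) {
    thrust::device_ptr<const float> in(d_in);
    thrust::device_ptr<uint32_t> out(d_out);
    thrust::counting_iterator<uint32_t> first(0), last(n);
    // copy_if with a stencil: keep index i whenever d_in[i] != 0.
    thrust::copy_if(first, last, in, out, IsNonZero());
}
```

In an actual integration, the caller would pass in buffers obtained from candle's CUDA storage, and d_out would typically be allocated after a first count_nonzero_f32 call so its size is known.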
chenwanqq commented 1 month ago

I have finished a GPU version of nonzero: candle-nonzero. It uses FFI to invoke CUDA functions. I'm still wondering what the best way is to integrate it into this project 🧐
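For reference, here is one possible shape for such an FFI boundary. This is a hypothetical sketch, not the candle-nonzero code: it exposes a C-ABI entry point built on cub::DeviceSelect::Flagged so the Rust side only needs a plain extern "C" declaration; the name nonzero_u32_f32 and its signature are made up for illustration.

```cuda
// nonzero_cub.cu -- a hypothetical sketch, NOT the candle-nonzero code.
// Exposes a C-ABI entry point so the Rust side can bind it with a plain
// extern "C" declaration; the name and signature are made up here.
#include <cub/device/device_select.cuh>
#include <cub/iterator/counting_input_iterator.cuh>
#include <cub/iterator/transform_input_iterator.cuh>
#include <cuda_runtime.h>
#include <cstdint>

struct NonZeroFlag {
    __host__ __device__ bool operator()(float x) const { return x != 0.0f; }
};

// Writes the flat indices of the nonzero elements of d_in into d_out and the
// number of selected indices into d_num_out (all device pointers).
// Returns the cudaError_t as an int so the caller can check for failures.
extern "C" int nonzero_u32_f32(const float *d_in, uint32_t n,
                               uint32_t *d_out, int32_t *d_num_out,
                               cudaStream_t stream) {
    cub::CountingInputIterator<uint32_t> indices(0);
    cub::TransformInputIterator<bool, NonZeroFlag, const float *>
        flags(d_in, NonZeroFlag());

    // CUB's usual two-phase pattern: first query the temp-storage size,
    // then allocate it and run the selection for real.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceSelect::Flagged(
        d_temp, temp_bytes, indices, flags, d_out, d_num_out,
        static_cast<int>(n), stream);
    if (err != cudaSuccess) return err;

    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;

    err = cub::DeviceSelect::Flagged(
        d_temp, temp_bytes, indices, flags, d_out, d_num_out,
        static_cast<int>(n), stream);
    cudaFree(d_temp);
    return err;
}
```

On the Rust side this would reduce to an extern "C" declaration plus a build.rs step that compiles the .cu file with nvcc (for example via the cc crate with .cuda(true)), which is one possible way to hook host-side Thrust/CUB code into a candle CUDA backend.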