tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

WebGPU Compute shader execution? #8314

Open DennisSmolek opened 6 days ago

DennisSmolek commented 6 days ago

While looking into the WebGPU backend and the execution example, I am left with a few questions.

I am currently working on porting the Open Image Denoise (OIDN) models to TensorFlow.js. Someone has already done this, but they aren't ready to share yet. It has also been done with CUDA compute and with Rust/HLSL compute shaders.

The pipeline currently would be:

  1. Go from CPU to GPU: execute the main rendering pipeline (compute, vertex, and fragment shaders) to produce the noisy image.
  2. Push the image onto a shared tensor storage buffer as input (GPU).
  3. Return to CPU
  4. Execute TensorFlow.js (with the buffer that is still on the GPU).
  5. TensorFlow.js runs the UFilter denoise on the buffer and stores the new image data in the storage buffer (still on the GPU).
  6. Return to CPU
  7. Issue a fullscreen-quad render pass using the image in the storage buffer (GPU).
  8. Return to CPU
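If the goal is mainly to avoid CPU downloads between steps 2-7, the tfjs-backend-webgpu package can already wrap an existing `GPUBuffer` as a tensor and hand the result back as a `GPUBuffer`. A minimal sketch of steps 2-5 (the function, the `noisyBuffer` name, and the `[1, height, width, 3]` input shape are my assumptions, not something from this issue):

```javascript
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';

// Sketch: keep the image resident on the GPU across render -> denoise.
// `noisyBuffer` is the GPUBuffer the render pipeline wrote (hypothetical name);
// the denoiser's input shape is assumed to be [1, height, width, 3].
async function denoiseOnGPU(noisyBuffer, model, height, width) {
  await tf.setBackend('webgpu');

  // The WebGPU backend accepts a WebGPUData object ({buffer, zeroCopy}),
  // wrapping the existing GPUBuffer without a CPU download.
  const input = tf.tensor(
      {buffer: noisyBuffer, zeroCopy: true},
      [1, height, width, 3], 'float32');

  const output = model.predict(input);

  // dataToGPU() returns the result as a GPUBuffer, again with no CPU
  // round trip; bind res.buffer in the fullscreen-quad pass afterwards.
  const res = output.dataToGPU();

  // Dispose res.tensorRef (plus input/output) once the render pass no
  // longer needs the buffer.
  return res.buffer;
}
```

This doesn't remove the JavaScript-side dispatch between passes, but the "return to CPU" steps then carry only command submission, not image data.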

With native libraries you can execute OIDN directly in compute shaders (though it is a total pain to set up), and the other examples (CUDA/HLSL) also execute the DNN in compute shaders without the return to the CPU.

I am curious whether there are any existing methods to reduce the round trips between the CPU and GPU. Even simply executing the TensorFlow.js process from a compute shader would be massive, as the only thing returned to the CPU would be the final buffer...
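For the final display step, if TensorFlow.js leaves the denoised image in a `GPUBuffer`, the fullscreen-quad pass can read it as a storage buffer directly, with no CPU copy. A sketch using standard WebGPU APIs (`device`, `pipeline`, `renderPassDescriptor`, and `denoisedBuffer` are all assumed to exist from the surrounding renderer):

```javascript
// Bind the denoised GPUBuffer straight into the fullscreen pass.
// All identifiers here are hypothetical stand-ins for the app's own objects.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{binding: 0, resource: {buffer: denoisedBuffer}}],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginRenderPass(renderPassDescriptor);
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.draw(3); // fullscreen triangle; fragment shader samples the buffer
pass.end();
device.queue.submit([encoder.finish()]);
```

Note that in WebGPU all dispatch is CPU-side command encoding anyway (a shader cannot launch other shaders), so "executing TensorFlow.js from a compute shader" isn't expressible; the practical win is keeping every intermediate buffer on the GPU, as above.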

I don't have the skills or understanding yet to make this work, but it is something I figured I would ask.