Batching appears ~15x faster (~3,200 ns per proposal vs ~50,000 ns per proposal) using the water_sampling_mc.py example script with a batch size of 250.
The single-proposal path is slower due to the while loop.
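To illustrate the difference, here is a minimal sketch (not the actual sampler kernels; `propose_single`, `propose_batch`, and the acceptance probability are hypothetical) of why a per-proposal while loop is slower than evaluating a batch at once: the loop pays full per-call overhead on every draw, while the batched variant amortizes it across one vectorized call.

```python
import numpy as np


def propose_single(rng, accept_prob=0.05):
    # Rejection-style while loop: keep drawing until a proposal is accepted.
    # Every iteration pays per-call overhead, which dominates when the
    # acceptance probability is low.
    n_draws = 0
    while True:
        n_draws += 1
        if rng.random() < accept_prob:
            return n_draws


def propose_batch(rng, batch_size=250, accept_prob=0.05):
    # Batched variant: evaluate all proposals in one vectorized draw and
    # take the first accepted one. Amortizes overhead across the batch.
    accepted = rng.random(batch_size) < accept_prob
    idx = np.flatnonzero(accepted)
    # If nothing in this batch was accepted, report the full batch size
    # of draws consumed (a caller would then request another batch).
    return int(idx[0]) + 1 if idx.size else batch_size


rng = np.random.default_rng(0)
print(propose_single(rng), propose_batch(rng))
```

Both return the number of draws consumed before the first acceptance, so their statistics match; only the per-draw overhead differs.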
There is some jank around the sampler kernels (not a clean abstraction; it was forced to fit this use case), but I'd be inclined to address that later and move on to batching the TIBDExchangeMover.