I am testing the gl64 NTT with log_n_size=17 in a concurrent environment.
I observed that the host-to-device data copy ranges from 20 µs to 6 ms. I think the underlying code does not use asynchronous copies, and the last line, gpu.sync(), blocks the CPU.
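To illustrate what I mean, here is a minimal sketch in plain CUDA (not this library's actual API; `gl64_t`, the buffer names, and the stream setup are only for illustration) where the H2D copy is enqueued on a stream and the host synchronizes only when the result is needed, instead of blocking on a whole-device sync:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative element type: one Goldilocks (gl64) element is a single 64-bit word.
using gl64_t = uint64_t;

int main() {
    const size_t n = size_t{1} << 17;           // log_n_size = 17
    const size_t bytes = n * sizeof(gl64_t);    // 2^17 * 8 B = 1 MiB per polynomial

    gl64_t* h_buf = nullptr;
    gl64_t* d_buf = nullptr;

    // Pinned (page-locked) host memory is needed for the copy to be truly
    // asynchronous with respect to the host thread.
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The enqueue returns immediately, so the CPU can keep working
    // (e.g., prepare the next batch) while the DMA transfer runs.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch the NTT kernel(s) on the same stream here ...

    // Synchronize only this stream, not the whole device, when the result is needed.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```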
Here is the problem: 20 µs, for example, is obviously impossible for a copy of that size, which indicates a faulty measurement methodology and a misunderstanding of some basics. And, again, we don't have the resources to correct that...
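For context: 2^17 gl64 elements are 2^17 × 8 B = 1 MiB, and assuming something like a PCIe 4.0 x16 link at roughly 25 GB/s, that transfer needs on the order of 40 µs at best, so a 20 µs reading most likely measures only the time to enqueue the asynchronous copy, not the transfer itself. A sketch of timing the copy with CUDA events (plain CUDA, not tied to this library's code) looks like this:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
    const size_t bytes = (size_t{1} << 17) * sizeof(uint64_t);  // 1 MiB of gl64 elements

    uint64_t *h_buf = nullptr, *d_buf = nullptr;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the copy on the same stream so the elapsed time
    // reflects when the DMA actually ran, not when the call was enqueued.
    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);

    // Without this (or an equivalent sync), a CPU timer stopped here would only
    // capture the launch overhead -- which is how a "20 us copy" can appear.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy: %.3f ms for %zu bytes\n", ms, bytes);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```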
Or, it would be better to provide a batch NTT function.
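Short of a dedicated batch API, one common workaround is to issue each transform on its own CUDA stream so the copy of one polynomial overlaps with the NTT kernel of another, and the host synchronizes only once at the end. A sketch under that assumption follows; `ntt_gl64_async` is a hypothetical stand-in for whatever per-stream NTT entry point the library exposes, and the buffer layout is illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

using gl64_t = uint64_t;

// Hypothetical stand-in for the library's per-stream NTT entry point; a real
// implementation would launch the gl64 NTT kernels on the given stream.
void ntt_gl64_async(gl64_t* /*d_data*/, uint32_t /*log_n*/, cudaStream_t /*stream*/) {}

// Process `batch` independent size-2^log_n polynomials, one stream each, so the
// H2D copy of one polynomial can overlap with the NTT kernel of another.
void run_batched_ntt(gl64_t** h_polys, gl64_t** d_polys, int batch, uint32_t log_n) {
    const size_t bytes = (size_t{1} << log_n) * sizeof(gl64_t);

    std::vector<cudaStream_t> streams(batch);
    for (int i = 0; i < batch; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < batch; ++i) {
        // Assumes h_polys[i] points to pinned host memory so the copies are truly async.
        cudaMemcpyAsync(d_polys[i], h_polys[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        ntt_gl64_async(d_polys[i], log_n, streams[i]);
        cudaMemcpyAsync(h_polys[i], d_polys[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    // Single synchronization point at the end instead of a per-transform gpu.sync().
    for (int i = 0; i < batch; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```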