Closed: yshekel closed this 3 weeks ago
Looks fine, but why not use 128 for all cases? I'm not aware of any benefit to using larger blocks.
I was wondering that too, but I assumed it was done that way because it performed better. Changing it everywhere would also mean measuring to confirm that nothing regressed, which I wanted to avoid for now.
This PR fixes an issue with large ecntt, where the CUDA blocks are too large to be assigned to SMs. The fix is to reduce the thread count per block and increase the block count in that case.
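
For illustration only, here is a minimal host-side sketch of the idea, not the code from this PR. The names `LaunchConfig` and `make_launch_config` are hypothetical, and the default cap of 128 threads per block is taken from the review discussion rather than from the actual change. It caps the block size and raises the block count so that the total number of threads still covers the work.

```cpp
#include <cstdint>

// Hypothetical helper: the struct and function names are illustrative,
// not taken from the PR.
struct LaunchConfig {
  std::uint32_t threads_per_block;
  std::uint32_t num_blocks;
};

// Cap the block size at max_threads_per_block (128 is an assumed default)
// and compensate with more blocks, so that
// num_blocks * threads_per_block >= total_threads.
inline LaunchConfig make_launch_config(std::uint64_t total_threads,
                                       std::uint32_t max_threads_per_block = 128)
{
  LaunchConfig cfg{0, 0};
  if (total_threads == 0 || max_threads_per_block == 0) {
    return cfg; // nothing to launch
  }

  // Small workloads fit in a single block that is smaller than the cap.
  cfg.threads_per_block = total_threads < max_threads_per_block
                            ? static_cast<std::uint32_t>(total_threads)
                            : max_threads_per_block;

  // Ceiling division: round up so the last partial block is included.
  cfg.num_blocks = static_cast<std::uint32_t>(
    (total_threads + cfg.threads_per_block - 1) / cfg.threads_per_block);
  return cfg;
}
```

A kernel launch would then pass `cfg.num_blocks` and `cfg.threads_per_block` as the grid and block dimensions. Because the last block may be only partially used, the kernel needs a bounds check on its global thread index.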