Closed stephen-hqxu closed 3 years ago
Currently, multithreading support for the biome generator has been removed completely, because it performed poorly on multiple threads for reasons that were unknown at the time.
After some painful investigation, the problem appears to be synchronisation. When one thread accesses the cache, it starts making recursive calls, and all other threads are blocked until those calls finish.
I have made a prototype and pushed it for testing. There is a huge performance improvement if we allocate a separate cache for each thread and remove all synchronisation. Unfortunately, this is still slower than the single-threaded version for some unknown reason. The more threads are used, the worse the performance is.
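To illustrate the per-thread cache idea, here is a minimal sketch, not the actual prototype. The names `ThreadCache`, `sampleBiome`, `generateChunkBiomes`, the layer count and the hash are all made up for illustration; the only point is that each caller owns its cache, so the recursive lookups need no lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int kLayers = 4;      // assumed depth of the layer stack (hypothetical)
constexpr int kBiomeCount = 8;  // assumed number of biome ids (hypothetical)

// One map per layer. An instance is owned by exactly one thread,
// so no mutex is needed around lookups or inserts.
struct ThreadCache {
    std::unordered_map<std::uint64_t, int> layer[kLayers + 1];
};

inline std::uint64_t packKey(std::int32_t x, std::int32_t z) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32)
         | static_cast<std::uint32_t>(z);
}

// Cheap deterministic hash standing in for a real layer's random source.
inline std::uint64_t hashCoord(std::int32_t x, std::int32_t z, int depth) {
    std::uint64_t h = packKey(x, z) ^ (0x9E3779B97F4A7C15ull * (depth + 1));
    h ^= h >> 33;
    h *= 0xFF51AFD7ED558CCDull;
    h ^= h >> 33;
    return h;
}

// Recursive layered sampling: each layer asks its parent layer at half resolution.
int sampleBiome(std::int32_t x, std::int32_t z, int depth, ThreadCache& cache) {
    assert(depth >= 0 && depth <= kLayers);
    const std::uint64_t key = packKey(x, z);
    auto& layer = cache.layer[depth];
    const auto it = layer.find(key);
    if (it != layer.end()) {
        return it->second;  // cache hit, no lock taken
    }
    int value;
    if (depth == 0) {
        value = static_cast<int>(hashCoord(x, z, 0) % kBiomeCount);
    } else {
        const int parent = sampleBiome(x >> 1, z >> 1, depth - 1, cache);
        // Occasionally perturb the parent's biome, otherwise inherit it.
        value = (hashCoord(x, z, depth) % 4 == 0) ? (parent + 1) % kBiomeCount : parent;
    }
    layer.emplace(key, value);
    return value;
}

// Each call builds its own cache, so calls running on different threads share nothing.
std::vector<int> generateChunkBiomes(int chunkX, int chunkZ, int size) {
    assert(size > 0);
    ThreadCache cache;
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(size) * size);
    for (int z = 0; z < size; ++z) {
        for (int x = 0; x < size; ++x) {
            out.push_back(sampleBiome(chunkX * size + x, chunkZ * size + z, kLayers, cache));
        }
    }
    return out;
}
```

Because every thread would hold its own `ThreadCache`, the total memory touched grows with the thread count, which fits the suspicion that the slowdown comes from exceeding the CPU cache rather than from locking.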
Some research suggests this might be because the working set exceeds the CPU cache: all threads are doing heavy recursion, so their accesses fall through to main memory reads and writes, which wrecks the performance.
In the end, I will discard my multithreading design and stick with the classic Minecraft single-threaded design.
However, if I manage to find a proper solution, I will come back to MT in the future :)
Issues
The biome generation algorithm was originally implemented on the device side. However, after some testing, the performance turned out to be really bad: the program is not runnable at all once the texture resolution exceeds 256^2. This is because CUDA is not well suited to critical sections, and warp divergence kills the performance.
Current mitigation
After that I discarded CUDA completely and moved the code to the host. I was planning to use one thread per chunk, so that all chunks are computed in parallel.
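The one-thread-per-chunk plan could look roughly like the sketch below. It is only an illustration: `generateChunk` is a hypothetical placeholder that fills a chunk with a trivial pattern instead of running the real layered algorithm, and `generateAllChunks` is an assumed name for the dispatcher.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

using BiomeMap = std::vector<int>;

// Hypothetical placeholder for the real per-chunk biome generator.
// It only depends on its arguments, so chunks can be computed independently.
BiomeMap generateChunk(int chunkX, int chunkZ, int size) {
    assert(size > 0);
    BiomeMap map(static_cast<std::size_t>(size) * size);
    for (int z = 0; z < size; ++z) {
        for (int x = 0; x < size; ++x) {
            const int worldX = chunkX * size + x;
            const int worldZ = chunkZ * size + z;
            map[static_cast<std::size_t>(z) * size + x] = (worldX + 31 * worldZ) & 7;
        }
    }
    return map;
}

// Launch one asynchronous task per chunk, then collect the maps in row-major chunk order.
std::vector<BiomeMap> generateAllChunks(int chunksPerSide, int size) {
    std::vector<std::future<BiomeMap>> jobs;
    jobs.reserve(static_cast<std::size_t>(chunksPerSide) * chunksPerSide);
    for (int cz = 0; cz < chunksPerSide; ++cz) {
        for (int cx = 0; cx < chunksPerSide; ++cx) {
            jobs.push_back(std::async(std::launch::async, generateChunk, cx, cz, size));
        }
    }
    std::vector<BiomeMap> result;
    result.reserve(jobs.size());
    for (auto& job : jobs) {
        result.push_back(job.get());  // blocks until that chunk is done
    }
    return result;
}
```

For example, `generateAllChunks(3, 512)` would produce nine 512^2 maps, one task each. `std::launch::async` forces a separate thread per task; a thread pool would be the more usual choice if the chunk count were large.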
Biome generation has now been moved to the host side. Multithreading support on the host side was also removed in the recent release. Currently I have no idea why multithreading performs so much worse than a single thread; it may have something to do with the cache.
A quick test generating (512^2)*9 biome maps, with an algorithm similar to the one used in Minecraft, shows:
(Compiler optimisation set to maximise speed; cache size of 8192 bytes for each layer; each thread is assigned its own cache, so no critical section is needed, performance-wise)