When running legateboost on multiple GPUs, the legate runtime maps follow-up cunumeric tasks to individual instances even though the data is distributed. The issue is described and discussed in more detail here.
This degrades performance because some portions of the code are executed sequentially.
The current workaround is to run distributed code with the environment variable LEGATE_MIN_GPU_CHUNK=1 or LEGATE_TEST=1 set.
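One way to apply the workaround from a Python launcher is sketched below. This is only an illustration: the helper names `workaround_env` and `run_with_workaround` are hypothetical and not part of legateboost or legate, and the demo command is a placeholder that prints the variable rather than a real launch command. The sketch assumes it is enough for the variable to be present in the environment of the process that starts the distributed script.

```python
import os
import subprocess
import sys


def workaround_env(base_env=None, use_test_mode=False):
    """Return a copy of the environment with one workaround variable set.

    Hypothetical helper, not part of legateboost or legate.
    """
    env = dict(os.environ if base_env is None else base_env)
    if use_test_mode:
        env["LEGATE_TEST"] = "1"
    else:
        env["LEGATE_MIN_GPU_CHUNK"] = "1"
    return env


def run_with_workaround(command, use_test_mode=False):
    """Run `command` (a list of strings) as a child process with the variable set."""
    env = workaround_env(use_test_mode=use_test_mode)
    return subprocess.run(command, env=env, check=True)


if __name__ == "__main__":
    # Placeholder command: it only prints the variable the child process sees.
    # Replace it with the actual launch command for the distributed script.
    demo = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('LEGATE_MIN_GPU_CHUNK'))",
    ]
    run_with_workaround(demo)
```

Setting the variable on a copy of the environment keeps the change local to the launched process, so other jobs started from the same shell are unaffected.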