pengfei-luo closed this issue 7 months ago.
I have the same issue. Running rm -rf ~/.triton resolves it for me.
Hello @pengfei-luo @fffffarmer, this happens because DeepSpeed saves the Triton autotune cache when it exits, and since your home directory is on an NFS mount, that save can be slow. The workaround is to set the TRITON_CACHE_DIR environment variable to point to your local hard disk.
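For example, a minimal sketch of redirecting the cache to local storage before DeepSpeed and Triton are imported; the path /tmp/triton_cache is only a placeholder, and for multi-process runs it is safest to export the variable in the launcher's shell environment so every rank inherits it:

import os

# Point the Triton autotune cache at local disk instead of the NFS home dir.
# "/tmp/triton_cache" is a placeholder; any directory on local storage works.
os.environ.setdefault("TRITON_CACHE_DIR", "/tmp/triton_cache")
os.makedirs(os.environ["TRITON_CACHE_DIR"], exist_ok=True)

import deepspeed  # import only after the environment variable is set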
Disabling the autotune cache, whether by commenting out the save or by deleting the cache, will not affect training quality, but it will likely add a 1-3 minute delay when you start or resume training, because Triton has to autotune from scratch.
We plan to add a warning explaining this when we detect that the current cache dir is on a different file system.
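As an illustration only (not how DeepSpeed will implement the warning), one way such a check could work on Linux is to map the cache directory to its mount point via /proc/mounts and compare the filesystem type against a set of known network filesystems:

import os

NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "glusterfs", "beegfs"}

def cache_dir_on_network_fs(path):
    """Best-effort check (Linux only) whether `path` lives on a network filesystem."""
    path = os.path.realpath(path)
    best, fstype = "", None
    with open("/proc/mounts") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            mount_point, mount_fstype = parts[1], parts[2]
            # Keep the longest (most specific) mount point that is a prefix of path.
            if path == mount_point or path.startswith(mount_point.rstrip("/") + "/"):
                if len(mount_point) > len(best):
                    best, fstype = mount_point, mount_fstype
    return fstype in NETWORK_FS

if cache_dir_on_network_fs(os.path.expanduser("~/.triton")):
    print("Warning: the Triton cache dir appears to be on a network filesystem; "
          "consider setting TRITON_CACHE_DIR to local storage.")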
Describe the bug
When training with DeepSpeed using the ZeRO-2 configuration, the program got stuck and did not exit after training finished. I had to use Ctrl+C to end the process.
The traceback showed the process stuck at
File "/data2/pfluo/micromamba/envs/torch220/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 66, in put
    with FileLock(self.lock_path):
I printed the filelock log and found that it was repeatedly trying to acquire the lock on the file.
I modified the file deepspeed/ops/transformer/inference/triton/matmul_ext.py by commenting out lines 66-69, and the program terminated properly: https://github.com/microsoft/DeepSpeed/blob/aed599b4422b1cdf7397abb05a58c3726523a333/deepspeed/ops/transformer/inference/triton/matmul_ext.py#L66-L69
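For context, the code in question persists the autotune table to disk under a file lock. Below is a hedged sketch of that general pattern, not the actual DeepSpeed implementation (the function and variable names are illustrative), showing how a bounded lock timeout would skip the cache save instead of hanging forever on a misbehaving NFS lock:

from filelock import FileLock, Timeout

def save_autotune_cache(cache_file, payload, timeout_s=60):
    """Illustrative pattern: write a cache file under a file lock, but give up
    after timeout_s seconds rather than blocking forever (e.g. on a flaky NFS lock)."""
    lock_path = cache_file + ".lock"
    try:
        with FileLock(lock_path, timeout=timeout_s):
            with open(cache_file, "w") as f:
                f.write(payload)
    except Timeout:
        # Skip persisting the cache instead of hanging at exit; the only cost
        # is re-running Triton autotune on the next start.
        print(f"Could not acquire {lock_path} within {timeout_s}s; skipping cache save.")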
I also found that the process got stuck while executing ds_report. I'm not sure whether this workaround has any impact on the training process. I'm also using an NFS filesystem, and I'm not sure whether that affects the filelock and in turn triggers this error.

To Reproduce
Steps to reproduce the behavior:
Expected behavior
The training procedure should terminate normally.
ds_report output
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Launcher context
I tried both deepspeed and torchrun.