Open BitCalSaul opened 6 months ago
@BitCalSaul
Okay, I'll fix that.
@BitCalSaul
You can try the latest commit. I removed the `--interpreter` argument from the main script. Now we only need to store the full commands in config.txt and run the main script.
Thanks for your effort. I tried it but got some errors. I put these commands in the txt file (for clarification, I only have two GPUs):

torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py epochs=2
torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py epochs=1

It seems like RunIt didn't recognize that each command will consume 2 GPUs based on the argument --nproc-per-node=2, and ran these two commands together, which leads to an OOM problem.
@BitCalSaul
Unfortunately I don't have a multi-card machine in my hands to test at the moment.
Please make sure that the machine itself has the ability to run both commands at the same time. The logic for determining remaining GPU memory is very simple in the current version, because a PyTorch program needs to run for a while before its memory usage stabilizes. So I recommend you first experiment independently to see whether both commands can run at the same time without reporting an OOM error.
If you determine that they can run simultaneously when launched standalone, but RunIt cannot schedule them across multiple GPUs, then there is some bug that I'm not aware of.
I think these two DDP commands cannot be run at the same time. My expectation is to run the first DDP command first, which will use all the memory of both GPUs, and once that task is finished, the second DDP command will start. But now it seems like RunIt just runs the first command on the first GPU and the second on the second GPU, which will surely lead to OOM.
@BitCalSaul
My expectation is to run the first DDP command first, which will use all the memory of both GPUs, and once that task is finished, the second DDP command will start.
It can be implemented with a simple shell script; a similar demo is shown in another repo of mine:
https://github.com/lartpang/ZoomNet/blob/9c65e6ca8c5ec2f23c4ad0c0413881f78546d4f8/test.sh#L1-L22
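The idea in that demo can be sketched as a minimal sequential runner (an illustration only, not RunIt's code; the config file name and format are assumptions):

```shell
#!/usr/bin/env bash
# Minimal sequential runner: read commands from a config file and run
# them one at a time, so each multi-GPU job has the whole machine to
# itself before the next one starts.
run_sequentially() {
    local config="$1"
    while IFS= read -r cmd; do
        [ -z "$cmd" ] && continue      # skip blank lines
        echo "running: $cmd"
        eval "$cmd"                    # blocks until this job finishes
    done < "$config"
}
```

With two torchrun lines in the config file, the second job only starts after the first releases both GPUs.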
But now it seems like RunIt just runs the first command on the first GPU and the second on the second GPU, which will surely lead to OOM.
This may be because the two commands are running at the same time, resulting in OOM. You can try printing the actual GPUs each program sees.
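One quick way to check this, assuming RunIt restricts visibility through the standard CUDA_VISIBLE_DEVICES variable, is to print that variable from inside the launched program:

```python
import os

# Inside the launched program: report which GPUs this process can see.
# An unset value means every GPU on the machine is visible to it.
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
print("CUDA_VISIBLE_DEVICES =", visible if visible is not None else "<unset>")
```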
The new commit has some issues. With this code, the two commands will run on the same GPU even if I set the GPU pool to [0,1]:

python /home/user/Compressor/main.py epochs=50
python /home/user/Compressor/main.py epochs=50

The reason may be that my code's default GPU is GPU 0, so these two commands both run on that GPU.
I have to specify the GPU number explicitly, and then it runs properly:

python /home/jwq/Compressor/main.py epochs=50 dividor_value=7000 dgroup_id=0 gpu=[0]
python /home/jwq/Compressor/main.py epochs=50 dividor_value=7000 dgroup_id=1 gpu=[1]
In the previous commit, even if I didn't specify the GPU number, RunIt could still dynamically select a GPU for my command.
@BitCalSaul
Indeed, this is a newly introduced bug.
I think I understand what the original problem was. RunIt uses an environment variable to assign the actual GPUs to be used:
https://github.com/lartpang/RunIt/blob/056b1cfd5aa1cf2c4beea192868c5c02463ce6d3/run_it.py#L63
but by default only a single GPU can be assigned. This makes it incompatible with multi-card programs like torchrun.
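The single-GPU scheduling idea can be sketched like this (a simplified illustration, not RunIt's actual code; see the linked line for the real implementation, and the variable name CUDA_VISIBLE_DEVICES is an assumption):

```python
import os
import subprocess

def launch_on_gpu(cmd: str, gpu_id: int) -> subprocess.Popen:
    # Expose exactly one GPU to the child process. A torchrun job that
    # asks for two processes per node then cannot find a second device,
    # which is the incompatibility described above.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.Popen(cmd, shell=True, env=env)
```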
Perhaps there should be a more flexible way to configure this, such as specifying both the number of commands and GPUs, and perhaps emphasizing sequential numbering?
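One possible reading of that suggestion: let each line of config.txt declare how many GPUs it needs, so the scheduler can reserve a matching set of IDs. A hypothetical format sketch (not implemented in RunIt):

```python
def parse_config_line(line: str) -> tuple[int, str]:
    # Hypothetical "<num_gpus>|<command>" format; a plain line without
    # the separator defaults to a single GPU.
    head, sep, rest = line.strip().partition("|")
    if sep and head.isdigit():
        return int(head), rest.strip()
    return 1, line.strip()
```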
I'm a little confused here: do you mean that at the moment there is some workaround for running several multi-card programs via a script?
@BitCalSaul
In the current version, multi-card support is not available for the time being. This is because multiple cards require multiple GPU numbers to be assigned to the application at the same time, which is incompatible with the current scheduling method based on a single GPU number. It may take some time to improve the code. Of course, you are also welcome to submit better ideas.
Thanks, it seems that for now we can only use it for single-GPU programs.
@BitCalSaul
I recently refactored this tool based on a process pool and a cross-process communication manager!
More details can be found in the dev branch: https://github.com/lartpang/RunIt/tree/dev
Feel free to use it and give feedback.
@lartpang oh that's so great bro. I will try it once I am available (busy with paper writing recently). This tool is really useful for my work :)
Hi, thank you for your effort; this is a really good tool! I tried to use it for DDP, but the command cannot be recognized. I'm wondering if you could add support for it. This is the command in config.txt:

torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py