Open BitCalSaul opened 6 months ago
@BitCalSaul
Okay, I'll fix that.
@BitCalSaul
You can try the latest commit. I removed the `--interpreter` argument from the main script. Now we only need to store the full commands in config.txt and run the main script.
Thanks for your effort. I tried it but got some errors. I put these commands in the txt file (for clarification, I only have two GPUs):

torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py epochs=2
torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py epochs=1

It seems like RunIt didn't recognize that each command will consume 2 GPUs based on the argument --nproc-per-node=2, and ran these two commands together, which leads to an OOM problem.
@BitCalSaul
Unfortunately I don't have a multi-card machine in my hands to test at the moment.
Please make sure that the machine itself has the ability to run both commands at the same time. The logic for determining remaining GPU memory is very simple in the current version, because a PyTorch program needs to run for a while before its memory usage stabilizes. So I recommend you first experiment independently to see whether both commands can run at the same time without reporting an OOM error.
If you determine that they can run simultaneously when launched standalone, but RunIt cannot schedule them across multiple GPUs, then there is some bug that I'm not aware of.
I think these two DDP commands cannot be run at the same time. My expectation is to run the first DDP command first, which will use all the memory of both GPUs, and once that task is finished, the second DDP command will start. But now it seems like RunIt just runs the first command on the first GPU and the second on the second GPU, which will surely lead to OOM.
@BitCalSaul
My expectation is to run the first DDP command first, which will use all the memory of both GPUs, and once that task is finished, the second DDP command will start.
It can be implemented with a simple shell script; a similar demo is shown in another repo of mine:
https://github.com/lartpang/ZoomNet/blob/9c65e6ca8c5ec2f23c4ad0c0413881f78546d4f8/test.sh#L1-L22
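The idea in that demo can be sketched as a minimal sequential runner (an illustration only, not RunIt's code; the config file name and format are assumptions):

```shell
#!/usr/bin/env bash
# Minimal sequential runner: read commands from a config file and run
# them one at a time, so each multi-GPU job has the whole machine to
# itself before the next one starts.
run_sequentially() {
    local config="$1"
    while IFS= read -r cmd; do
        [ -z "$cmd" ] && continue      # skip blank lines
        echo "running: $cmd"
        eval "$cmd"                    # blocks until this job finishes
    done < "$config"
}
```

With two torchrun lines in the config file, the second job only starts after the first releases both GPUs.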
But now it seems like RunIt just runs the first command on the first GPU and the second on the second GPU, which will surely lead to OOM.
This may be because the two commands are running at the same time, resulting in OOM. You can try printing the actual GPUs each program sees.
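One quick way to check this, assuming RunIt restricts visibility through the standard CUDA_VISIBLE_DEVICES variable, is to print that variable from inside the launched program:

```python
import os

# Inside the launched program: report which GPUs this process can see.
# An unset value means every GPU on the machine is visible to it.
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
print("CUDA_VISIBLE_DEVICES =", visible if visible is not None else "<unset>")
```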
The new commit has some issues. With this code, the two commands will run on the same GPU even if I set the GPU pool to [0,1]:

python /home/user/Compressor/main.py epochs=50
python /home/user/Compressor/main.py epochs=50

The reason may be that my code's default GPU is GPU 0, so these two commands both run on that GPU.
I have to specify the GPU number explicitly, and then it runs properly:

python /home/jwq/Compressor/main.py epochs=50 dividor_value=7000 dgroup_id=0 gpu=[0]
python /home/jwq/Compressor/main.py epochs=50 dividor_value=7000 dgroup_id=1 gpu=[1]
In the previous commit, even if I didn't specify the GPU number, RunIt could still dynamically select a GPU for my command.
@BitCalSaul
Indeed, this is a newly introduced bug.
I think I understand what the original problem was. RunIt uses an environment variable to assign the actual GPUs to be used:
https://github.com/lartpang/RunIt/blob/056b1cfd5aa1cf2c4beea192868c5c02463ce6d3/run_it.py#L63
but by default only a single GPU can be assigned. This makes it incompatible with multi-card programs like torchrun.
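The single-GPU scheduling idea can be sketched like this (a simplified illustration, not RunIt's actual code; see the linked line for the real implementation, and the variable name CUDA_VISIBLE_DEVICES is an assumption):

```python
import os
import subprocess

def launch_on_gpu(cmd: str, gpu_id: int) -> subprocess.Popen:
    # Expose exactly one GPU to the child process. A torchrun job that
    # asks for two processes per node then cannot find a second device,
    # which is the incompatibility described above.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.Popen(cmd, shell=True, env=env)
```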
Perhaps there should be a more flexible way to configure this, such as specifying both the number of commands and GPUs, and perhaps emphasizing sequential numbering?
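One possible reading of that suggestion: let each line of config.txt declare how many GPUs it needs, so the scheduler can reserve a matching set of IDs. A hypothetical format sketch (not implemented in RunIt):

```python
def parse_config_line(line: str) -> tuple[int, str]:
    # Hypothetical "<num_gpus>|<command>" format; a plain line without
    # the separator defaults to a single GPU.
    head, sep, rest = line.strip().partition("|")
    if sep and head.isdigit():
        return int(head), rest.strip()
    return 1, line.strip()
```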
I'm a little confused here: do you mean that at the moment there is some workaround for running several multi-card programs via a script?
@BitCalSaul
In the current version, multi-card support is not available for the time being. This is because multiple cards require multiple GPU numbers to be assigned to the application at the same time, which is incompatible with the current scheduling method based on a single GPU number. It may take some time to improve the code. Of course, you are also welcome to submit better ideas.
Thanks, it seems that for now we can only use it for single-GPU programs.
@BitCalSaul
I recently refactored this tool based on a process pool and a cross-process communication manager!
More details can be found in the dev branch: https://github.com/lartpang/RunIt/tree/dev
Feel free to use it and give feedback.
@lartpang oh that's so great bro. I will try it once I am available (busy with paper writing recently). This tool is really useful for my work :)
Hi, thank you for your effort; this is a really good tool! I tried to use it for DDP, but the command cannot be recognized. I'm wondering if you could add support for it. This is the command in config.txt:

torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py