plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Scalene doesn't work properly with torchrun / torch.distributed.run #823

Open deo-abhijit opened 4 months ago

deo-abhijit commented 4 months ago

I got an error while running Scalene with torch.distributed.run.

I am currently following this doc:

```
python -m torch.distributed.run --nproc_per_node=8 --master_port=2333 tools/train.py projects/configs/VAD/VAD_base.py --launcher pytorch --deterministic --work-dir path/to/save/outputs
```

This command runs perfectly, but when I replace `python -m` with `scalene`, it raises an error. I think the main issue is that my train_mz.py takes other arguments as input from the command line, and Scalene is probably passing them as arguments to the torch.distributed.run.main() function.

Although this is just speculation.
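
To illustrate the kind of clash I suspect (a hypothetical sketch of nested argument parsing, not Scalene's actual code):

```python
import argparse

# Hypothetical sketch (not Scalene's real parser): an outer tool that
# parses the whole command line, instead of stopping at the first
# non-option argument, sees the inner program's flags as its own.
outer = argparse.ArgumentParser(prog="outer-tool")
outer.add_argument("--cpu", action="store_true")

argv = ["tools/train.py", "--launcher", "pytorch", "--deterministic"]

# parse_args(argv) would fail with "unrecognized arguments";
# parse_known_args shows which flags get misrouted instead.
known, leftover = outer.parse_known_args(argv)
print(leftover)  # ['tools/train.py', '--launcher', 'pytorch', '--deterministic']
```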

There is also a very similar Stack Overflow question along exactly these lines.

It would be really nice if someone could help me out here. Thanks.

emeryberger commented 4 months ago

You can use `---` to tell Scalene to stop processing arguments (so put all Scalene arguments first, then `---`, then any other arguments), but I suspect this will not fix the problem. Please give it a try, though.
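
In other words, the general shape is (with `your_prog.py` and its flags standing in as placeholders):

```
scalene [scalene-options] --- your_prog.py [your-program-arguments]
```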

emeryberger commented 4 months ago

You might also try specifying `--cpu` to help isolate the issue (if it works, that tells us something).
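
Combined with the separator above, the attempt would look roughly like this (a sketch; everything after the `---` is whatever you were already passing):

```
scalene --cpu --- [rest of your command]
```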

deo-abhijit commented 4 months ago

> You can use `---` to tell Scalene to stop processing arguments (so put all Scalene arguments first, then `---`, then any other arguments), but I suspect this will not fix the problem. Please give it a try, though.

Actually, I had already tried this as well; the person who asked the Stack Overflow question tried it too.

But it still gave an error.