mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0
5.06k stars 407 forks source link

Simplify launcher world size parsing #3398

Closed mvpatel2000 closed 1 week ago

mvpatel2000 commented 2 weeks ago

What does this PR do?

Simplify launcher world size parsing. Now, composer -n correctly enables running on fewer GPUs when iterating on a single machine.

eracah commented 2 weeks ago

Any way we can do a test for this?

mvpatel2000 commented 2 weeks ago

Any way we can do a test for this?

Unfortunately not easily since pytest runs in a single instance and this would require a separate launch. But ill include a manual test

mvpatel2000 commented 2 weeks ago

can someone put a request changes hold to block during codefreeze? @dakinggg ?

bigning commented 2 weeks ago

I'm curious, why set env_var WORLD_SIZE for single node in the first place? It breaks the CPU test python -m pytest in interactive if there are mutliple GPUs. Error is missing RANK in env_var. This PR doesn't fix that, because here it's not 1. I had to unset the WORLD_SIZE

mvpatel2000 commented 2 weeks ago

I'm curious, why set env_var WORLD_SIZE for single node in the first place?

was bug in mcloud, it should set all env vars even if on single node

It breaks the CPU test python -m pytest

will investigate

mvpatel2000 commented 1 week ago

It breaks the CPU test python -m pytest

will investigate

Discussed offline, if you dont use the launcher it will still use the WORLD_SIZE env var which we cannot fix. This break is not related to this PR, you must use composer -n 1 pytest...