Open stas00 opened 2 years ago
Also supporting the request for documenting `--tee`. It would be useful for debugging DDP issues, especially given that wandb by default does some hacks to avoid printing from non-main ranks: https://github.com/wandb/wandb/issues/3299#issuecomment-1055745184
More generally, a way to duplicate or redirect stdout/stderr into per-rank files could be a useful feature for the launcher.
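Until the launcher offers this, the idea of duplicating stdout/stderr into per-rank files can be approximated inside the training script. The following is only a minimal sketch, not part of PyTorch: `Tee`, `rank_log_path` and `tee_std_streams` are hypothetical helper names, and the file naming scheme is an assumption. It relies on the `RANK` environment variable that the launcher exports to each worker.

```python
import os
import sys


class Tee:
    """Minimal write-only stream that forwards each write to several streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


def rank_log_path(log_dir, stream_name):
    # RANK is exported by the launcher for each worker; fall back to "0"
    # when the script is run without the launcher.
    rank = os.environ.get("RANK", "0")
    return os.path.join(log_dir, f"rank{rank}.{stream_name}.log")


def tee_std_streams(log_dir):
    """Duplicate this process's Python-level stdout/stderr into per-rank files."""
    os.makedirs(log_dir, exist_ok=True)
    # buffering=1 selects line buffering so lines reach the file promptly.
    out_file = open(rank_log_path(log_dir, "stdout"), "a", buffering=1)
    err_file = open(rank_log_path(log_dir, "stderr"), "a", buffering=1)
    sys.stdout = Tee(sys.stdout, out_file)
    sys.stderr = Tee(sys.stderr, err_file)
    return out_file, err_file
```

Calling `tee_std_streams("logs")` early in the script would mirror later `print` output into `logs/rank<N>.stdout.log`. Note that this only captures writes made through Python's `sys.stdout`/`sys.stderr`; output written directly to the file descriptors by native code, such as a segfault message, is not captured this way.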
I just discovered `--role`, which does exactly this:
export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --role $NODENAME: \
    --tee 3 \
    "  # Notice the trailing `:` after $NODENAME in the --role value
This prints logs with each line prefixed by the node name and local rank:
[$NODENAME:1] blablabla
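Once every line carries a `[$NODENAME:rank]` prefix, a combined log can be filtered per worker. Here is a rough sketch of that, assuming the prefix has the shape shown above; the optional `:` after the closing bracket is an assumption about the exact output, and `split_prefix` and `lines_from` are hypothetical helpers, not launcher APIs.

```python
import re

# Assumed line shape: "[<node>:<local_rank>]", an optional ":", then the message.
PREFIX_RE = re.compile(r"^\[(?P<node>[^\]]*):(?P<rank>\d+)\]:?\s?(?P<msg>.*)$")


def split_prefix(line):
    """Return (node, local_rank, message), or None if the line has no such prefix."""
    match = PREFIX_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match["node"], int(match["rank"]), match["msg"]


def lines_from(log_lines, node, rank):
    """Yield only the messages emitted by the given node and local rank."""
    for line in log_lines:
        parsed = split_prefix(line)
        if parsed is not None and parsed[:2] == (node, rank):
            yield parsed[2]
```

For example, `list(lines_from(open("train.log"), "node-1", 1))` would collect the messages of local rank 1 on `node-1`, where `train.log` is a placeholder file name.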
🚀 The feature, motivation and pitch
I'm currently using pt-1.11
This is a request to improve `--tee 3` logging.
Currently `--tee 3` is somewhat useful, since it prefixes each log line with `[default{local_rank}]`. This is a great step towards more usable logging and troubleshooting, but more is needed.
Please make it log `nodename:rank:` via a new flag or the existing one.
Of course, any way you choose is satisfactory, but if I had a say I'd format it as:
f"{node}:{rank}: "
(note the trailing space). Explanation: `[]` is just noise and a waste of horizontal space, and `default` is a waste of horizontal space that contributes no useful information.
With that prefix we would know the exact node:rank that caused the problem and could act on it. Currently I am debugging a segfault and I don't know how to find the node it crashed on, since the whole c10d logging vanishes completely on segfault. No `Root cause` entry nor any other logging appears, although it is normally there when there is a Python assertion.
And a bonus request: could the `--tee` option be documented at https://pytorch.org/docs/stable/elastic/run.html and/or https://pytorch.org/docs/stable/distributed.html?
Thank you so much!
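To make the requested format concrete, here is a small sketch of how the `f"{node}:{rank}: "` prefix could be built and applied from inside a worker. It is an illustration only, not how the launcher implements `--tee`: `log_prefix` and `prefix_lines` are hypothetical helpers, the node name is assumed to come from `socket.gethostname()`, and the rank is assumed to be the global `RANK` exported by the launcher (`LOCAL_RANK` could be used instead).

```python
import os
import socket


def log_prefix(node=None, rank=None):
    """Build the prefix proposed in the issue: f"{node}:{rank}: "."""
    if node is None:
        # Assumption: the host name is a good enough node identifier.
        node = socket.gethostname()
    if rank is None:
        # Assumption: use the global rank exported by the launcher.
        rank = os.environ.get("RANK", "0")
    return f"{node}:{rank}: "


def prefix_lines(text, prefix):
    """Prepend the prefix to every line of text, keeping line endings."""
    return "".join(prefix + line for line in text.splitlines(keepends=True))
```

With this, `prefix_lines("step 1\nstep 2\n", log_prefix("node-3", 5))` yields lines starting with `node-3:5: `, without the brackets or the `default` role name.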
@cbalioglu
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang