pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

extend torch.distributed's `--tee` to log the nodename #75087

Open stas00 opened 2 years ago

stas00 commented 2 years ago

🚀 The feature, motivation and pitch

I'm currently using pt-1.11

This is a request to improve --tee 3 logging. Here is an example of the current log:

[default3]:python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
[default3]:Fatal Python error: Segmentation fault

Currently --tee 3 is somewhat useful: it prefixes each log line with [default{local_rank}] when used in:

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --tee 3 \
    "

This is a great step towards more usable logging and troubleshooting, but more is needed, please:

Please make it log nodename:rank: instead, via a new flag or the existing one.

Of course, any way you choose is satisfactory, but if I had a say I'd format it as f"{node}:{rank}: " (note the trailing space). Explanation:

The current log:

[default3]:python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
[default3]:Fatal Python error: Segmentation fault

becomes:

r12i0n8:3: python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
r12i0n8:3: Fatal Python error: Segmentation fault

and now we know the exact node:rank that caused the problem and can act on it. Currently I am debugging a segfault and I don't know how to find the node it crashed on, since the whole c10d logging vanishes completely on segfault. No root cause, nor any of the other logging that normally appears when there is a Python assertion. It just ends with:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 269614 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 269615) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
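
Until something like this is supported, a possible user-side workaround is a small stream wrapper at the top of the training script. This is only a minimal sketch, not a torch.distributed feature; it assumes the RANK env var that torch.distributed.run sets for each worker:

# Hypothetical workaround, not part of torch.distributed: prefix every
# Python-level output line with "hostname:rank: " so output can be traced
# back to a node. RANK is set by torch.distributed.run for each worker.
import os
import socket
import sys

class PrefixedStream:
    """Wrap a stream and prepend a prefix at the start of every line."""

    def __init__(self, stream, prefix):
        self.stream = stream
        self.prefix = prefix
        self.at_line_start = True

    def write(self, text):
        for chunk in text.splitlines(keepends=True):
            if self.at_line_start:
                self.stream.write(self.prefix)
            self.stream.write(chunk)
            self.at_line_start = chunk.endswith("\n")

    def flush(self):
        self.stream.flush()

prefix = f"{socket.gethostname()}:{os.environ.get('RANK', '?')}: "
sys.stdout = PrefixedStream(sys.stdout, prefix)
sys.stderr = PrefixedStream(sys.stderr, prefix)

This only covers writes that go through Python's sys.stdout/sys.stderr; output emitted at the C level (such as the segfault report above) bypasses it, which is why a prefix added by the launcher would be the more robust solution.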

And a bonus request: could the --tee option be documented at https://pytorch.org/docs/stable/elastic/run.html and/or https://pytorch.org/docs/stable/distributed.html?

Thank you so much!

@cbalioglu

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

vadimkantorov commented 2 years ago

Also supporting the request for documenting --tee. It would be useful for debugging DDP issues, especially given that wandb by default applies some hacks to avoid printing from non-main ranks: https://github.com/wandb/wandb/issues/3299#issuecomment-1055745184

vadimkantorov commented 2 years ago

In general, having the launcher duplicate or redirect each rank's stdout/stderr to a corresponding per-rank file may be a useful feature.
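
A minimal sketch of what that could look like from user code on a Unix system, assuming the RANK env var that torch.distributed.run sets for each worker (the logs/ path is illustrative): pipe the worker's own stdout/stderr through tee at the file-descriptor level, so the console copy is kept while a per-rank file also captures C-level output such as segfault reports.

# Hypothetical per-rank duplication, not a torch.distributed API: route this
# worker's stdout/stderr through `tee` so everything still reaches the
# console while also being appended to logs/rank_<RANK>.log.
import os
import subprocess
import sys

rank = os.environ.get("RANK", "0")
os.makedirs("logs", exist_ok=True)

tee = subprocess.Popen(["tee", "-a", f"logs/rank_{rank}.log"],
                       stdin=subprocess.PIPE)
os.dup2(tee.stdin.fileno(), sys.stdout.fileno())
os.dup2(tee.stdin.fileno(), sys.stderr.fileno())

Having the launcher do this natively would avoid touching every training script.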

thomasw21 commented 1 year ago

I just discovered --role, which does exactly this:

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --role $NODENAME: \
    --tee 3 \
    "

Note the trailing `:` in `--role $NODENAME:`. This prints logs like:

[$NODENAME:1] blablabla
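
(For this to work, $NODENAME presumably has to be set in the job script before LAUNCHER is built, e.g. via NODENAME=$(hostname); as far as I know torch.distributed.run does not set such a variable itself.)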