pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

extend torch.distributed's `--tee` to log the nodename #75087

Open stas00 opened 2 years ago

stas00 commented 2 years ago

🚀 The feature, motivation and pitch

I'm currently using pt-1.11

This is a request to improve --tee 3 logging. Here is an example of the current log:

[default3]:python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
[default3]:Fatal Python error: Segmentation fault

Currently --tee 3 is somewhat useful: it prefixes each log line with [default{local_rank}] when used in:

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --tee 3 \
    "

This is a great step towards more usable logging and troubleshooting, but more is needed, please:

Please make it log nodename:rank: instead, via a new flag or the existing one.

Of course, any way you choose is satisfactory, but if I had a say I'd format it as f"{node}:{rank}: " (note the trailing space). Explanation:

The current log:

[default3]:python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
[default3]:Fatal Python error: Segmentation fault

becomes:

r12i0n8:3: python: src/psm2_nccl_net.c:756: mq_progress_loop: Assertion `r->used' failed.
r12i0n8:3: Fatal Python error: Segmentation fault

and now we know the exact node:rank that caused the problem and can act on it. Currently I am debugging a segfault and I don't know how to find the node it crashed on, since the whole c10d logging vanishes completely on segfault. No root cause, nor any of the other logging that normally appears when there is a Python assertion. It just ends with:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 269614 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 269615) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
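
Until something like this is supported, a possible user-side workaround is a small stream wrapper at the top of the training script. This is only a minimal sketch, not a torch.distributed feature; it assumes the RANK env var that torch.distributed.run sets for each worker:

# Hypothetical workaround, not part of torch.distributed: prefix every
# Python-level output line with "hostname:rank: " so output can be traced
# back to a node. RANK is set by torch.distributed.run for each worker.
import os
import socket
import sys

class PrefixedStream:
    """Wrap a stream and prepend a prefix at the start of every line."""

    def __init__(self, stream, prefix):
        self.stream = stream
        self.prefix = prefix
        self.at_line_start = True

    def write(self, text):
        for chunk in text.splitlines(keepends=True):
            if self.at_line_start:
                self.stream.write(self.prefix)
            self.stream.write(chunk)
            self.at_line_start = chunk.endswith("\n")

    def flush(self):
        self.stream.flush()

prefix = f"{socket.gethostname()}:{os.environ.get('RANK', '?')}: "
sys.stdout = PrefixedStream(sys.stdout, prefix)
sys.stderr = PrefixedStream(sys.stderr, prefix)

This only covers writes that go through Python's sys.stdout/sys.stderr; output emitted at the C level (such as the segfault report above) bypasses it, which is why a prefix added by the launcher would be the more robust solution.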

And a bonus request: could the --tee option be documented at https://pytorch.org/docs/stable/elastic/run.html and/or https://pytorch.org/docs/stable/distributed.html?

Thank you so much!

@cbalioglu

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

vadimkantorov commented 2 years ago

Also supporting the request for documenting --tee. It would be useful for debugging DDP issues, especially given that wandb by default applies some hacks to avoid printing from non-main ranks: https://github.com/wandb/wandb/issues/3299#issuecomment-1055745184

vadimkantorov commented 2 years ago

In general, having the launcher duplicate or redirect each rank's stdout/stderr to a corresponding per-rank file may be a useful feature.
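
A minimal sketch of what that could look like from user code on a Unix system, assuming the RANK env var that torch.distributed.run sets for each worker (the logs/ path is illustrative): pipe the worker's own stdout/stderr through tee at the file-descriptor level, so the console copy is kept while a per-rank file also captures C-level output such as segfault reports.

# Hypothetical per-rank duplication, not a torch.distributed API: route this
# worker's stdout/stderr through `tee` so everything still reaches the
# console while also being appended to logs/rank_<RANK>.log.
import os
import subprocess
import sys

rank = os.environ.get("RANK", "0")
os.makedirs("logs", exist_ok=True)

tee = subprocess.Popen(["tee", "-a", f"logs/rank_{rank}.log"],
                       stdin=subprocess.PIPE)
os.dup2(tee.stdin.fileno(), sys.stdout.fileno())
os.dup2(tee.stdin.fileno(), sys.stderr.fileno())

Having the launcher do this natively would avoid touching every training script.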

thomasw21 commented 1 year ago

I just discovered --role, which does exactly this:

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --role $NODENAME: \
    --tee 3 \
    "

Note the trailing `:` in `--role $NODENAME:`. This prints logs like:

[$NODENAME:1] blablabla
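
(For this to work, $NODENAME presumably has to be set in the job script before LAUNCHER is built, e.g. via NODENAME=$(hostname); as far as I know torch.distributed.run does not set such a variable itself.)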