Closed HongLouyemeng closed 6 months ago
When the same code runs under torch.multiprocessing, the lldb debugger does not work.
import os
import time
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
from torch.distributed._tensor import DTensor, DeviceMesh, Shard, Replicate, distribute_tensor, zeros


def run(rank, size):
    a = torch.tensor([[0, 2.], [3, 0]])
    a.neg()


def init_process(rank_id, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '12347'
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group(backend, rank=rank_id, world_size=size)
    fn(rank_id, size)


if __name__ == "__main__":
    big_tensor = torch.arange(0, 16).reshape(4, 4)
    size = 1
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
@HongLouyemeng Hi.
In our experience, after setting up the environment for running .py scenarios, you should build oneCCL with a debug build type, for example:
cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_BUILD_TYPE=debug && make -j && make install
You can try native gdb:
mpiexec -n 2 gdb -ex run --args python test.py
Or try Intel MPI's gdb tool:
mpiexec -n 2 -gdb python test.py
You can then set a breakpoint in oneCCL's code, which should help.
If you hit an issue and want a backtrace, run this command in the gdb console:
thread apply all bt
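A possible alternative for the torch.multiprocessing script in this issue, which nobody in the thread suggested or tested: a debugger started on the parent Python process does not necessarily follow the children created with the "spawn" start method. One common workaround is to pause each child early and attach to it by PID. The helper name `wait_for_debugger` and the `WAIT_FOR_DEBUGGER` environment variable below are hypothetical and chosen only for illustration.

```python
import os
import time


def wait_for_debugger(timeout_s=60.0, poll_s=0.5, flag_env="WAIT_FOR_DEBUGGER"):
    """Hypothetical helper: pause this process so a debugger can attach by PID.

    Does nothing unless the (assumed) environment variable is set to "1".
    Returns True if it paused, False otherwise.
    """
    if os.environ.get(flag_env) != "1":
        return False
    # Print the PID so it can be passed to the debugger from another terminal.
    print(f"[pid {os.getpid()}] pausing up to {timeout_s}s for a debugger", flush=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        time.sleep(poll_s)
    return True
```

To try it, call `wait_for_debugger()` as the first statement of `init_process`, start the script with `WAIT_FOR_DEBUGGER=1`, and attach from a second terminal with `gdb -p <pid>` or `lldb -p <pid>` using the printed PID. Set breakpoints, then continue. On some Linux systems, attaching to a process that is not a child of the debugger may require extra permissions.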
Thanks for your answer OVO
@HongLouyemeng Hope this helps, will close this issue.
Hi, I'm trying to debug PyTorch distributed applications, but lldb doesn't jump to the C++ code in a distributed script QAQ