oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

What tools do you use to debug code that mixes python and C++? #104

Closed HongLouyemeng closed 6 months ago

HongLouyemeng commented 7 months ago

Hi, I'm trying to debug PyTorch distributed applications, but lldb doesn't step into the C++ code when I run a distributed script. QAQ

HongLouyemeng commented 7 months ago

With the same code launched through torch.multiprocessing, the lldb debugger does not work.


import os
import time
import numpy as np
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
from torch.distributed._tensor import DTensor, DeviceMesh, Shard, Replicate, distribute_tensor,zeros

def run(rank, size):
    a = torch.tensor([[0, 2.], [3, 0]])
    a.neg()

def init_process(rank_id, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '12347'
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group(backend, rank=rank_id, world_size=size)
    fn(rank_id, size)

if __name__ == "__main__":
    big_tensor = torch.arange(0,16).reshape(4,4)
    size = 1
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
nikitaxgusev commented 6 months ago

@HongLouyemeng Hi. In our experience, once the environment for running the .py scenarios is set up, you should build oneCCL with a debug build type, for example:

cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_BUILD_TYPE=debug && make -j && make install

You can try native gdb:

mpiexec -n 2 gdb -ex run --args python test.py

Or try the gdb option of Intel MPI (impi):

mpiexec -n 2 -gdb python test.py

You can then set a breakpoint in oneCCL's code, which should help. If you hit an issue and want a backtrace, run this command in the gdb console:

thread apply all bt
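When the workers are started with torch.multiprocessing and the "spawn" start method, as in the script above, each worker is a separate process, so a debugger started on the parent does not follow into it by default. One possible workaround, sketched here as an assumption rather than anything recommended in this thread, is to have each worker print its PID and pause briefly so that a native debugger can attach by PID (for example `gdb -p <pid>` or `lldb -p <pid>`). The helper name `wait_for_debugger` and the `WAIT_FOR_DEBUGGER` environment variable are hypothetical and not part of PyTorch or oneCCL.

```python
import os
import sys
import time


def wait_for_debugger(timeout_s=60.0, poll_s=0.5, env_var="WAIT_FOR_DEBUGGER"):
    """Pause this process so a native debugger can attach by PID.

    Hypothetical helper: it only pauses when the environment variable
    named by env_var is set to "1". Returns True if it paused and
    False if it was skipped.
    """
    if os.environ.get(env_var) != "1":
        return False

    pid = os.getpid()
    print(
        f"[pid {pid}] pausing up to {timeout_s}s; attach with: gdb -p {pid}",
        file=sys.stderr,
        flush=True,
    )

    # Sleep in short steps until the deadline passes. A debugger that
    # attaches during this window stops the process; after breakpoints
    # are set and the process is continued, the remaining wait elapses
    # and normal execution resumes.
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(poll_s, remaining))
    return True
```

To try it, one could call `wait_for_debugger()` as the first statement of `init_process` and launch the script with `WAIT_FOR_DEBUGGER=1` set in the environment; spawned workers inherit that environment. After attaching, set the breakpoint and continue the process. On some Linux systems, attaching to a running process may require additional ptrace permissions.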

HongLouyemeng commented 6 months ago

Thanks for your answer OVO

nikitaxgusev commented 6 months ago

@HongLouyemeng Hope this helps, will close this issue.