
[Ray Train] PyTorch profiler does not work with Ray Train TorchTrainer #47131

Open · KepingYan opened this issue 3 months ago

KepingYan commented 3 months ago

What happened + What you expected to happen

If I add the PyTorch profiler to the train_func of a TorchTrainer, it reports an error:

Training started without custom configuration.
(TorchTrainer pid=794896) Started distributed worker processes:
(TorchTrainer pid=794896) - (ip=10.0.0.26, pid=795002) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=795002) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=795002) [W314 19:00:53.893223839 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
(RayTrainWorker pid=795002) /home/ykp/miniconda3/envs/test/lib/python3.9/site-packages/torch/profiler/profiler.py:406: UserWarning: Profiler won't be using warmup, this can skew profiler results
(RayTrainWorker pid=795002)   warn("Profiler won't be using warmup, this can skew profiler results")
(RayTrainWorker pid=795002) ERROR: External init callback must run in same thread as registerClient (1811936832 != -2124761280)

But if train_func is called alone, without TorchTrainer, the profiler works normally. I followed this tutorial (https://docs.ray.io/en/master/ray-observability/user-guides/debug-apps/optimize-performance.html#performance-debugging-gpu-profiling); is there any other configuration that needs to be modified?

Versions / Dependencies

Ray 2.32.0
Python 3.9.18
Torch 2.4.0
OS: Ubuntu 22.04.4

Reproduction script

import torch
from fvcore.nn import sigmoid_focal_loss
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    device = torch.device("cuda")
    inputs = torch.ones(340399, 80).to(device)
    targets = torch.ones(340399, 80).to(device)
    # Profile a single focal-loss computation and dump the trace
    # for TensorBoard via tensorboard_trace_handler.
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU],
        schedule=torch.profiler.schedule(wait=0, warmup=0, active=1, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('.'),
        with_stack=True,
    ) as prof:
        loss_cls = sigmoid_focal_loss(
            inputs,
            targets,
            alpha=0.25,
            gamma=2.0,
            reduction="sum",
        )
        loss_cls_cpu = loss_cls.cpu()

    print(f"# {loss_cls_cpu}")

ray.init(address='auto')

scaling_config = ScalingConfig(num_workers=1, use_gpu=True)
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
)
result = trainer.fit()

# train_func()      # works normally if called directly
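
For runs with more than one worker, one way to keep traces from different workers separate is to write each rank's trace into its own directory. A minimal sketch, assuming Ray's ray.train.get_context() API; the profiler_traces output directory and train_func_per_rank name are hypothetical, not part of the original repro:

import torch
import ray.train

def train_func_per_rank():
    # Hypothetical variant: route each worker's trace into its own
    # subdirectory so traces from multiple workers don't collide
    # (assumption, not from the issue).
    rank = ray.train.get_context().get_world_rank()
    trace_dir = f"./profiler_traces/rank_{rank}"
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU],
        on_trace_ready=torch.profiler.tensorboard_trace_handler(trace_dir),
    ) as prof:
        pass  # training step(s) to profile go here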

Issue Severity

High: It blocks me from completing my task.

zacharie-martin commented 2 months ago

Also facing this issue. Any progress?

hongpeng-guo commented 2 months ago

cc @KepingYan, @zacharie-martin. Thanks for flagging this! I think this error message is just logging from torch and should be non-blocking. Even though you may see this error message, the torch profiler still works. I tried the repro script above and could successfully get the .json trace file and view it in TensorBoard.
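
For anyone who wants to confirm the trace was actually written despite the error, a minimal sanity check, assuming tensorboard_trace_handler's default "*.pt.trace.json" naming and that the file landed in the working directory:

import glob
import json

# tensorboard_trace_handler names files "<worker>.<timestamp>.pt.trace.json"
traces = glob.glob("*.pt.trace.json")
assert traces, "no trace files found"
with open(traces[0]) as f:
    events = json.load(f)["traceEvents"]  # Chrome trace format
print(f"{traces[0]}: {len(events)} trace events recorded")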

[Screenshots, 2024-09-19: TensorBoard views of the generated profiler trace]