ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How to execute Multi-Machine DDP training in SageMaker? #12811

Closed kerrlabajo closed 4 months ago

kerrlabajo commented 6 months ago

Question

I have managed to execute Multi-GPU DDP training in an Amazon SageMaker training job using my custom training image, which contains the yolov5 repository and is pushed to my private Elastic Container Registry (ECR). Only one instance was used, of type ml.g4dn.12xlarge, run as a spot training job. I am currently using the AWS SDK for .NET, which executes AmazonSageMakerClient.CreateTrainingJob with my custom CreateTrainingJobRequest.

_Is there a way to execute the Multi-Machine DDP training that was recommended in Multi-GPU DDP Training for SageMaker with my current implementation?_

Here is the custom script in my training image that runs training and export of the model; it accepts arguments passed via ContainerArguments:

import shutil
import subprocess
import argparse

def run_script(args, use_module=False):
    """
    Run a Python script with arguments.

    Parameters:
    `args` (list): The script and arguments to pass.
    `use_module` (bool): Whether to use the -m option to run the script as a module.

    Returns:
    `None`
    """
    if use_module:
        subprocess.run(["python3", "-m"] + args, check=True)
    else:
        subprocess.run(["python3"] + args, check=True)

def parse_arguments():
    parser = argparse.ArgumentParser(description='Run train.py and export.py scripts with command line arguments.')
    parser.add_argument('--img-size', type=str, required=True)
    parser.add_argument('--batch', type=str, required=True)
    parser.add_argument('--epochs', type=str, required=True)
    parser.add_argument('--weights', type=str, required=True)
    parser.add_argument('--data', type=str, required=True)
    parser.add_argument('--hyp', type=str, required=True)
    parser.add_argument('--project', type=str, required=True)
    parser.add_argument('--name', type=str, required=True)
    parser.add_argument('--patience', type=str, required=True)
    parser.add_argument('--workers', type=str, required=True)
    parser.add_argument('--optimizer', type=str, required=True)
    parser.add_argument('--device', type=str, required=True)
    parser.add_argument('--include', type=str, required=True)

    return parser.parse_args()

def main():
    """
    Main function to run `train.py` and `export.py` scripts with command line arguments.

    The first 24 arguments are passed to `train.py` and the remaining arguments are passed to `export.py`.

    Example:
    >>> python3 yolov5/train_and_export.py --img-size 640 --batch 1 --epochs 1 --weights yolov5s.pt 
    >>> --data /opt/ml/input/data/train/data.yaml --hyp hyp.scratch-low.yaml 
    >>> --project "/opt/ml/output/data/" --name "results" 
    >>> --patience 100 --workers 8 --optimizer SGD --device 0 --include onnx

    The 25th argument marks the start of the `export.py` args.

    Returns:
    None
    """
    args = parse_arguments()
    device_count = len(args.device.split(','))

    converter_args = [
        "yolov5/json_to_yaml_converter.py", '/opt/ml/input/config/hyperparameters.json'
    ]
    multi_gpu_ddp_args = [
        "torch.distributed.run", "--nproc_per_node", str(device_count)
    ]
    train_args = [
        "yolov5/train.py", "--img-size", args.img_size, "--batch", args.batch, "--epochs", args.epochs, 
        "--weights", args.weights, "--data", args.data, 
        "--hyp", '/opt/ml/input/config/custom-hyps.yaml' if args.hyp == "Custom" else args.hyp, 
        "--project", args.project, "--name", args.name, 
        "--patience", args.patience, "--workers", args.workers, "--optimizer", args.optimizer, 
        "--device", args.device, "--cache"
    ]
    export_args = [
        "yolov5/export.py", "--img-size", args.img_size, 
        "--weights", '/opt/ml/output/data/results/weights/best.pt', 
        "--include", args.include, "--device", args.device
    ]

    if args.hyp == "Custom":
        run_script(converter_args)

    if device_count > 1:
        run_script(multi_gpu_ddp_args + train_args, use_module=True)
    else:
        run_script(train_args)

    run_script(export_args)

    # Copy the best.onnx file to the /opt/ml/model/ directory
    shutil.copy2('/opt/ml/output/data/results/weights/best.onnx', '/opt/ml/model/')

if __name__ == "__main__":
    main()

Additional

No response

glenn-jocher commented 6 months ago

@kerrlabajo hello! It's great to see your interest in leveraging Multi-Machine DDP training with YOLOv5 on SageMaker. To execute Multi-Machine DDP training, you'll need to adjust your setup to support distributed training across multiple instances. SageMaker supports distributed training, but you'll need to ensure your setup is correctly configured for inter-machine communication.

For Multi-Machine DDP, you typically need to:

  1. Ensure your SageMaker job is configured to launch multiple instances.
  2. Modify your script to initialize the distributed environment correctly. This often involves setting up environment variables like MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK, which are crucial for distributed training. SageMaker might handle some of this for you, but you should verify.
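As a minimal sketch of point 2 above (assuming the standard SageMaker training container layout, where /opt/ml/input/config/resourceconfig.json lists the hosts), these variables could be derived roughly as follows. Note that torch.distributed.run computes RANK and WORLD_SIZE for its child processes from --nnodes and --node_rank, so you may only need a subset of these:

```python
# Hedged sketch, not a drop-in: derive DDP rendezvous settings from
# SageMaker's resource config (path and keys follow the SageMaker
# training container convention).
import json
import os

with open("/opt/ml/input/config/resourceconfig.json") as f:
    cfg = json.load(f)

hosts = cfg["hosts"]                         # e.g. ['algo-1', 'algo-2', ...]
os.environ["MASTER_ADDR"] = hosts[0]         # first host acts as master
os.environ["MASTER_PORT"] = "29500"          # any free port shared by all nodes
os.environ["WORLD_SIZE"] = str(len(hosts))   # assuming one process per node
os.environ["RANK"] = str(hosts.index(cfg["current_host"]))
```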

In your script, you're already using torch.distributed.run which is a good start. Make sure that your SageMaker job configuration correctly specifies the number of instances and that each instance has access to the dataset. You might also need to adjust your script to ensure it correctly identifies the number of nodes and assigns ranks to each process across the nodes.
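On the job-configuration side, the relevant knob is ResourceConfig.InstanceCount, which tells SageMaker to start multiple identical containers (algo-1, algo-2, ...) for one training job. A hedged boto3 sketch rather than your .NET SDK, with every name, ARN, and URI below a placeholder:

```python
# Hedged sketch; adapt to CreateTrainingJobRequest in the AWS SDK for .NET.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="yolov5-multinode-example",  # placeholder
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/yolov5:latest",  # placeholder
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-role>",  # placeholder
    ResourceConfig={
        "InstanceType": "ml.g4dn.12xlarge",
        "InstanceCount": 2,  # >1 launches multiple nodes for Multi-Machine DDP
        "VolumeSizeInGB": 50,
    },
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/output"},  # placeholder
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)
```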

Remember, when using Ultralytics YOLOv5 in a commercial or proprietary solution, or even for internal company usage, you need an Ultralytics Enterprise License unless you're open-sourcing your entire project under AGPL-3.0. This includes any stage of R&D, development, and deployment, internal or external. There are no exceptions to this rule. For more details on licensing, please refer to our documentation.

If you have further questions or need clarification on licensing, feel free to ask. Good luck with your project! 🚀

kerrlabajo commented 6 months ago

One more question, if I understand the following commands correctly:

# On machine R
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

This will have to be iteratively executed with an increasing node rank R from 0...(N-1) for each of the N instances under one training job. Is that correct?

I am considering open-sourcing the project so that it is open for extension and other usages, but given its current imposed use case and the NDA I have signed, there is a chance the software might fall under internal company usage. I will have to ask my advisor and collaborator how to proceed with the project and its licensing.

Many thanks!

glenn-jocher commented 6 months ago

@kerrlabajo yes, you've got the right idea regarding the distributed training setup. When using torch.distributed.run for Multi-Machine DDP training, you indeed need to specify the --node_rank for each machine, which should range from 0 to (N-1) where N is the total number of nodes (or machines) involved in the training. Each machine will be assigned a unique node_rank, which helps in coordinating the distributed training process across all the machines.
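For concreteness, a hedged illustration with placeholder values (2 nodes, 1 GPU per node, placeholder master address); in practice each machine runs only the single command carrying its own rank:

```python
# Illustrative only: the launch command differs across machines solely
# in --node_rank.
N, G = 2, 1  # placeholder node and per-node GPU counts
for R in range(N):  # machine R runs only its own command
    print(
        f"python -m torch.distributed.run --nproc_per_node {G} --nnodes {N} "
        f'--node_rank {R} --master_addr "192.168.1.1" --master_port 1234 '
        "train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''"
    )
```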

For your setup in SageMaker, you'll typically configure the training job to launch multiple instances, and the orchestration of assigning node_rank and other related configurations might be managed by SageMaker's environment. However, it's crucial to ensure that your script and SageMaker's job configuration are aligned to correctly initialize and execute the distributed training.

Regarding your project and licensing considerations, it's great to hear that you're thinking about open-sourcing your project. Open-sourcing not only benefits the wider community but also aligns with the ethos of sharing and collaboration in the AI and machine learning fields. However, given your mention of an NDA and potential internal company usage, it's indeed important to discuss with your advisor and collaborator on how to proceed. If your project ends up being used internally within a company and not open-sourced, remember that an Ultralytics Enterprise License would be required as per our licensing terms. This applies to any usage of Ultralytics models, architectures, or code in commercial or proprietary solutions, or even for internal company usage, unless the entire project is open-sourced under AGPL-3.0.

Feel free to reach out if you have more questions or need further assistance as you navigate these considerations. Best of luck with your project and discussions regarding its future! 🌟

kerrlabajo commented 6 months ago

@glenn-jocher I've been testing multi-machine DDP training in numerous attempts to get my nodes connected. I was only able to make progress by following the documentation in the Distributed Training Configuration section, passing algo-1 as the master host/address, and I finally managed to establish a connection, only to end up with the main error `misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 169.254.255.18<42219> failed : Software caused connection abort`. I would like to ask for your opinion and thoughts about how the operation usually proceeds on the master and the other machines.

Here are the full details from the master machine:

The other machines, algo-2 ... algo-n, also produce the same output as below, except that the `<42219>` part is always different; I am not sure if this was the cause.

{'current_host': 'algo-1', 'current_instance_type': 'ml.g4dn.xlarge', 'current_group_name': 'homogeneousCluster', 'hosts': ['algo-1', 'algo-2', 'algo-3', 'algo-4'], 'instance_groups': [{'instance_group_name': 'homogeneousCluster', 'instance_type': 'ml.g4dn.xlarge', 'hosts': ['algo-3', 'algo-1', 'algo-4', 'algo-2']}], 'network_interface_name': 'eth0'}
train: weights=yolov5s.pt, cfg=, data=/opt/ml/input/data/train/data.yaml, hyp=hyp.no-augmentation.yaml, epochs=250, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=code/yolov5/data/hyps, resume_evolve=None, bucket=, cache=ram, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=/opt/ml/output/data/, name=results, exist_ok=True, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-295-gac6c4383 Python-3.10.12 torch-2.2.2+cu121 CUDA:0 (Tesla T4, 14931MiB)
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0, hsv_s=0, hsv_v=0, degrees=0.0, translate=0, scale=0, shear=0, perspective=0.0, flipud=0.0, fliplr=0.0, mosaic=0.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir /opt/ml/output/data', view at http://localhost:6006/
Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
100%|██████████| 755k/755k [00:00<00:00, 121MB/s]
NCCL version 2.19.3+cuda12.3
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:07.0/../max_link_speed, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:07.0/../max_link_width, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-225-125:17:51 [0] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-225-125:17:51 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-225-125:17:51 [0] NCCL INFO === System : maxBw 1.2 totalBw 12.0 ===
ip-10-0-225-125:17:51 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-225-125:17:51 [0] NCCL INFO + PCI[5000.0] - NIC/0
ip-10-0-225-125:17:51 [0] NCCL INFO                 + NET[1.2] - NET/1 (1/0/1.250000)
ip-10-0-225-125:17:51 [0] NCCL INFO + PCI[12.0] - NIC/70
ip-10-0-225-125:17:51 [0] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-225-125:17:51 [0] NCCL INFO + PCI[12.0] - GPU/1E0 (0)
ip-10-0-225-125:17:51 [0] NCCL INFO ==========================================
ip-10-0-225-125:17:51 [0] NCCL INFO GPU/1E0 :GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB)
ip-10-0-225-125:17:51 [0] NCCL INFO NET/0 :GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (4/1.250000/PHB)
ip-10-0-225-125:17:51 [0] NCCL INFO NET/1 :GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (4/1.250000/PHB) NET/1 (0/5000.000000/LOC)
ip-10-0-225-125:17:51 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
ip-10-0-225-125:17:51 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
ip-10-0-225-125:17:51 [0] NCCL INFO  1 : NET/1 GPU/0 NET/1
ip-10-0-225-125:17:51 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
ip-10-0-225-125:17:51 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
ip-10-0-225-125:17:51 [0] NCCL INFO  1 : NET/1 GPU/0 NET/1
ip-10-0-225-125:17:51 [0] NCCL INFO Tree 0 : -1 -> 0 -> 2/-1/-1
ip-10-0-225-125:17:51 [0] NCCL INFO Tree 2 : 1 -> 0 -> -1/-1/-1
ip-10-0-225-125:17:51 [0] NCCL INFO Tree 1 : -1 -> 0 -> 2/-1/-1
ip-10-0-225-125:17:51 [0] NCCL INFO Tree 3 : 1 -> 0 -> -1/-1/-1
ip-10-0-225-125:17:51 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
ip-10-0-225-125:17:51 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
ip-10-0-225-125:17:51 [0] NCCL INFO Ring 02 : 3 -> 0 -> 1
ip-10-0-225-125:17:51 [0] NCCL INFO Ring 03 : 3 -> 0 -> 1
ip-10-0-225-125:17:52 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 169.254.255.18<42219> failed : Software caused connection abort
ip-10-0-225-125:17:52 [0] NCCL INFO misc/socket.cc:565 -> 2
ip-10-0-225-125:17:52 [0] NCCL INFO misc/socket.cc:587 -> 2
ip-10-0-225-125:17:52 [0] NCCL INFO transport/net_socket.cc:338 -> 2
ip-10-0-225-125:17:52 [0] NCCL INFO transport/net.cc:677 -> 2
ip-10-0-225-125:17:51 [0] NCCL INFO transport/net.cc:304 -> 2
ip-10-0-225-125:17:51 [0] NCCL INFO transport.cc:148 -> 2
ip-10-0-225-125:17:51 [0] NCCL INFO init.cc:1117 -> 2
ip-10-0-225-125:17:51 [0] NCCL INFO init.cc:1396 -> 2
ip-10-0-225-125:17:17 [0] NCCL INFO group.cc:418 -> 2
ip-10-0-225-125:17:17 [0] NCCL INFO group.cc:95 -> 2
ip-10-0-225-125:17:52 [0] NCCL INFO misc/socket.cc:47 -> 3
ip-10-0-225-125:17:52 [0] NCCL INFO misc/socket.cc:58 -> 3
ip-10-0-225-125:17:52 [0] NCCL INFO misc/socket.cc:773 -> 3
ip-10-0-225-125:17:52 [0] NCCL INFO proxy.cc:1374 -> 3
ip-10-0-225-125:17:52 [0] NCCL INFO proxy.cc:1415 -> 3
ip-10-0-225-125:17:52 [0] proxy.cc:1557 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
Traceback (most recent call last):
  File "/code/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/code/yolov5/train.py", line 623, in main
    train(opt.hyp, opt, device, callbacks)
  File "/code/yolov5/train.py", line 175, in train
    with torch_distributed_zero_first(LOCAL_RANK):
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/code/yolov5/utils/torch_utils.py", line 100, in torch_distributed_zero_first
    dist.barrier(device_ids=[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 169.254.255.18<42219> failed : Software caused connection abort
[2024-03-29 10:57:14,155] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 17) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/code/yolov5/train.py FAILED
------------------------------------------------------------
Failures:  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-29_10:57:14
  host      : algo-1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Master IP address: algo-1
Local IP address: algo-1
Traceback (most recent call last):
  File "/code/train_and_export.py", line 133, in <module>
    main()
  File "/code/train_and_export.py", line 120, in main
    run_script(multi_instance_gpu_ddp_args + train_args, use_module=True)
  File "/code/train_and_export.py", line 30, in run_script
    subprocess.run(["python3", "-m"] + args, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.run', '--nproc_per_node', '1', '--nnodes', '4', '--node_rank', '0', '--master_addr', 'algo-1', '--master_port', '29500', '/code/yolov5/train.py', '--img-size', '640', '--batch', '64', '--epochs', '250', '--weights', 'yolov5s.pt', '--data', '/opt/ml/input/data/train/data.yaml', '--hyp', 'hyp.no-augmentation.yaml', '--project', '/opt/ml/output/data/', '--name', 'results', '--patience', '100', '--workers', '8', '--optimizer', 'SGD', '--device', '0', '--cache', '--exist-ok']' returned non-zero exit status 1.

Full details from the second machine and onwards (algo-2 ... algo-n):

{'current_host': 'algo-2', 'current_instance_type': 'ml.g4dn.xlarge', 'current_group_name': 'homogeneousCluster', 'hosts': ['algo-1', 'algo-2', 'algo-3', 'algo-4'], 'instance_groups': [{'instance_group_name': 'homogeneousCluster', 'instance_type': 'ml.g4dn.xlarge', 'hosts': ['algo-3', 'algo-1', 'algo-4', 'algo-2']}], 'network_interface_name': 'eth0'}
100%|██████████| 755k/755k [00:00<00:00, 125MB/s]
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:07.0/../max_link_speed, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:07.0/../max_link_width, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-247-203:16:36 [0] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-247-203:16:36 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-247-203:16:36 [0] NCCL INFO === System : maxBw 1.2 totalBw 12.0 ===
ip-10-0-247-203:16:36 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
ip-10-0-247-203:16:36 [0] NCCL INFO + PCI[5000.0] - NIC/0
ip-10-0-247-203:16:36 [0] NCCL INFO                 + NET[1.2] - NET/1 (1/0/1.250000)
ip-10-0-247-203:16:36 [0] NCCL INFO + PCI[12.0] - NIC/70
ip-10-0-247-203:16:36 [0] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-247-203:16:36 [0] NCCL INFO + PCI[12.0] - GPU/1E0 (1)
ip-10-0-247-203:16:36 [0] NCCL INFO ==========================================
ip-10-0-247-203:16:36 [0] NCCL INFO GPU/1E0 :GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB)
ip-10-0-247-203:16:36 [0] NCCL INFO NET/0 :GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (4/1.250000/PHB)
ip-10-0-247-203:16:36 [0] NCCL INFO NET/1 :GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (4/1.250000/PHB) NET/1 (0/5000.000000/LOC)
ip-10-0-247-203:16:36 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
ip-10-0-247-203:16:36 [0] NCCL INFO  0 : NET/0 GPU/1 NET/0
ip-10-0-247-203:16:36 [0] NCCL INFO  1 : NET/1 GPU/1 NET/1
ip-10-0-247-203:16:36 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
ip-10-0-247-203:16:36 [0] NCCL INFO  0 : NET/0 GPU/1 NET/0
ip-10-0-247-203:16:36 [0] NCCL INFO  1 : NET/1 GPU/1 NET/1
ip-10-0-247-203:16:36 [0] NCCL INFO Tree 0 : 2 -> 1 -> -1/-1/-1
ip-10-0-247-203:16:36 [0] NCCL INFO Tree 2 : 3 -> 1 -> 2/0/-1
ip-10-0-247-203:16:36 [0] NCCL INFO Tree 1 : 2 -> 1 -> -1/-1/-1
ip-10-0-247-203:16:36 [0] NCCL INFO Tree 3 : 3 -> 1 -> 2/0/-1
ip-10-0-247-203:16:36 [0] NCCL INFO Ring 00 : 0 -> 1 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO Ring 01 : 0 -> 1 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO Ring 02 : 0 -> 1 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO Ring 03 : 0 -> 1 -> 2
ip-10-0-247-203:16:37 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 169.254.255.18<58539> failed : Software caused connection abort
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:565 -> 2
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:587 -> 2
ip-10-0-247-203:16:37 [0] NCCL INFO transport/net_socket.cc:338 -> 2
ip-10-0-247-203:16:37 [0] NCCL INFO transport/net.cc:677 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO transport/net.cc:304 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO transport.cc:148 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO init.cc:1117 -> 2
ip-10-0-247-203:16:36 [0] NCCL INFO init.cc:1396 -> 2
ip-10-0-247-203:16:16 [0] NCCL INFO group.cc:418 -> 2
ip-10-0-247-203:16:16 [0] NCCL INFO group.cc:95 -> 2
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:47 -> 3
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:58 -> 3
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:773 -> 3
ip-10-0-247-203:16:37 [0] NCCL INFO proxy.cc:1374 -> 3
ip-10-0-247-203:16:37 [0] proxy.cc:1523 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
ip-10-0-247-203:16:37 [0] NCCL INFO misc/socket.cc:806 -> 3
ip-10-0-247-203:16:37 [0] proxy.cc:1533 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
ip-10-0-247-203:16:37 [0] proxy.cc:1557 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
Traceback (most recent call last):
  File "/code/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/code/yolov5/train.py", line 623, in main
    train(opt.hyp, opt, device, callbacks)
  File "/code/yolov5/train.py", line 175, in train
    with torch_distributed_zero_first(LOCAL_RANK):
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/code/yolov5/utils/torch_utils.py", line 100, in torch_distributed_zero_first
    dist.barrier(device_ids=[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 169.254.255.18<58539> failed : Software caused connection abort
[2024-03-29 10:57:14,159] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 16) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/code/yolov5/train.py FAILED
------------------------------------------------------------
Failures:  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-29_10:57:14
  host      : algo-2
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 16)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Master IP address: algo-1
Local IP address: algo-2
Traceback (most recent call last):
  File "/code/train_and_export.py", line 133, in <module>
    main()
  File "/code/train_and_export.py", line 120, in main
    run_script(multi_instance_gpu_ddp_args + train_args, use_module=True)
  File "/code/train_and_export.py", line 30, in run_script
    subprocess.run(["python3", "-m"] + args, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.run', '--nproc_per_node', '1', '--nnodes', '4', '--node_rank', '1', '--master_addr', 'algo-1', '--master_port', '29500', '/code/yolov5/train.py', '--img-size', '640', '--batch', '64', '--epochs', '250', '--weights', 'yolov5s.pt', '--data', '/opt/ml/input/data/train/data.yaml', '--hyp', 'hyp.no-augmentation.yaml', '--project', '/opt/ml/output/data/', '--name', 'results', '--patience', '100', '--workers', '8', '--optimizer', 'SGD', '--device', '0', '--cache', '--exist-ok']' returned non-zero exit status 1.

Here are the code changes:

  1. Added get_node_rank to utilize /opt/ml/input/config/resourceconfig.json, which SageMaker creates automatically.
  2. Used the current_host and node_rank derived from the hosts indices.
  3. Added multi_instance_gpu_ddp_args to handle multi-machine DDP training.
    
import datetime
import shutil
import subprocess
import argparse
import json
import os
import torch.distributed as dist
import socket

def get_node_rank():
    """
    Read SageMaker's resourceconfig.json to determine this node's rank.

    Returns:
    `current_host` (str): The name of this host, e.g. 'algo-1'.
    `node_rank` (int): The index of this host in the hosts list.
    """
    with open('/opt/ml/input/config/resourceconfig.json') as f:
        data = json.load(f)
    current_host = data['current_host']
    hosts = data['hosts']
    node_rank = hosts.index(current_host)
    return current_host, node_rank

def run_script(args, use_module=False):
    """
    Run a Python script with arguments.

    Parameters:
    `args` (list): The script and arguments to pass.
    `use_module` (bool): Whether to use the -m option to run the script as a module.

    Returns:
    `None`
    """
    if use_module:
        subprocess.run(["python3", "-m"] + args, check=True)
    else:
        subprocess.run(["python3"] + args, check=True)

def parse_arguments():
    parser = argparse.ArgumentParser(description='Run train.py and export.py scripts with command line arguments.')
    parser.add_argument('--img-size', type=str, required=True)
    parser.add_argument('--batch', type=str, required=True)
    parser.add_argument('--epochs', type=str, required=True)
    parser.add_argument('--weights', type=str, required=True)
    parser.add_argument('--data', type=str, required=True)
    parser.add_argument('--hyp', type=str, required=True)
    parser.add_argument('--project', type=str, required=True)
    parser.add_argument('--name', type=str, required=True)
    parser.add_argument('--patience', type=str, required=True)
    parser.add_argument('--workers', type=str, required=True)
    parser.add_argument('--optimizer', type=str, required=True)
    parser.add_argument('--device', type=str, required=True)
    parser.add_argument('--include', type=str, required=True)
    parser.add_argument('--nnodes', type=str, required=True)
    parser.add_argument('--node-rank', type=str, required=True)
    # parser.add_argument('--master-addr', type=str, required=True)
    # parser.add_argument('--master_port', type=str, required=True)

    return parser.parse_args()

def main():
    """
    Main function to run `train.py` and `export.py` scripts with command line arguments.

    The first 24 arguments are passed to `train.py` and the remaining arguments are passed to `export.py`.

    Example:
    >>> python3 /code/train_and_export.py --img-size 640 --batch 1 --epochs 1 --weights yolov5s.pt 
    >>> --data /opt/ml/input/data/train/data.yaml --hyp hyp.scratch-low.yaml 
    >>> --project "/opt/ml/output/data/" --name "results" 
    >>> --patience 100 --workers 8 --optimizer SGD --device 0 --include onnx --nnodes 1

    Returns:
    None
    """
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_SUBSYS"] = "GRAPH"
    args = parse_arguments()
    device_count = len(args.device.split(','))
    current_host, node_rank = get_node_rank()
    master_host = 'algo-1'
    master_port = "29500"

    resource_config_args = [
        "/code/resource_config_reader.py", '/opt/ml/input/config/resourceconfig.json'
    ]
    converter_args = [
        "/code/json_to_yaml_converter.py", '/opt/ml/input/config/hyperparameters.json'
    ]
    multi_gpu_ddp_args = [
        "torch.distributed.run", "--nproc_per_node", str(device_count)
    ]
    multi_instance_gpu_ddp_args = [
        "torch.distributed.run", "--nproc_per_node", str(device_count),
        "--nnodes", args.nnodes, "--node_rank", str(node_rank),
        "--master_addr", master_host, "--master_port", master_port
    ]
    train_args = [
        "/code/yolov5/train.py", "--img-size", args.img_size, "--batch", args.batch, "--epochs", args.epochs,
        "--weights", args.weights, "--data", args.data,
        "--hyp", '/opt/ml/input/config/custom-hyps.yaml' if args.hyp == "Custom" else args.hyp,
        "--project", args.project, "--name", args.name,
        "--patience", args.patience, "--workers", args.workers, "--optimizer", args.optimizer,
        "--device", args.device, "--cache", "--exist-ok"
    ]
    export_args = [
        "/code/yolov5/export.py", "--img-size", args.img_size,
        "--weights", args.project + args.name + '/weights/best.pt',
        "--include", args.include, "--device", args.device
    ]

    print("Master IP address:", master_host)
    print("Local IP address:", current_host)

    run_script(resource_config_args)

    if args.hyp == "Custom":
        run_script(converter_args)

    # Prefer multi-machine DDP when more than one node is requested;
    # otherwise fall back to single-machine multi-GPU or single-GPU training.
    if int(args.nnodes) > 1:
        run_script(multi_instance_gpu_ddp_args + train_args, use_module=True)
    elif device_count > 1:
        run_script(multi_gpu_ddp_args + train_args, use_module=True)
    else:
        run_script(train_args)

    run_script(export_args)

    # Copy the best.onnx file to the /opt/ml/model/ directory
    shutil.copy2('/opt/ml/output/data/results/weights/best.onnx', '/opt/ml/model/')

if __name__ == "__main__":
    main()


Concerns
> Output will only be shown on master machine!
1. This was mentioned in [Ultralytics YOLOv8 Docs Multi-GPU Training](https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training/#multi-gpu-distributeddataparallel-mode-recommended). Based on the log streams I presented from machine `algo-2`, is it supposed to show that kind of output, or was it showing signs that it had actually established a connection with machine `algo-1` which was then aborted?
2. If it was not supposed to show any output, could it be that machines `algo-2` ... `algo-n` were executing the training on their own connections, and that this was the reason they could not connect with machine `algo-1`?
3. I've come across and attempted to use `torch.distributed.init_process_group(backend='nccl', rank=node_rank, world_size=int(args.nnodes) * device_count, init_method=init_method)`, where `master_addr = socket.gethostbyname('algo-1')` is passed into `init_method = f"tcp://{master_addr}:{master_port}"`, only to encounter another error:
```bash
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12355 (errno: 98 - Address already in use).
```

I am not sure whether torch.distributed.init_process_group is necessary when using torch.distributed.run, nor whether I'm going in the right direction with this attempt.

I really appreciate you taking the time to read this. I just wanted to lay out these errors and hopefully find ways/solutions to proceed with my use case, in the hope that this also provides insight for similar issues.

Many thanks in advance!

glenn-jocher commented 5 months ago

@kerrlabajo, it sounds like you're encountering some common challenges with distributed training setup, particularly around networking and process group initialization 😅. Your detailed tracebacks and configuration efforts provide a good starting point for troubleshooting.

  1. Connection Issues: The NCCL warnings about socket connections indicate there might be network configuration issues preventing successful communication between your nodes. Ensure that your network setup allows for inter-node communication on the ports being used. SageMaker environments should generally handle this, but it's worth verifying if there are any security groups or network policies interfering with the expected traffic.

  2. Output on Master Node Only: It's normal for the majority of the log output to appear from the master node in a distributed setup. However, you should still see some initialization logs from other nodes if the setup is correct. If you're not seeing any logs from other machines at all, it might indicate they're not correctly initiating or joining the distributed training process.

  3. Address In Use: The error about the address already being in use suggests that the port you're attempting to bind to for communication is already occupied. This could happen if a previous training job didn't properly release the port or if another process is using it. Try using a different port number to see if the issue persists; a quick way to check is sketched after this list.
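A hypothetical helper (not part of YOLOv5 or SageMaker) to probe whether a candidate --master_port is free on the current host before launching:

```python
# Hypothetical port probe: returns True if nothing currently holds
# (host, port), i.e. we can bind to it ourselves.
import socket

def port_is_free(port: int, host: str = "") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(29500))  # e.g. probe the default master port
```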

Regarding torch.distributed.init_process_group, it's indeed essential for setting up the environment for distributed training but torch.distributed.run should abstract away the need for manually initializing the process group in most cases. You may still need to ensure torch.distributed.run is correctly invoking your script with the appropriate environment variables set for NCCL to function properly.
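To make that concrete, here's a minimal sketch of what each worker effectively does when launched by torch.distributed.run, which exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker's environment. YOLOv5's train.py performs this initialization internally, so this is illustrative rather than something to add to your launcher:

```python
# Illustrative worker-side initialization: with torch.distributed.run,
# the rendezvous settings are already in os.environ, so the env://
# init method needs no explicit addresses.
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
dist.destroy_process_group()
```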

Your attempts and thought process are going in the right direction.

I hope these insights help you move forward. Distributed training can be tricky, especially with networking nuances, but you're on the right track! Keep experimenting, and don't hesitate to reach out for more assistance. Happy coding! 🚀

kerrlabajo commented 5 months ago

@glenn-jocher I have finally resolved the connection issue by simply adding the environment variable NCCL_SOCKET_IFNAME=eth0 to the environment my code runs in. I had almost disregarded this configuration, found in the Troubleshooting section, which states:

Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0).
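In my case this amounted to a one-line change before launching the trainer. A minimal sketch of it, reading the interface name from resourceconfig.json instead of hard-coding eth0 (the key name matches the config printed below):

```python
# Sketch: pin NCCL to SageMaker's training network interface. The
# 'network_interface_name' key comes from resourceconfig.json.
import json
import os

with open("/opt/ml/input/config/resourceconfig.json") as f:
    iface = json.load(f)["network_interface_name"]  # 'eth0' on these instances
os.environ["NCCL_SOCKET_IFNAME"] = iface
```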

The network interface name is also confirmed to be correct by one of my logs, printed while reading /opt/ml/input/config/resourceconfig.json as advised by the Distributed Training Configuration in SageMaker documentation:

{'current_host': 'algo-1', 'current_instance_type': 'ml.g4dn.xlarge', 'current_group_name': 'homogeneousCluster', 'hosts': ['algo-1', 'algo-2', 'algo-3', 'algo-4'], 'instance_groups': [{'instance_group_name': 'homogeneousCluster', 'instance_type': 'ml.g4dn.xlarge', 'hosts': ['algo-3', 'algo-1', 'algo-4', 'algo-2']}], 'network_interface_name': 'eth0'}

The following instance types were tested with different batch sizes, 250 epochs, hyp.no-augmentation.yaml, and 5 instances/nodes:

  1. ml.g4dn.8xlarge: Batch Size of 80 (1x NVIDIA T4 Tensor GPU w/ 16 GB GPU Memory * 5 nodes)
  2. ml.g4dn.12xlarge: Batch Size of 320 (4x NVIDIA T4 Tensor GPU w/ 16 GB GPU Memory * 5 nodes)
  3. ml.p3.2xlarge: Batch Size of 80 (1x NVIDIA Tesla V100 GPU w/ 16 GB GPU Memory * 5 nodes)
  4. ml.p3.8xlarge: Batch Size of 320 (4x NVIDIA Tesla V100 GPU w/ 16 GB GPU Memory * 5 nodes)
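For context, my understanding (an assumption worth verifying against train.py) is that YOLOv5's DDP path divides --batch by the total WORLD_SIZE, so these configurations all land on the same per-GPU load: 320 / (5 nodes × 4 GPUs) = 16 images per GPU, just like 80 / (5 nodes × 1 GPU) = 16.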

The majority of these tests produced successful training jobs, except for ml.p3.8xlarge: it was able to train and export the model, but before the generated model could be uploaded to my S3 bucket, the fifth node (algo-4) returned the following output:

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
100%|██████████| 755k/755k [00:00<00:00, 85.7MB/s]
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_speed, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1b.0/../max_link_width, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1c.0/../max_link_width, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1d.0/../max_link_width, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_speed, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1e.0/../max_link_width, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-103-105:42:123 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:40:122 [0] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-103-105:40:122 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:40:122 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:40:122 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:40:122 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_speed, ignoring
ip-10-0-103-105:41:124 [1] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-103-105:41:124 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:41:124 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:41:124 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:41:124 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:05.0/../max_link_width, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
ip-10-0-103-105:43:121 [3] NCCL INFO KV Convert to int : could not find value of 'Unknown' in dictionary, falling back to 60
ip-10-0-103-105:43:121 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:43:121 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:43:121 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
ip-10-0-103-105:42:123 [2] NCCL INFO === System : maxBw 1.2 totalBw 100.0 ===
ip-10-0-103-105:42:123 [2] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/1)
ip-10-0-103-105:42:123 [2] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-103-105:42:123 [2] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-103-105:42:123 [2] NCCL INFO + PCI[12.0] - GPU/1B0 (16)
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1D0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:42:123 [2] NCCL INFO + PCI[12.0] - GPU/1C0 (17)
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1E0
ip-10-0-103-105:42:123 [2] NCCL INFO + PCI[12.0] - GPU/1D0 (18)
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1C0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:42:123 [2] NCCL INFO + PCI[12.0] - GPU/1E0 (19)
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[40.0] - GPU/1B0
ip-10-0-103-105:42:123 [2] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:42:123 [2] NCCL INFO ==========================================
ip-10-0-103-105:42:123 [2] NCCL INFO GPU/1B0 :GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/20.000000/NVL) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:42:123 [2] NCCL INFO GPU/1C0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (1/20.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:42:123 [2] NCCL INFO GPU/1D0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (1/40.000000/NVL) GPU/1D0 (0/5000.000000/LOC) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:42:123 [2] NCCL INFO GPU/1E0 :GPU/1B0 (1/40.000000/NVL) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:42:123 [2] NCCL INFO NET/0 :GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-103-105:40:122 [0] NCCL INFO === System : maxBw 1.2 totalBw 100.0 ===
ip-10-0-103-105:40:122 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/1)
ip-10-0-103-105:40:122 [0] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-103-105:40:122 [0] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-103-105:40:122 [0] NCCL INFO + PCI[12.0] - GPU/1B0 (16)
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1D0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:40:122 [0] NCCL INFO + PCI[12.0] - GPU/1C0 (17)
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1E0
ip-10-0-103-105:40:122 [0] NCCL INFO + PCI[12.0] - GPU/1D0 (18)
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1C0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:40:122 [0] NCCL INFO + PCI[12.0] - GPU/1E0 (19)
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[40.0] - GPU/1B0
ip-10-0-103-105:40:122 [0] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:40:122 [0] NCCL INFO ==========================================
ip-10-0-103-105:40:122 [0] NCCL INFO GPU/1B0 :GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/20.000000/NVL) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO GPU/1C0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (1/20.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO GPU/1D0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (1/40.000000/NVL) GPU/1D0 (0/5000.000000/LOC) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO GPU/1E0 :GPU/1B0 (1/40.000000/NVL) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO NET/0 :GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-103-105:41:124 [1] NCCL INFO === System : maxBw 1.2 totalBw 100.0 ===
ip-10-0-103-105:41:124 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/1)
ip-10-0-103-105:41:124 [1] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-103-105:41:124 [1] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-103-105:41:124 [1] NCCL INFO + PCI[12.0] - GPU/1B0 (16)
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1D0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:41:124 [1] NCCL INFO + PCI[12.0] - GPU/1C0 (17)
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1E0
ip-10-0-103-105:41:124 [1] NCCL INFO + PCI[12.0] - GPU/1D0 (18)
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1C0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:41:124 [1] NCCL INFO + PCI[12.0] - GPU/1E0 (19)
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[40.0] - GPU/1B0
ip-10-0-103-105:41:124 [1] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:41:124 [1] NCCL INFO ==========================================
ip-10-0-103-105:42:123 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:42:123 [2] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:41:124 [1] NCCL INFO GPU/1B0 :GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/20.000000/NVL) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:41:124 [1] NCCL INFO GPU/1C0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (1/20.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:41:124 [1] NCCL INFO GPU/1D0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (1/40.000000/NVL) GPU/1D0 (0/5000.000000/LOC) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:41:124 [1] NCCL INFO GPU/1E0 :GPU/1B0 (1/40.000000/NVL) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:41:124 [1] NCCL INFO NET/0 :GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-103-105:42:123 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:42:123 [2] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:43:121 [3] NCCL INFO === System : maxBw 1.2 totalBw 100.0 ===
ip-10-0-103-105:43:121 [3] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/1)
ip-10-0-103-105:43:121 [3] NCCL INFO + PCI[12.0] - NIC/50
ip-10-0-103-105:43:121 [3] NCCL INFO               + NET[1.2] - NET/0 (0/0/1.250000)
ip-10-0-103-105:43:121 [3] NCCL INFO + PCI[12.0] - GPU/1B0 (16)
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1D0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:43:121 [3] NCCL INFO + PCI[12.0] - GPU/1C0 (17)
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1E0
ip-10-0-103-105:43:121 [3] NCCL INFO + PCI[12.0] - GPU/1D0 (18)
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1C0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1E0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1B0
ip-10-0-103-105:43:121 [3] NCCL INFO + PCI[12.0] - GPU/1E0 (19)
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1D0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[40.0] - GPU/1B0
ip-10-0-103-105:43:121 [3] NCCL INFO               + NVL[20.0] - GPU/1C0
ip-10-0-103-105:43:121 [3] NCCL INFO ==========================================
ip-10-0-103-105:43:121 [3] NCCL INFO GPU/1B0 :GPU/1B0 (0/5000.000000/LOC) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/20.000000/NVL) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:43:121 [3] NCCL INFO GPU/1C0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (0/5000.000000/LOC) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (1/20.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:43:121 [3] NCCL INFO GPU/1D0 :GPU/1B0 (1/20.000000/NVL) GPU/1C0 (1/40.000000/NVL) GPU/1D0 (0/5000.000000/LOC) GPU/1E0 (1/40.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:40:122 [0] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:43:121 [3] NCCL INFO GPU/1E0 :GPU/1B0 (1/40.000000/NVL) GPU/1C0 (1/20.000000/NVL) GPU/1D0 (1/40.000000/NVL) GPU/1E0 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/12.000000/PHB) NET/0 (3/1.250000/PHB)
ip-10-0-103-105:43:121 [3] NCCL INFO NET/0 :GPU/1B0 (3/1.250000/PHB) GPU/1C0 (3/1.250000/PHB) GPU/1D0 (3/1.250000/PHB) GPU/1E0 (3/1.250000/PHB) CPU/FFFFFFFFFFFFFFFF (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ip-10-0-103-105:40:122 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:41:124 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:41:124 [1] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:40:122 [0] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:41:124 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:41:124 [1] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:43:121 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:43:121 [3] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:43:121 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 2.400000/1.200000, type NVL/PHB, sameChannels 1
ip-10-0-103-105:43:121 [3] NCCL INFO  0 : NET/0 GPU/16 GPU/17 GPU/18 GPU/19 NET/0
ip-10-0-103-105:43:121 [3] NCCL INFO Ring 00 : 18 -> 19 -> 0
ip-10-0-103-105:43:121 [3] NCCL INFO Ring 01 : 18 -> 19 -> 0
ip-10-0-103-105:42:123 [2] NCCL INFO Ring 00 : 17 -> 18 -> 19
ip-10-0-103-105:42:123 [2] NCCL INFO Ring 01 : 17 -> 18 -> 19
ip-10-0-103-105:41:124 [1] NCCL INFO Tree 0 : 16 -> 17 -> 18/8/-1
ip-10-0-103-105:41:124 [1] NCCL INFO Tree 1 : 16 -> 17 -> 18/-1/-1
ip-10-0-103-105:41:124 [1] NCCL INFO Ring 00 : 16 -> 17 -> 18
ip-10-0-103-105:41:124 [1] NCCL INFO Ring 01 : 16 -> 17 -> 18
ip-10-0-103-105:40:122 [0] NCCL INFO Tree 0 : 0 -> 16 -> 17/-1/-1
ip-10-0-103-105:40:122 [0] NCCL INFO Tree 1 : 12 -> 16 -> 17/-1/-1
ip-10-0-103-105:40:122 [0] NCCL INFO Ring 00 : 15 -> 16 -> 17
ip-10-0-103-105:40:122 [0] NCCL INFO Ring 01 : 15 -> 16 -> 17
train: Scanning /opt/ml/input/data/train/labels... 1560 images, 11 backgrounds, 0 corrupt: 100%|██████████| 1571/1571 [00:00<00:00, 5027.10it/s]
train: Caching images (4.5GB ram): 100%|██████████| 78/78 [00:00<00:00, 1230.84it/s]
malloc_consolidate(): invalid chunk size
[2024-04-12 01:21:28,792] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 3 (pid: 43) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/code/yolov5/train.py FAILED
---------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-12_01:21:28
  host      : algo-5
  rank      : 19 (local_rank: 3)
  exitcode  : -6 (pid: 43)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 43
===================================================
Master Host: algo-1
Current Host: algo-5
Node Rank:  4
Hosts:  ['algo-1', 'algo-2', 'algo-3', 'algo-4', 'algo-5']
Command '['python3', '-m', 'torch.distributed.run', '--nproc_per_node', '4', '--nnodes', '5', '--node_rank', '4', '--master_addr', 'algo-1', '--master_port', '29500', '/code/yolov5/train.py', '--img-size', '1280', '--batch', '320', '--epochs', '250', '--weights', 'yolov5n6.pt', '--data', '/opt/ml/input/data/train/MMX059XA_COVERED5B.yaml', '--hyp', 'hyp.no-augmentation.yaml', '--project', '/opt/ml/output/data/', '--name', 'results', '--patience', '100', '--workers', '8', '--optimizer', 'SGD', '--device', '0,1,2,3', '--cache', '--exist-ok']' returned non-zero exit status 1.

Questions

1. The error reported in this output is malloc_consolidate(): invalid chunk size. Could this be an error exclusive to torch.distributed.run?

2. When downloading the output of a multi-node DDP training job (for example, the output of my ml.g4dn.12xlarge run), the following shows:

[image: screenshot of the downloaded output files] Is this normal when adding the --exist-ok flag to train.py, or could these be the parallel results from each of the nodes? In addition, when extracting the files, the newer files simply replace the older ones and merge into a single set, which was fine, but should I just ignore this and let the replacement happen on extraction?

Many thanks in advance for taking the time to read this, and for sharing any potential solutions you may have gained from previous encounters with similar problems.

glenn-jocher commented 5 months ago

Hello @kerrlabajo! It's great to hear that setting the NCCL_SOCKET_IFNAME=eth0 environment variable resolved your connection issues for multi-machine DDP training. Networking configurations often play a critical role in distributed setups, so this is a valuable insight for others facing similar challenges. 👍
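
For anyone hitting the same multi-node connection problem, here is a minimal sketch of how that fix can be applied in a custom entrypoint. NCCL_SOCKET_IFNAME is a real NCCL environment variable that pins its traffic to a given network interface; the launcher function itself is illustrative, assuming an invocation like the one shown in the failing command above:

import os
import subprocess

def launch_ddp(node_rank, nnodes, master_addr, train_args):
    """Sketch: export NCCL_SOCKET_IFNAME before launching torch.distributed.run
    so NCCL binds to the container's eth0 interface on every node."""
    env = os.environ.copy()
    env["NCCL_SOCKET_IFNAME"] = "eth0"  # route NCCL traffic over eth0
    env["NCCL_DEBUG"] = "INFO"          # optional: emits topology logs like those above

    cmd = [
        "python3", "-m", "torch.distributed.run",
        "--nproc_per_node", "4",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--master_addr", master_addr,
        "--master_port", "29500",
        "/code/yolov5/train.py",
    ] + train_args
    subprocess.run(cmd, check=True, env=env)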

Regarding your questions:

  1. malloc_consolidate(): invalid chunk size Error: This error is generally associated with native memory corruption rather than anything specific to torch.distributed.run. It indicates a problem with memory allocation, which can be caused by various factors, including insufficient memory for the process or a bug in the software. Given that this occurred on a high-resource instance (ml.p3.8xlarge), it suggests there may be an issue with how memory is being managed, or an edge-case bug in the code or libraries being used. Debugging such issues can be tricky, but ensuring your environment has sufficient resources and running memory profiling may help; a small debugging sketch follows this list.

  2. Output Files with --exist-ok Flag: If you see multiple output files with similar names from your DDP training jobs, it's indeed possible they are results from different nodes. The --exist-ok flag allows the training script to overwrite existing files in the output directory, which eases file management in repeated training experiments. When you download and extract these outputs, letting the extraction overwrite files is fine if you're primarily interested in the final, consolidated results. However, if you need to analyze results from individual nodes separately, you'd want to modify your setup to save outputs to node-specific directories or collect them with unique naming conventions (a sketch of one such approach appears below).
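
As a concrete way to act on the memory-profiling suggestion in point 1, one low-effort first step is to relaunch with the in-RAM image cache disabled and fewer dataloader workers, which tests whether memory pressure is what corrupts the native heap. This is only a debugging sketch built from flags already present in the failing command; the reduced values (2 workers, batch 64) are arbitrary placeholders, not recommendations:

def make_conservative(train_args):
    """Return a copy of the train.py args with --cache removed and
    --workers / --batch lowered, to check whether memory pressure
    triggers the malloc_consolidate() crash."""
    args = [a for a in train_args if a != "--cache"]
    for flag, value in (("--workers", "2"), ("--batch", "64")):
        if flag in args:
            args[args.index(flag) + 1] = value
    return args

# Example with flags from the failing command above:
print(make_conservative(["--img-size", "1280", "--batch", "320", "--workers", "8", "--cache"]))
# ['--img-size', '1280', '--batch', '64', '--workers', '2']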

Your observations about the file outputs indicate that the parallel execution across nodes is working as expected. If differentiating node-specific results matters for your analysis or debugging, consider annotating or segregating the outputs for clarity.
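
If node-specific results need to stay separate, one option is to derive a unique run name per node before passing --name to train.py. The file /opt/ml/input/config/resourceconfig.json and its current_host field are standard SageMaker training inputs; the helper itself is an illustrative sketch, not part of this repository:

import json

def node_specific_name(base_name="results",
                       config_path="/opt/ml/input/config/resourceconfig.json"):
    """Append the current host (e.g. 'algo-5') to the run name so each
    node writes to its own directory instead of colliding under --exist-ok."""
    with open(config_path) as f:
        current_host = json.load(f)["current_host"]
    return f"{base_name}-{current_host}"

# e.g. pass ["--name", node_specific_name()] instead of ["--name", "results"]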

Thank you for sharing your detailed experience and insights, which undoubtedly contribute valuable knowledge to the community. If further issues arise or more assistance is needed, feel free to reach out! 🚀

github-actions[bot] commented 4 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐