intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch to easily obtain performance on Intel platforms
Apache License 2.0

reduce_scatter_tensor raises ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY in multi-node usage #640

Open garrett361 opened 1 month ago

garrett361 commented 1 month ago

Describe the bug

Repeated calls into torch.distributed.reduce_scatter_tensor eventually raise a ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY error in multi-node setups. Similar behavior occurs when using Fully Sharded Data Parallel (FSDP), which calls reduce_scatter_tensor internally.

Script to reproduce is below. Steps:

  1. Create source and destination tensors on all ranks in a multi-node setup.
  2. Repeatedly call reduce_scatter_tensor and print out memory readings at each step.
  3. Eventually, the above error is raised (without any corresponding jump in the memory readings).
import argparse
import os

import intel_extension_for_pytorch as ipex  # noqa
import oneccl_bindings_for_pytorch  # noqa
import torch
import torch.distributed as dist

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dim",
        type=int,
        default=2**30,
    )
    parser.add_argument(
        "--dtype",
        type=str,
        default="bfloat16",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=100,
    )
    args = parser.parse_args()
    return args

def main(dim: int, dtype: str, max_steps: int) -> None:
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"xpu:{local_rank}")
    torch.xpu.set_device(device)

    # Force dim to be divisible by the world size
    new_dim = world_size * (dim // world_size)
    if new_dim != dim:
        if not rank:
            print(
                f"Adjusting original {dim=} to {new_dim} in order to be divisible by {world_size=}",
                flush=True,
            )
        dim = new_dim

    try:
        dist.init_process_group("ccl")

        t_in = torch.randn(dim, dtype=getattr(torch, dtype), device=device)
        t_out = torch.empty(dim // world_size, dtype=getattr(torch, dtype), device=device)

        for step in range(1, max_steps + 1):
            dist.reduce_scatter_tensor(t_out, t_in, op=dist.ReduceOp.SUM)
            torch.xpu.synchronize()
            peak_mem_gib = torch.xpu.memory_stats()["allocated_bytes.all.peak"] / 2**30
            current_mem_gib = torch.xpu.memory_stats()["allocated_bytes.all.current"] / 2**30
            print(f"[{rank=}]: {step=} memory {peak_mem_gib=}, {current_mem_gib=}", flush=True)

    finally:
        dist.destroy_process_group()

if __name__ == "__main__":
    args = get_args()
    main(**vars(args))

Example logs:

[... snip ...]
[rank=14]: step=27 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=13]: step=27 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=17]: step=27 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=6]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=4]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=2]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=8]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=10]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=7]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=1]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=11]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=3]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=9]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=20]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=0]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=5]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=19]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=21]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=22]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=23]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=16]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=15]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=18]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=12]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=14]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=13]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=17]: step=28 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=6]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=11]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=2]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=8]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=10]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=0]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=4]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=1]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=3]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=9]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=20]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=7]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=5]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=23]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=22]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=12]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=18]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=15]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=14]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=16]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=21]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=13]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=19]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
[rank=17]: step=29 memory peak_mem_gib=2.083984375, current_mem_gib=2.083984375
2024:05:29-19:16:18:(202165) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202162) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202164) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202173) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202167) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202166) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(149693) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202168) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202163) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
/lus/gila/projects/Aurora_deployment/mk/decoders/alcf/set_torch_dist_env.sh: line 25: 200400 Aborted                 $@
x1921c5s2b0n0.hostmgmt2000.cm.americas.sgi.com: rank 6 exited with code 134
x1921c5s2b0n0.hostmgmt2000.cm.americas.sgi.com: rank 0 died from signal 15
2024:05:29-19:16:18:(149692) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY

The behavior seems specific to multi-node setups. I have not seen the same error raised on a single node.

Versions

Collecting environment information...
PyTorch version: 2.1.0.post2+cxx11.abi
PyTorch CXX11 ABI: Yes
IPEX version: 2.1.30+xpu
IPEX commit: 474a6b3cb
Build type: Release

OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
GCC version: (Spack GCC) 12.2.0
Clang version: N/A
IGC version: 2024.1.0 (2024.1.0.20240308)
CMake version: version 3.27.5
Libc version: glibc-2.31

Python version: 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-5.14.21-150400.24.55-default-x86_64-with-glibc2.31
Is XPU available: True
DPCPP runtime version: latest
MKL version: latest
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 208
On-line CPU(s) list: 0-207
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU Max 9470C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 52
Socket(s): 2
Stepping: 8
Frequency boost: enabled
CPU max MHz: 2001.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 amx_tile flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 4.9 MiB (104 instances)
L1i cache: 3.3 MiB (104 instances)
L2 cache: 208 MiB (104 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-51,104-155
NUMA node1 CPU(s): 52-103,156-207
NUMA node2 CPU(s):
NUMA node3 CPU(s):
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.30+xpu
[pip3] numpy==1.23.5
[pip3] torch==2.1.0.post2+cxx11.abi
[pip3] torchvision==0.16.0.post2+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.30+xpu pypi_0 pypi
[conda] mkl 2024.1.0 intel_642 intel
[conda] mkl-dpcpp 2024.1.0 intel_642 intel
[conda] mkl-service 2.4.0 py39hc591bdc_44 intel
[conda] mkl_fft 1.3.8 py39h6b114c4_70 intel
[conda] mkl_random 1.2.4 py39h841069b_90 intel
[conda] mkl_umath 0.1.1 py39h843e89b_100 intel
[conda] numpy 1.23.5 pypi_0 pypi
[conda] onemkl-sycl-blas 2024.1.0 intel_642 intel
[conda] onemkl-sycl-datafitting 2024.1.0 intel_642 intel
[conda] onemkl-sycl-dft 2024.1.0 intel_642 intel
[conda] onemkl-sycl-lapack 2024.1.0 intel_642 intel
[conda] onemkl-sycl-rng 2024.1.0 intel_642 intel
[conda] onemkl-sycl-sparse 2024.1.0 intel_642 intel
[conda] onemkl-sycl-stats 2024.1.0 intel_642 intel
[conda] onemkl-sycl-vm 2024.1.0 intel_642 intel
[conda] torch 2.1.0.post2+cxx11.abi pypi_0 pypi
[conda] torchvision 0.16.0.post2+cxx11.abi pypi_0 pypi

garrett361 commented 1 month ago

Other notes:

  1. I also tried a similar script with all_gather_into_tensor, but did not manage to trigger an OOM.
  2. Perhaps it is a coincidence, but each GPU device in the above tests (Intel 1550s, one device per tile) has 60 GiB of memory, the input tensors are about 2 GiB each, and the script crashes just before the 30th step. Since 30 * 2 GiB = 60 GiB, it seems plausible that the input tensors are somehow being copied and not freed, building up to an OOM by the 30th step; see the rough arithmetic sketch below.
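
For reference, a rough sketch of the arithmetic behind this guess (the ~60 GiB-per-tile and ~2 GiB-per-input figures are the ones quoted above; the copy-without-free mechanism itself is only a hypothesis):

# Back-of-the-envelope check of the "input buffer copied and never freed" guess.
dim = 2**30                               # elements per input tensor in the repro
bytes_per_elem = 2                        # bfloat16
input_gib = dim * bytes_per_elem / 2**30  # = 2.0 GiB per rank per step
tile_mem_gib = 60                         # approximate memory per 1550 tile, as above
print(tile_mem_gib / input_gib)           # = 30.0, matching the step at which the OOM appears
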
garrett361 commented 1 month ago

This seems to be a versioning and/or environment issue.

Discovered that when I revert to an older environment, the reduce_scatter_tensor test works fine.

Diff of the results from running collect_env.py in the two environments:

--- collect_env_frameworks_2024.04.15.002.txt   2024-05-29 19:13:16.000000000 +0000
+++ collect_env_frameworks_2023.12.15.001.txt   2024-05-29 23:46:40.000000000 +0000
@@ -1,14 +1,14 @@
 Collecting environment information...
-PyTorch version: 2.1.0.post2+cxx11.abi
+PyTorch version: 2.1.0a0+cxx11.abi
 PyTorch CXX11 ABI: Yes
-IPEX version: 2.1.30+xpu
-IPEX commit: 474a6b3cb
+IPEX version: 2.1.10+xpu
+IPEX commit: a12f9f650
 Build type: Release

 OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
 GCC version: (Spack GCC) 12.2.0
 Clang version: N/A
-IGC version: 2024.1.0 (2024.1.0.20240308)
+IGC version: 2024.0.0 (2024.0.0.20231017)
 CMake version: version 3.27.5
 Libc version: glibc-2.31

@@ -18,18 +18,18 @@
 DPCPP runtime version: latest
 MKL version: latest
 GPU models and configuration: 
-[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
 Intel OpenCL ICD version: N/A
 Level Zero version: N/A

@@ -76,25 +76,25 @@
 Vulnerability Tsx async abort:   Not affected

 Versions of relevant libraries:
-[pip3] intel-extension-for-pytorch==2.1.30+xpu
+[pip3] intel-extension-for-pytorch==2.1.10+xpu
 [pip3] numpy==1.23.5
-[pip3] torch==2.1.0.post2+cxx11.abi
-[pip3] torchvision==0.16.0.post2+cxx11.abi
-[conda] intel-extension-for-pytorch 2.1.30+xpu               pypi_0    pypi
-[conda] mkl                       2024.1.0              intel_642    intel
-[conda] mkl-dpcpp                 2024.1.0              intel_642    intel
-[conda] mkl-service               2.4.0           py39hc591bdc_44    intel
-[conda] mkl_fft                   1.3.8           py39h6b114c4_70    intel
-[conda] mkl_random                1.2.4           py39h841069b_90    intel
-[conda] mkl_umath                 0.1.1           py39h843e89b_100    intel
+[pip3] torch==2.1.0a0+cxx11.abi
+[pip3] torchvision==0.16.2
+[conda] intel-extension-for-pytorch 2.1.10+xpu               pypi_0    pypi
+[conda] mkl                       2024.0.0            intel_49630    intel
+[conda] mkl-dpcpp                 2024.0.0            intel_49630    intel
+[conda] mkl-service               2.4.0           py39h3539a15_40    intel
+[conda] mkl_fft                   1.3.6           py39h1d81ff8_61    intel
+[conda] mkl_random                1.2.2           py39h5a378b4_81    intel
+[conda] mkl_umath                 0.1.1           py39h2b1685c_91    intel
 [conda] numpy                     1.23.5                   pypi_0    pypi
-[conda] onemkl-sycl-blas          2024.1.0              intel_642    intel
-[conda] onemkl-sycl-datafitting   2024.1.0              intel_642    intel
-[conda] onemkl-sycl-dft           2024.1.0              intel_642    intel
-[conda] onemkl-sycl-lapack        2024.1.0              intel_642    intel
-[conda] onemkl-sycl-rng           2024.1.0              intel_642    intel
-[conda] onemkl-sycl-sparse        2024.1.0              intel_642    intel
-[conda] onemkl-sycl-stats         2024.1.0              intel_642    intel
-[conda] onemkl-sycl-vm            2024.1.0              intel_642    intel
-[conda] torch                     2.1.0.post2+cxx11.abi          pypi_0    pypi
-[conda] torchvision               0.16.0.post2+cxx11.abi          pypi_0    pypi
+[conda] onemkl-sycl-blas          2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-datafitting   2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-dft           2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-lapack        2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-rng           2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-sparse        2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-stats         2024.0.0            intel_49630    intel
+[conda] onemkl-sycl-vm            2024.0.0            intel_49630    intel
+[conda] torch                     2.1.0a0+cxx11.abi          pypi_0    pypi
+[conda] torchvision               0.16.2                   pypi_0    pypi

Not sure yet which changes are causing the OOMs.

garrett361 commented 1 month ago

Maybe related: I'm also finding that if I replace reduce_scatter_tensor with a plain reduce_scatter, I hit:

RuntimeError: ProcessGroupCCL does not support reduce_scatter

This happens in both environments referenced above. Code for all tests can be found here (similar to the above).
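
For reference, a minimal sketch of the two call signatures being compared here, against the stock torch.distributed API (run under the same multi-node launcher and environment as the repro above; the list-based variant is the one ProcessGroupCCL rejects):

import os

import intel_extension_for_pytorch as ipex  # noqa
import oneccl_bindings_for_pytorch  # noqa
import torch
import torch.distributed as dist

dist.init_process_group("ccl")
world_size = dist.get_world_size()
device = torch.device(f"xpu:{int(os.environ['LOCAL_RANK'])}")
torch.xpu.set_device(device)

t_in = torch.ones(world_size * 4, device=device)
t_out = torch.empty(4, device=device)

# Flat-tensor API, as used in the repro script above.
dist.reduce_scatter_tensor(t_out, t_in, op=dist.ReduceOp.SUM)
# List-of-tensors API, which raises "ProcessGroupCCL does not support reduce_scatter".
dist.reduce_scatter(t_out, list(t_in.chunk(world_size)), op=dist.ReduceOp.SUM)

dist.destroy_process_group()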

xiguiw commented 1 month ago

@garrett361

Thanks for trying IPEX.

 what():  oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202166) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'

The log shows CCL, but I did not find your oneCCL version. Could you list your CCL version?

Each IPEX version pairs with a specific CCL version; see https://github.com/intel/torch-ccl/tree/v2.1.300%2Bxpu?tab=readme-ov-file#pytorch-api-align

intel-extension-for-pytorch | PyTorch | oneccl_bindings_for_pytorch
2.1.30+xpu                  | v2.1.0  | ccl_torch2.1.300
2.1.20+xpu                  | v2.1.0  | ccl_torch2.1.200
2.1.10+xpu                  | v2.1.0  | ccl_torch2.1.100

Discovered that when I revert to an older environment, the reduce_scatter_tensor test works fine. Diff of the results from running collect_env.py in the two environments:

So if you install a new IPEX version, oneCCL needs to be updated too.
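
One quick way to report the installed versions for this pairing check (a sketch that reads the pip metadata; the distribution names are taken from the pip freeze output later in this thread):

import importlib.metadata as md

# Print the torch / IPEX / torch-ccl versions so the pairing can be checked
# against the compatibility table in the torch-ccl README linked above.
for pkg in ("torch", "intel-extension-for-pytorch", "oneccl-bind-pt"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")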

coreyjadams commented 1 month ago

Hi all,

Thanks @garrett361 for another bug report! I want to confirm I have reproduced this on Sunspot, with the 2024.1 oneAPI release and corresponding ipex.

oneCCL is linked to the version from 2024.1:

❯ ldd /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/oneccl_bindings_for_pytorch/_C.cpython-39-x86_64-linux-gnu.so | grep ccl
    liboneccl_bindings_for_pytorch.so => /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so (0x000015512f77e000)
    libccl.so.1 => /soft/compilers/oneapi/2024.04.15.001/oneapi/ccl/latest/lib/libccl.so.1 (0x000015510f612000)

I ran this on 4 nodes with 6 PVCs each, so 48 ranks. Like @garrett361, I could not reproduce this on a single node; cross-node communication is required before things break.

garrett361 commented 1 month ago

Thanks @coreyjadams !

Also, while I reported that the reduce_scatter_tensor OOM goes away when reverting to a previous environment on Sunspot (frameworks/2023.12.15.001), I'm still finding that the same ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY gets triggered in that environment when performing multi-node FSDP training. Presumably due to some other collective, but TBD.

I will post follow-up issues once I have a better handle on what's going on.

coreyjadams commented 1 month ago

I also notice a dramatic timing difference between running 12 ranks on one node (very, very fast) and 12 ranks over 2 nodes (a lot slower). Yes, bandwidth is not the same intranode vs. internode, but the difference is dramatic: about 0.22 seconds per step with 12 ranks over 2 nodes versus about 0.02 s with all 12 ranks on one node.

garrett361 commented 1 month ago

@coreyjadams I don’t recall the Sunspot bandwidth numbers exactly, but an order of magnitude drop off for twelve processes per node sounds reasonable to me.

I thought an order-of-magnitude change when going from one node to multiple nodes was typical, IIRC.

xiguiw commented 1 month ago

@garrett361 @coreyjadams

Did you upgrade your driver when you upgraded IPEX from 2.1.10+xpu to 2.1.30+xpu?

IPEX 2.1.30+xpu requires driver LTS 803.29, while IPEX 2.1.10+xpu requires driver 736.25.

-PyTorch version: 2.1.0.post2+cxx11.abi
+PyTorch version: 2.1.0a0+cxx11.abi
 PyTorch CXX11 ABI: Yes
-IPEX version: 2.1.30+xpu
-IPEX commit: 474a6b3cb
+IPEX version: 2.1.10+xpu
+IPEX commit: a12f9f650
 Build type: Release

I am trying to get access to a multi-node environment to reproduce this problem; I'll try to reproduce it (or at least run this) on my side. Please expect a slow response.

Meanwhile, IPEX has released a new wheel package, 2.1.30.post0, with some known issues fixed; one of them is a memory leak. Would you please try this new package on your side and give your feedback? Thanks!

The new package is available at https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu&os=linux%2fwsl2&package=pip

garrett361 commented 1 month ago

Thanks for the pointer @xiguiw ! I'm not sure what driver version was used; will figure it out, test out the new wheel and get back to you ASAP.

garrett361 commented 1 month ago

Could you list your ccl version?

Running pip freeze in the new environment where reduce_scatter_tensor failed gives:

torch==2.1.0.post2+cxx11.abi
oneccl-bind-pt==2.1.300+xpu
intel-extension-for-pytorch==2.1.30+xpu

The old environment in which reduce_scatter_tensor does not OOM has

torch==2.1.0a0+cxx11.abi
oneccl-bind-pt==2.1.100+xpu
intel-extension-for-pytorch==2.1.10+xpu

@xiguiw is this the info you wanted regarding ccl above?

garrett361 commented 1 month ago

Some updates:

module use /soft/preview-modulefiles/24.086.0
module load intel_compute_runtime/release/803.29

(Is there some command to more programmatically confirm which drivers I'm using?)
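
(A partial programmatic check, as a sketch: the XPU device properties exposed through IPEX include a Level Zero driver_version string, as in the collect_env output above, though that string is not the same identifier as the agama/LTS release number such as 803.29.)

import torch
import intel_extension_for_pytorch as ipex  # noqa

# Prints e.g. driver_version='1.3.27642' for each visible XPU device.
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_properties(i))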

torch==2.1.0.post2+cxx11.abi
oneccl-bind-pt==2.1.300+xpu
intel_extension_for_pytorch==2.1.30.post0

With the new ipex wheel, the results remain the same as initially reported: reduce_scatter_tensor OOMs (and reduce_scatter gives a RuntimeError).

Summary table:

torch / torch-ccl / ipex versions                  | reduce_scatter_tensor result         | reduce_scatter result
2.1.0a0+cxx11.abi / 2.1.100+xpu / 2.1.10+xpu       | Success                              | RuntimeError: ProcessGroupCCL does not support reduce_scatter
2.1.0.post2+cxx11.abi / 2.1.300+xpu / 2.1.30+xpu   | ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY | RuntimeError: ProcessGroupCCL does not support reduce_scatter
2.1.0.post2+cxx11.abi / 2.1.300+xpu / 2.1.30.post0 | ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY | RuntimeError: ProcessGroupCCL does not support reduce_scatter

I started using this code to test the various collectives.

xiguiw commented 1 month ago

@garrett361 Yes, this is the CCL version I wanted.

Thanks for confirming the issue on the new release package. We will look into it and let you know the progress.

garrett361 commented 1 month ago

Updates: the OOM only seems to appear for sufficiently large message sizes. If I run the same script but with 2 ** 28 elements in the reduce-scatter (rather than 2 ** 30), no OOM is hit, even after many iterations. The threshold for OOMing is somewhere around ~1 GiB on the two-node, 1550 GPU setup I'm using (24 processes, one per tile).

Also, I found that the oneCCL versions differed between the successful and unsuccessful reduce_scatter_tensor tests above: oneCCL 2021.11 was used with the successful torch-ccl 2.1.100+xpu tests, while oneCCL 2021.12 was used for the OOM-ing torch-ccl 2.1.300+xpu tests. There were many changes to the reduce-scatter code between these two oneCCL versions. I did not see many relevant-looking changes across the ipex and torch-ccl versions in the tables above.

xiguiw commented 1 month ago

@garrett361 Thanks for your feedback.

I am still in the process of getting access to the cluster. I'll try to reproduce the issue as soon as I get access.

garrett361 commented 1 month ago

Thanks @xiguiw, I appreciate it.

xiguiw commented 4 weeks ago

@garrett361 I got access to the cluster this morning. I'll try to reproduce this problem ASAP. Thanks!

xiguiw commented 3 weeks ago

@garrett361

It seems you have some way to set these environment variables for each node and device/tile. I looked at the code at https://gist.github.com/garrett361 but did not find how (it seems the launch program sets them):

    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"xpu:{local_rank}")
    torch.xpu.set_device(device)

This is my first time running a script on multiple nodes. I'm trying to figure out how to set the rank/local_rank from other examples.

garrett361 commented 3 weeks ago

Hi @xiguiw , yes, those variables are set automatically by torchrun, if you’re using that to launch. Otherwise, they need to be set manually by your launch script based on e.g. SLURM environment variables in a SLURM setup. If you can get torchrun working, that is probably easiest; please see the docs.
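
If torchrun is not an option, a minimal sketch of populating these variables from mpi4py under an mpiexec launch (not the launcher used above; MASTER_PORT is an arbitrary free port, and this assumes every rank can reach rank 0's hostname over TCP):

import os

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()
hostname = MPI.Get_processor_name()
all_hosts = comm.allgather(hostname)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
# Local rank = position of this rank among the ranks sharing its host.
os.environ["LOCAL_RANK"] = str(sum(1 for i, h in enumerate(all_hosts) if h == hostname and i < rank))
# Rank 0's host serves as the rendezvous address for init_process_group.
os.environ["MASTER_ADDR"] = comm.bcast(hostname, root=0)
os.environ.setdefault("MASTER_PORT", "29500")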

xiguiw commented 3 weeks ago

Hi @garrett361,

Thank you for pointing me to the docs. I found this in them:

Deployment
Multi-node multi-worker: Start the launcher with the same arguments on all the nodes participating in training.
When using a job/cluster manager the entry point command to the multi-node job should be this launcher.

It asks to run torchrun on each node. I reserved multiple nodes with qsub, but it is as if only one node is visible to me (I believe there is some way to access the others, but I don't know it yet).

I did launch an example successfully with mpiexec, so I will try to get this working with mpiexec.

garrett361 commented 3 weeks ago

Ah @xiguiw, are you using Sunspot? If so, some guidance is available here, which is mpiexec-based, but IIRC those scripts need some additional work to set the env vars properly.

I also just wrote out a minimal set of launch scripts here, which should work. Usage:

  1. Edit the launch_mpi_min.sh script to use the correct path to the set_torch_dist_env.sh wrapper.
  2. Get two nodes: qsub -l select=2 -l walltime=00:30:00 -A Aurora_deployment -q workq -I
  3. Set SCRIPT_PATH and ARGS as appropriate and launch. E.g.
    export SCRIPT_PATH=<path-to-reduce-scatter-script>
    export ARGS="--max-steps 500"
    ./launch_mpi_min.sh

Let me know if you hit more issues, thanks!

xiguiw commented 3 weeks ago

@garrett361 Thank you very much for providing these scripts! They are very good examples.

None of us knows about Sunspot. I don't think I am using Sunspot, because I found many environment differences (from your script settings), but the OS and GPU model/count are the same.

With the mpiexec launcher, I can now get the world size, rank, local_rank, etc. from the MPI object in mpi4py.

Sorry for the long delay; I can start my work now :). Will give you feedback soon.

Thank you for your help!

xiguiw commented 3 weeks ago

@garrett361 I reproduced this problem on my side on two nodes. On a single node there is no such issue (even when looping for 500 steps).

I'll investigate this and give you feedback on any progress. Thanks!

xiguiw commented 3 weeks ago

Hi @garrett361 ,

Would you please set export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0 and try on your side?

I cannot reproduce the problem with this setting.

This seems to be a problem with oneCCL in oneAPI 2024.1.

The development team is investigating this issue.

This could be a workaround for the issue: export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0 uses the oneCCL kernel from 2024.0.

BR, Xigui

garrett361 commented 3 weeks ago

Ah interesting, I will try it.

This seems to be a problem with oneCCL in oneAPI 2024.1.

Yes, looking through the diffs of all the seemingly-relevant packages, I also thought a oneCCL 2024.1 issue seemed to be the most likely root cause.

garrett361 commented 3 weeks ago

Would you please set export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0 and try on your side?

I am also seeing that this setting eliminates the reduce scatter OOMs.

FYI, I also tried using this env var to address the very similar issues here, but the OOMs were not eliminated in this case. I also tried setting export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0, but the script still OOM-ed. Any other ideas there?

xiguiw commented 3 weeks ago

Would you please set export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0 and try on your side?

I am also seeing that this setting eliminates the reduce scatter OOMs.

FYI, I also tried using this env var to address the very similar issues here, but the OOMs were not eliminated in this case. I also tried setting export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0, but the script still OOM-ed. Any other ideas there?

This issue is a regression caused by some changes in oneCCL from 2024.0 to 2024.1 (oneCCL is part of oneAPI, and torch-ccl wraps oneCCL to adapt it to PyTorch).

This issue does not occur with oneAPI Base Kit 2024.0 (IPEX 2.1.10) but does with Base Kit 2024.1 (IPEX 2.1.30), whereas GitHub issue #646 happens on both oneAPI 2024.0 and 2024.1. They are different issues.

garrett361 commented 3 weeks ago

Yes, I agree. The above experiments are extra confirmation that the current issue and #646 are at least somewhat independent, despite raising the same errors and having similar characteristics (OOMs only for sufficiently large data).

xiguiw commented 3 weeks ago

@garrett361

I verified that this issue is fixed in oneAPI 2024.2.

FYI, here is the feedback from the oneCCL development team:

I can confirm that the 2024.2 oneAPI release (2021.13 oneCCL/Intel MPI) contains new memory management mechanisms. 2024.1 had some solutions that were not designed to work with ReduceScatter, so it might have resulted in increased memory consumption overall. I'm not sure if they are interested in our source code, but here's the new implementation, which takes multiple factors into account before executing a collective and then runs the collective operation in smaller chunks:

https://github.com/oneapi-src/oneCCL/blob/0eb5987ab930053f81a060a007f2dbfd9f7bf7cc/src/coll/algorithms/algorithm_utils.cpp#L516
https://github.com/oneapi-src/oneCCL/blob/0eb5987ab930053f81a060a007f2dbfd9f7bf7cc/src/coll/algorithms/algorithm_utils.cpp#L357

Also, if they would like to experiment more with memory usage, they can try the CCL_ZE_TMP_BUF_SIZE variable, which allows a user to tune the memory consumption of oneCCL topo algorithms. By default it's set to 536870912 bytes (512 MiB). They could then use sysmon to observe that memory usage goes down with smaller values of CCL_ZE_TMP_BUF_SIZE.

garrett361 commented 3 weeks ago

I'm not sure if they are interested in our source code

Definitely always interested in source code, thank you! And thank you in general @xiguiw , you've been extremely helpful and I appreciate it.

I verified that this issue is fixed in oneAPI 2024.2.

That is great, glad to hear it! Would you possibly be able to also test #646 against oneAPI 2024.2? I will also try on my end, but I will need some time to set up the environment.

I will also try out the CCL_ZE_TMP_BUF_SIZE variable.

xiguiw commented 3 weeks ago

That is great, glad to hear it! Would you possibly be able to also test #646 against oneAPI 2024.2? I will also try on my end, but I will need some time to set up the environment.

Sure, I'll test #646 against oneAPI 2024.2, but with a few days' delay, as I am preparing for a local conference in the next two days. I will give you feedback next Monday.

xiguiw commented 2 weeks ago

@garrett361 I tested #646 against oneAPI 2024.2. Feedback is in #646 already.