Open garrett361 opened 1 month ago
Other notes:
`all_gather_into_tensor`, but did not manage to trigger an OOM.
This seems to be a versioning and/or environment issue.
Discovered that when I revert to an older environment, the `reduce_scatter_tensor` test works fine.
Diff of the results from running `collect_env.py` in the two environments:
--- collect_env_frameworks_2024.04.15.002.txt 2024-05-29 19:13:16.000000000 +0000
+++ collect_env_frameworks_2023.12.15.001.txt 2024-05-29 23:46:40.000000000 +0000
@@ -1,14 +1,14 @@
Collecting environment information...
-PyTorch version: 2.1.0.post2+cxx11.abi
+PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: Yes
-IPEX version: 2.1.30+xpu
-IPEX commit: 474a6b3cb
+IPEX version: 2.1.10+xpu
+IPEX commit: a12f9f650
Build type: Release
OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
GCC version: (Spack GCC) 12.2.0
Clang version: N/A
-IGC version: 2024.1.0 (2024.1.0.20240308)
+IGC version: 2024.0.0 (2024.0.0.20231017)
CMake version: version 3.27.5
Libc version: glibc-2.31
@@ -18,18 +18,18 @@
DPCPP runtime version: latest
MKL version: latest
GPU models and configuration:
-[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
-[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
+[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
Intel OpenCL ICD version: N/A
Level Zero version: N/A
@@ -76,25 +76,25 @@
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
-[pip3] intel-extension-for-pytorch==2.1.30+xpu
+[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.23.5
-[pip3] torch==2.1.0.post2+cxx11.abi
-[pip3] torchvision==0.16.0.post2+cxx11.abi
-[conda] intel-extension-for-pytorch 2.1.30+xpu pypi_0 pypi
-[conda] mkl 2024.1.0 intel_642 intel
-[conda] mkl-dpcpp 2024.1.0 intel_642 intel
-[conda] mkl-service 2.4.0 py39hc591bdc_44 intel
-[conda] mkl_fft 1.3.8 py39h6b114c4_70 intel
-[conda] mkl_random 1.2.4 py39h841069b_90 intel
-[conda] mkl_umath 0.1.1 py39h843e89b_100 intel
+[pip3] torch==2.1.0a0+cxx11.abi
+[pip3] torchvision==0.16.2
+[conda] intel-extension-for-pytorch 2.1.10+xpu pypi_0 pypi
+[conda] mkl 2024.0.0 intel_49630 intel
+[conda] mkl-dpcpp 2024.0.0 intel_49630 intel
+[conda] mkl-service 2.4.0 py39h3539a15_40 intel
+[conda] mkl_fft 1.3.6 py39h1d81ff8_61 intel
+[conda] mkl_random 1.2.2 py39h5a378b4_81 intel
+[conda] mkl_umath 0.1.1 py39h2b1685c_91 intel
[conda] numpy 1.23.5 pypi_0 pypi
-[conda] onemkl-sycl-blas 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-datafitting 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-dft 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-lapack 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-rng 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-sparse 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-stats 2024.1.0 intel_642 intel
-[conda] onemkl-sycl-vm 2024.1.0 intel_642 intel
-[conda] torch 2.1.0.post2+cxx11.abi pypi_0 pypi
-[conda] torchvision 0.16.0.post2+cxx11.abi pypi_0 pypi
+[conda] onemkl-sycl-blas 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-datafitting 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-dft 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-lapack 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-rng 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-sparse 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-stats 2024.0.0 intel_49630 intel
+[conda] onemkl-sycl-vm 2024.0.0 intel_49630 intel
+[conda] torch 2.1.0a0+cxx11.abi pypi_0 pypi
+[conda] torchvision 0.16.2 pypi_0 pypi
Not sure yet which changes are causing the OOMs.
Maybe related: I'm also finding that if I replace `reduce_scatter_tensor` with a plain `reduce_scatter`, I hit:
RuntimeError: ProcessGroupCCL does not support reduce_scatter
This happens in both environments referenced above. Code for all tests can be found here (similar to the above).
@garrett361
Thanks for trying IPEX.
what(): oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
2024:05:29-19:16:18:(202166) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_call.cpp:28 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
terminate called after throwing an instance of 'ccl::v1::exception'
The log shows a CCL error, but I did not find your oneCCL version. Could you list your CCL version?
Each IPEX version has one matching CCL version; see https://github.com/intel/torch-ccl/tree/v2.1.300%2Bxpu?tab=readme-ov-file#pytorch-api-align
intel-extension-for-pytorch 2.1.10+xpu | intel-extension-for-pytorch 2.1.20+xpu | intel-extension-for-pytorch 2.1.30+xpu |
---|---|---|
v2.1.0 ccl_torch2.1.100 | v2.1.0 ccl_torch2.1.200 | v2.1.0 ccl_torch2.1.300 |
> Discovered that when I revert to an older environment, the reduce_scatter_tensor test works fine. Diff of the results from running collect_env.py in the two environments:
So if you install a new IPEX version, oneCCL needs to be updated too.
Hi all,
Thanks @garrett361 for another bug report! I want to confirm I have reproduced this on Sunspot, with the 2024.1 oneAPI release and corresponding ipex.
oneCCL is linked to the version from 2024.1:
❯ ldd /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/oneccl_bindings_for_pytorch/_C.cpython-39-x86_64-linux-gnu.so | grep ccl
liboneccl_bindings_for_pytorch.so => /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so (0x000015512f77e000)
libccl.so.1 => /soft/compilers/oneapi/2024.04.15.001/oneapi/ccl/latest/lib/libccl.so.1 (0x000015510f612000)
I ran this on 4 nodes with 6 PVC GPUs each (2 tiles per GPU, one rank per tile), so 48 ranks. Like @garrett361, I could not reproduce this on a single node; it required cross-node communication before things broke.
Thanks @coreyjadams !
Also, while I reported that the `reduce_scatter_tensor` OOM goes away when reverting to a previous environment on Sunspot (`frameworks/2023.12.15.001`), I'm still finding that the same `ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` gets triggered in that environment when performing multi-node FSDP training. Presumably due to some other collective, but TBD.
I will post follow-up issues once I have a better handle on what's going on.
I also notice a dramatic timing difference between running 12 ranks on one node (very fast) and 12 ranks over 2 nodes (much slower). Yes, bandwidth is not the same intranode vs. internode, but the difference is dramatic: about 0.22 seconds per step with 12 ranks over 2 nodes, versus about 0.02 seconds with all 12 ranks on one node.
@coreyjadams I don’t recall the Sunspot bandwidth numbers exactly, but an order of magnitude drop off for twelve processes per node sounds reasonable to me.
Thought an order of magnitude change in going from one node to multi-node was typical, IIRC
@garrett361 @coreyjadams
Did you upgrade your driver when you upgraded IPEX from 2.1.10+xpu to 2.1.30+xpu?
IPEX 2.1.30+xpu requires driver LTS 803.29, while IPEX 2.1.10+xpu requires driver 736.25.
-PyTorch version: 2.1.0.post2+cxx11.abi
+PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: Yes
-IPEX version: 2.1.30+xpu
-IPEX commit: 474a6b3cb
+IPEX version: 2.1.10+xpu
+IPEX commit: a12f9f650
Build type: Release
I am trying to get access to a multi-node environment to reproduce this problem (or at least to be able to run this on my side). Please expect a slow response.
Meanwhile, IPEX has released a new wheel package, `2.1.30.post0`, which fixes some known issues, one of them a memory leak.
Would you please try this new package on your side and give your feedback?
Thanks!
The new package is available at https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu&os=linux%2fwsl2&package=pip
Thanks for the pointer @xiguiw ! I'm not sure what driver version was used; will figure it out, test out the new wheel and get back to you ASAP.
> Could you list your ccl version?

Running `pip freeze` in the new environment where `reduce_scatter_tensor` failed gives:
torch==2.1.0.post2+cxx11.abi
oneccl-bind-pt==2.1.300+xpu
intel-extension-for-pytorch==2.1.30+xpu
The old environment in which `reduce_scatter_tensor` does not OOM has
torch==2.1.0a0+cxx11.abi
oneccl-bind-pt==2.1.100+xpu
intel-extension-for-pytorch==2.1.10+xpu
@xiguiw is this the info you wanted regarding ccl above?
Some updates:
module use /soft/preview-modulefiles/24.086.0
module load intel_compute_runtime/release/803.29
(Is there some command to more programmatically confirm which drivers I'm using?)
`pip freeze` outputs include:
torch==2.1.0.post2+cxx11.abi
oneccl-bind-pt==2.1.300+xpu
intel_extension_for_pytorch==2.1.30.post0
With the new `ipex` wheel, the results remain the same as initially reported: `reduce_scatter_tensor` OOMs (and `reduce_scatter` gives a `RuntimeError`).
Summary table:
torch/torch-ccl/ipex Versions | reduce_scatter_tensor Result | reduce_scatter Result |
---|---|---|
2.1.0a0+cxx11.abi/2.1.100+xpu/2.1.10+xpu | Success | RuntimeError: ProcessGroupCCL does not support reduce_scatter |
2.1.0.post2+cxx11.abi/2.1.300+xpu/2.1.30+xpu | ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY | RuntimeError: ProcessGroupCCL does not support reduce_scatter |
2.1.0.post2+cxx11.abi/2.1.300+xpu/2.1.30.post0 | ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY | RuntimeError: ProcessGroupCCL does not support reduce_scatter |
I started using this code to test the various collectives.
@garrett361 Yes, this is the CCL version I wanted.
Thanks for confirming the issue with the new release package. We will look into this issue and let you know the progress.
Updates: the OOM only seems to appear for sufficiently large message sizes. If I run the same script but with `2 ** 28` elements in the reduce-scatter (rather than `2 ** 30`), no OOM is hit, even after many iterations. The threshold for OOMing is somewhere around 1 GiB on the two-node Max 1550 GPU setup I'm using (24 processes, one per tile).
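To put the two element counts in bytes, here is a small sketch. The dtype used by the test script is not stated in this comment, so the element sizes below are assumptions for illustration, and `message_gib` is a hypothetical helper name.

```python
# Buffer size of the reduce-scatter input for the two element counts above.
# The dtype of the test tensor is not stated in the thread, so two common
# element sizes are tabulated here (an assumption, not a measurement).
BYTES_PER_ELEMENT = {"bfloat16": 2, "float32": 4}


def message_gib(num_elements: int, bytes_per_element: int) -> float:
    """Return the size in GiB of a tensor with the given element count."""
    return num_elements * bytes_per_element / 2**30


for exponent in (28, 30):
    for dtype, size in BYTES_PER_ELEMENT.items():
        gib = message_gib(2**exponent, size)
        print(f"2**{exponent} elements of {dtype}: {gib:.2f} GiB")
```

Under these assumptions the passing case is at most 1 GiB and the failing case is at least 2 GiB, which is in line with a threshold of roughly 1 GiB.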
Also, I found that the oneCCL versions differed between the successful and unsuccessful `reduce_scatter_tensor` tests above: oneCCL version `2021.11` was used with the successful `torch-ccl` version `2.1.100+xpu` tests, while oneCCL `2021.12` was used for the OOM-ing `torch-ccl` `2.1.300+xpu` tests. There were many changes to reduce-scatter code between these two oneCCL versions. I did not see many relevant-looking changes across the `ipex` and `torch-ccl` versions in the tables above.
@garrett361 Thanks for your feedback.
I am still in the process of getting access to the cluster. I'll try to reproduce this as soon as I have access.
Thanks @xiguiw, I appreciate it.
@garrett361 I got access to the cluster this morning. I'll try to reproduce this problem ASAP. Thanks!
@garrett361
It seems you have some way to set these environment variables for each node and device/tile. I looked through the code in https://gist.github.com/garrett361 but did not find where they are set (it seems the launch program sets them).
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"xpu:{local_rank}")
torch.xpu.set_device(device)
This is my first time running a script on multiple nodes. I'm trying to figure out how to set the rank/local_rank from other examples.
Hi @xiguiw, yes, those variables are set automatically by `torchrun`, if you’re using that to launch. Otherwise, they need to be set manually by your launch script based on e.g. SLURM environment variables in a SLURM setup. If you can get `torchrun` working, that is probably easiest; please see the docs.
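As a rough sketch of what "set manually by your launch script" can look like, the helper below reads the rank information from whichever launcher variables are present. The function name `resolve_dist_env` and the lookup order are illustrative assumptions; the variable names are the ones exported by `torchrun`, SLURM, and Open MPI, and other launchers (such as the `mpiexec` on a given cluster) may export different ones.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


# (rank, world size, local rank) variable names, tried in this order.
# Covers torchrun, SLURM and Open MPI; extend for other launchers.
_LAUNCHER_VARS = (
    ("RANK", "WORLD_SIZE", "LOCAL_RANK"),
    ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID"),
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_LOCAL_RANK"),
)


def resolve_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Return rank info from the first launcher whose variables are all set."""
    for names in _LAUNCHER_VARS:
        if all(name in env for name in names):
            rank, world_size, local_rank = (int(env[name]) for name in names)
            return DistEnv(rank, world_size, local_rank)
    raise RuntimeError("No known launcher variables found in the environment")


# Example with a SLURM-style environment (values are made up).
print(resolve_dist_env(
    {"SLURM_PROCID": "13", "SLURM_NTASKS": "24", "SLURM_LOCALID": "1"}
))
```

A launch wrapper could export `RANK`, `WORLD_SIZE` and `LOCAL_RANK` from these values so that the test script can keep reading the `torchrun`-style names unchanged.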
Hi @garrett361,
Thank you for pointing me to the docs. I found this there:
Deployment
Multi-node multi-worker: Start the launcher with the same arguments on all the nodes participating in training.
When using a job/cluster manager the entry point command to the multi-node job should be this launcher.
It asks for running `torchrun` on each node.
I reserved multiple nodes with `qsub`, but it is as if only one node is visible to me (I believe there is some way to access the others, but I don't know it yet).
I launched an example successfully with `mpiexec`.
Anyway, I'll try to get it working with `mpiexec`.
Ah @xiguiw are you using Sunspot? If so, some guidance is available here, which is `mpiexec` based, but IIRC those scripts need some additional work to set the env vars properly.
I just wrote out a minimal set of launch scripts here which should work, also. Usage:
1. Edit the `launch_mpi_min.sh` script to use the correct path to the `set_torch_dist_env.sh` wrapper.
2. Get nodes: `qsub -l select=2 -l walltime=00:30:00 -A Aurora_deployment -q workq -I`
3. Set `SCRIPT_PATH` and `ARGS` as appropriate and launch. E.g.
export SCRIPT_PATH=<path-to-reduce-scatter-script>
export ARGS="--max-steps 500"
./launch_mpi_min.sh
Let me know if you hit more issues, thanks!
@garrett361 Thank you very much for providing these scripts! They are very good examples.
None of us knows about Sunspot. I don't think I am using Sunspot, because I found many environment differences (judging from the settings in your scripts), though the OS and GPU model/number are the same.
With the `mpiexec` launcher, I can now get the world size, rank, local rank, etc. from the MPI object in `mpi4py`.
Sorry for the long delay; I can start my work now :). I will report back soon.
Thank you for your help!
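For readers following along, one way to obtain a local rank under an MPI launcher is to gather every rank's hostname and count the lower-numbered ranks on the same host. The sketch below shows only that counting step in plain Python; `local_rank_from_hostnames` is a hypothetical helper, and the `mpi4py` calls in the trailing comment are an assumed usage pattern, not necessarily what was used in this thread.

```python
from typing import Sequence


def local_rank_from_hostnames(hostnames: Sequence[str], rank: int) -> int:
    """Count the lower-numbered ranks that run on the same host as `rank`.

    `hostnames[i]` is the hostname reported by global rank `i`.
    """
    return sum(1 for host in hostnames[:rank] if host == hostnames[rank])


# Made-up placement: two nodes with two ranks each.
hosts = ["node0", "node0", "node1", "node1"]
print([local_rank_from_hostnames(hosts, r) for r in range(len(hosts))])

# Assumed usage with mpi4py (not executed here):
#   import socket
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   hostnames = comm.allgather(socket.gethostname())
#   local_rank = local_rank_from_hostnames(hostnames, comm.Get_rank())
```

Counting by hostname does not rely on ranks being placed contiguously on each node, so it also works with round-robin placement.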
@garrett361 I reproduced this problem on my side on two nodes. On a single node, there is no such issue (even when looping for 500 steps).
I'll investigate this and report back on any progress. Thanks!
Hi @garrett361 ,
Would you please set `export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0` and try on your side?
I cannot reproduce the problem with this setting.
This seems to be a problem with oneCCL in oneAPI 2024.1.
The development team is investigating this issue.
This could be a workaround for this issue: `export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0` uses the oneCCL kernel from 2024.0.
BR, Xigui
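If exporting the variable in the launch shell is inconvenient, the same workaround can be sketched from inside the Python script. This assumes oneCCL reads the variable when it initializes, so the assignment has to run before the oneCCL bindings are imported and before the process group is created; that ordering requirement is an assumption, not something confirmed in this thread.

```python
import os

# Workaround from the comment above: disable the monolithic pipeline kernel
# for reduce-scatter. setdefault keeps any value already exported in the shell.
os.environ.setdefault("CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL", "0")

# Assumption: oneCCL reads this variable at initialization, so this must run
# before importing the oneCCL bindings or creating the process group, for
# example at the very top of the training script.
print(os.environ["CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL"])
```

Setting it in the launch script with `export` remains the simpler option when every rank is started from the same shell environment.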
Ah interesting, I will try it.
> This seems to be a problem with oneCCL in oneAPI 2024.1.
Yes, looking through the diffs of all the seemingly-relevant packages, I also thought a oneCCL 2024.1 issue seemed to be the most likely root cause.
> Would you please set `export CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL=0` and try on your side?
I am also seeing that this setting eliminates the reduce scatter OOMs.
FYI, I also tried using this env var to address the very similar issues here, but the OOMs were not eliminated in this case. I also tried setting `export CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL=0`, but the script still OOM-ed. Any other ideas there?
This issue is a regression. It was caused by some changes in oneCCL from 2024.0 to 2024.1 (oneCCL is part of oneAPI, and torch-ccl wraps oneCCL to adapt it to PyTorch).
This issue does not occur with oneAPI Base Toolkit 2024.0 (IPEX 2.1.10), but it does with Base Toolkit 2024.1 (IPEX 2.1.30). GitHub issue #646, however, happens on both oneAPI 2024.0 and 2024.1. They are different issues.
Yes, I agree. The above experiments are extra confirmation that the current issue and #646 are at least somewhat independent, despite raising the same errors and having similar characteristics (OOMs only for sufficiently large data).
@garrett361
I verified that this issue is fixed in oneAPI 2024.2.
FYI, here is the feedback from the oneCCL development team:
I can confirm that 2024.2 oneAPI release (2021.13 oneCCL/intelMPI) contains new memory management mechanisms. 2024.1 had some solutions that were not designed to work with ReduceScatter, so it might have resulted in increased memory consumption overall. I'm not sure if they are interested in our source code, but here's the new implementation which takes multiple factors into account before executing a collective and then running collective operation in smaller chunks:
https://github.com/oneapi-src/oneCCL/blob/0eb5987ab930053f81a060a007f2dbfd9f7bf7cc/src/coll/algorithms/algorithm_utils.cpp#L516
https://github.com/oneapi-src/oneCCL/blob/0eb5987ab930053f81a060a007f2dbfd9f7bf7cc/src/coll/algorithms/algorithm_utils.cpp#L357
Also, if they would like to experiment more with the memory usage, they can try using the CCL_ZE_TMP_BUF_SIZE variable, which allows a user to tune the memory consumption of oneCCL topo algorithms. By default it's set to 536870912 bytes (512 MiB). They could then use sysmon to observe memory usage going down with smaller values of CCL_ZE_TMP_BUF_SIZE.
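As a quick check on the default quoted above, 536870912 bytes is 2**29 bytes, which is 512 MiB. The sketch below converts it and prints a few smaller candidate values to try; the halving schedule is just an illustrative choice, not a recommendation from the oneCCL team.

```python
DEFAULT_TMP_BUF_BYTES = 536870912  # default value quoted in the feedback above


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return num_bytes / 2**20


assert DEFAULT_TMP_BUF_BYTES == 2**29

# Print the default and two halvings as shell-style export lines.
for shift in range(3):
    size = DEFAULT_TMP_BUF_BYTES >> shift
    print(f"export CCL_ZE_TMP_BUF_SIZE={size}  # {to_mib(size):.0f} MiB")
```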
> I'm not sure if they are interested in our source code
Definitely always interested in source code, thank you! And thank you in general @xiguiw , you've been extremely helpful and I appreciate it.
> I verified that this issue is fixed in oneAPI 2024.2.
That is great, glad to hear it! Would you possibly be able to also test #646 against oneAPI 2024.2? I will also try on my end, but I will need some time to set up the environment.
I will also try out the `CCL_ZE_TMP_BUF_SIZE` variable.
> That is great, glad to hear it! Would you possibly be able to also test #646 against oneAPI 2024.2? I will also try on my end, but I will need some time to set up the environment.

Sure, I'll test #646 against oneAPI 2024.2, but with a few days' delay, as I am preparing for a local conference over the next two days. I will report back next Monday.
Describe the bug
Repeated calls into `torch.distributed.reduce_scatter_tensor` eventually raise a `ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` error in multi-node setups. Similar behavior is found when using Fully Sharded Data Parallel, which calls into `reduce_scatter_tensor` internally.
Script to reproduce is below. Steps:
`reduce_scatter_tensor` and print out memory readings at each step
Example logs:
The behavior seems specific to multi-node setups. I have not seen the same error raised on a single node.
Versions
Collecting environment information...
PyTorch version: 2.1.0.post2+cxx11.abi
PyTorch CXX11 ABI: Yes
IPEX version: 2.1.30+xpu
IPEX commit: 474a6b3cb
Build type: Release
OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
GCC version: (Spack GCC) 12.2.0
Clang version: N/A
IGC version: 2024.1.0 (2024.1.0.20240308)
CMake version: version 3.27.5
Libc version: glibc-2.31
Python version: 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-5.14.21-150400.24.55-default-x86_64-with-glibc2.31
Is XPU available: True
DPCPP runtime version: latest
MKL version: latest
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[1] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[2] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[3] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[4] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[5] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[6] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[7] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[8] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[9] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[10] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
[11] _DeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=1, total_memory=65536MB, max_compute_units=448, gpu_eu_count=448)
Intel OpenCL ICD version: N/A
Level Zero version: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 208
On-line CPU(s) list: 0-207
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU Max 9470C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 52
Socket(s): 2
Stepping: 8
Frequency boost: enabled
CPU max MHz: 2001.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 amx_tile flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 4.9 MiB (104 instances)
L1i cache: 3.3 MiB (104 instances)
L2 cache: 208 MiB (104 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-51,104-155
NUMA node1 CPU(s): 52-103,156-207
NUMA node2 CPU(s):
NUMA node3 CPU(s):
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.30+xpu
[pip3] numpy==1.23.5
[pip3] torch==2.1.0.post2+cxx11.abi
[pip3] torchvision==0.16.0.post2+cxx11.abi
[conda] intel-extension-for-pytorch 2.1.30+xpu pypi_0 pypi
[conda] mkl 2024.1.0 intel_642 intel
[conda] mkl-dpcpp 2024.1.0 intel_642 intel
[conda] mkl-service 2.4.0 py39hc591bdc_44 intel
[conda] mkl_fft 1.3.8 py39h6b114c4_70 intel
[conda] mkl_random 1.2.4 py39h841069b_90 intel
[conda] mkl_umath 0.1.1 py39h843e89b_100 intel
[conda] numpy 1.23.5 pypi_0 pypi
[conda] onemkl-sycl-blas 2024.1.0 intel_642 intel
[conda] onemkl-sycl-datafitting 2024.1.0 intel_642 intel
[conda] onemkl-sycl-dft 2024.1.0 intel_642 intel
[conda] onemkl-sycl-lapack 2024.1.0 intel_642 intel
[conda] onemkl-sycl-rng 2024.1.0 intel_642 intel
[conda] onemkl-sycl-sparse 2024.1.0 intel_642 intel
[conda] onemkl-sycl-stats 2024.1.0 intel_642 intel
[conda] onemkl-sycl-vm 2024.1.0 intel_642 intel
[conda] torch 2.1.0.post2+cxx11.abi pypi_0 pypi
[conda] torchvision 0.16.0.post2+cxx11.abi pypi_0 pypi