Closed ZhuoranLyu closed 7 months ago
I also tested on Nvidia GPUs, and I'm wondering why it compares a tensor on CPU with a tensor on GPU. Confusing :( https://github.com/microsoft/DeepSpeed/blob/870ae041d42190be8139afc12bef51d6ed7719f3/deepspeed/runtime/zero/stage3.py#L2081C37-L2081C37
If I move the second tensor to NPU using .to('npu'), it works smoothly.
Tensors in one op should be on the same device. Both GPU and NPU could raise an error here, because the second tensor is on CPU.
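The `.to('npu')` fix described above can be sketched in plain PyTorch. This is a minimal, illustrative snippet, not the actual DeepSpeed code: it falls back to CPU when no accelerator is present, and `cuda` stands in for `npu`.

```python
import torch

# Fall back to CPU so the sketch also runs without an accelerator;
# on an Ascend machine the device would be "npu" instead of "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.zeros(1, device="cpu")                     # second operand, on CPU
b = torch.zeros(1, dtype=torch.bool, device=device)  # accumulator, on the accelerator

# Moving the CPU-side result onto b's device first keeps all operands
# of the in-place op on one device, which eager mode requires.
result = b.logical_or_(torch.isinf(a).to(b.device))
print(result.item())  # False
```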
However, it works fine on Nvidia GPU. Really strange.
On GPU:

```python
>>> import torch
>>> a = torch.zeros(1, device="cpu")
>>> b = torch.zeros(1, device="cuda")
>>> b.logical_or_(torch.isinf(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
I totally understand, but if you check the tensors at https://github.com/microsoft/DeepSpeed/blob/870ae041d42190be8139afc12bef51d6ed7719f3/deepspeed/runtime/zero/stage3.py#L2081C37-L2081C37, you'll find two tensors on different devices.
This also confused me. I don't know why it runs on the GPU.
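As a workaround on the caller side, the same `.to(...)` trick applies. Below is a minimal sketch of the pattern, not the actual DeepSpeed code: the names `overflow` and `partial_norm` are hypothetical stand-ins, and `cuda` stands in for `npu`.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for the DeepSpeed state: an overflow flag on
# the accelerator, and a norm computed on CPU because of CPU offload.
overflow = torch.zeros(1, dtype=torch.bool, device=device)
partial_norm = torch.tensor([float("inf")], device="cpu")  # e.g. an overflowed norm

# Bring the CPU-side check onto the flag's device before the in-place
# logical_or_, mirroring the .to('npu') fix mentioned above.
overflow.logical_or_(torch.isinf(partial_norm).to(overflow.device))
print(overflow.item())  # True
```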
Hi @ZhuoranLyu, we have supported this operation across different devices in the latest version of torch_npu; please update the torch_npu package as follows. If this issue still exists, please provide more detailed information, such as the versions of CANN, torch, and torch_npu.

master/v2.2.0

- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py38.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py39.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py310.tar.gz

v2.0.1

- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py38.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py39.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py310.tar.gz

v1.11.0

- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py37.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py38.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py39.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py310.tar.gz

v2.1.0

- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py38.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py39.tar.gz
- https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py310.tar.gz
@ZhuoranLyu Please note that these are nightly builds that may work around your issue. Remember to switch back to a stable release version later.
Thanks a lot!
Describe the bug Training the Baichuan13B model fails on the Huawei platform when using DeepSpeed Stage 3 with CPU offload.
To Reproduce Steps to reproduce the behavior:
Expected behavior Training
ds_report output
Screenshots
It looks like it's trying to compare a tensor on CPU with another tensor on GPU/NPU?
System info (please complete the following information):
Launcher context
Docker context no
Additional context no