msr-fiddle / pipedream

MIT License
379 stars 117 forks

When I was testing the pipedream code with version-updated torch, I encountered the following error (1.1.0 -> 1.11.0): #78

Open lengien opened 1 year ago

lengien commented 1 year ago
  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 356, in train
    r.run_forward()
  File "../runtime_3.py", line 511, in run_forward
    self._run_forward(tensors)
  File "../runtime_3.py", line 559, in _run_forward
    for input_name in input_names])
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/pipeline/runtime/image_classification/models/alexnet/gpus=4_straight/stage2.py", line 25, in forward
    out5 = self.layer5(out4)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
    self.padding, self.dilation, self.groups)
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 407, in train
    r.run_backward()
  File "../runtime_3.py", line 648, in run_backward
    for output_name in outputs]))
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
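
For reference, the same failure can be reproduced outside the PipeDream runtime in a few lines. This is only an illustration: the layer and tensor shapes below are made up and are not taken from the generated stage code.

```python
import torch
import torch.nn as nn

# Stand-in for one pipeline stage: two forward passes are issued before the
# first backward pass, with an optimizer step in between.
layer = nn.Conv2d(256, 256, kernel_size=3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

# The inputs stand in for activations received from the previous stage.
x1 = torch.randn(2, 256, 8, 8, requires_grad=True)
x2 = torch.randn(2, 256, 8, 8, requires_grad=True)

out1 = layer(x1)         # forward for minibatch 1 (autograd saves the current weight)
out2 = layer(x2)         # forward for minibatch 2

out2.sum().backward()    # backward for minibatch 2
opt.step()               # in-place update bumps the weight's version counter
opt.zero_grad()

out1.sum().backward()    # backward for minibatch 1: on torch 1.11 this raises the
                         # "modified by an inplace operation" RuntimeError shown above
```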

PipeDream's schedule has some stages perform several forward passes before the corresponding backward passes. It seems this pattern now trips an autograd check in newer versions of Torch. I would like to ask how to avoid this problem.
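
The PipeDream paper deals with this through weight stashing: every in-flight minibatch keeps the version of the weights it was forwarded with, and its backward pass is computed against that stashed version rather than the already-updated live parameters. Below is only a rough sketch of that idea on a single module, not the actual runtime code; it uses torch.func.functional_call (available from torch 2.0, not in 1.11), and the helper name stash_params is made up.

```python
import torch
import torch.nn as nn

def stash_params(module):
    """Clone the module's current parameters so a later backward can use them."""
    return {name: p.detach().clone().requires_grad_(True)
            for name, p in module.named_parameters()}

layer = nn.Conv2d(256, 256, kernel_size=3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x1 = torch.randn(2, 256, 8, 8, requires_grad=True)
x2 = torch.randn(2, 256, 8, 8, requires_grad=True)

# Forward each minibatch against its own stashed copy of the weights.
w1 = stash_params(layer)
out1 = torch.func.functional_call(layer, w1, (x1,))
w2 = stash_params(layer)
out2 = torch.func.functional_call(layer, w2, (x2,))

# Backward for minibatch 2, then apply the update to the live parameters.
out2.sum().backward()
for name, p in layer.named_parameters():
    p.grad = w2[name].grad
opt.step()
opt.zero_grad()

# Backward for minibatch 1 still succeeds: its graph saved w1,
# which the optimizer step above never touched.
out1.sum().backward()
for name, p in layer.named_parameters():
    p.grad = w1[name].grad
opt.step()
```

In the real runtime this bookkeeping is done inside the optimizer wrapper, and (as described in the paper) the number of stashed weight versions per stage equals the number of minibatches in flight on that stage.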

Versions

Collecting environment information...
PyTorch version: 1.11.0+cu115
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-204-generic-x86_64-with-debian-stretch-sid
Is CUDA available: True
CUDA runtime version: 10.1.163
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla P100-PCIE-12GB
GPU 1: Tesla P100-PCIE-12GB
GPU 2: Tesla P100-PCIE-12GB
GPU 3: Tesla P100-PCIE-12GB
GPU 4: Tesla P100-PCIE-12GB
GPU 5: Tesla P100-PCIE-12GB
GPU 6: Tesla P100-PCIE-12GB
GPU 7: Tesla P100-PCIE-12GB

Nvidia driver version: 515.65.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 20
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.102
BogoMIPS: 4404.71
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cu115
[pip3] torchvision==0.12.0+cu115
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] msgpack-numpy 0.4.3.2 py37_0
[conda] numpy 1.21.5 py37h7a5d4dd_2
[conda] numpy-base 1.21.5 py37hb8be1f0_2
[conda] torch 1.11.0+cu115 pypi_0 pypi
[conda] torchvision 0.12.0+cu115 pypi_0 pypi

xglds99 commented 1 year ago

I ran into the same problem. Have you solved it? If so, could you share your solution? Thank you.

lengien commented 1 year ago

I found a new framework, PiPPy (https://github.com/pytorch/PiPPy), which can be used with torch 2.0.

hnust-xxq commented 1 month ago

Can you tell me how to run this code successfully? I keep hitting errors when applying the patch file. My environment is an RTX 4090D (driver version 550.78), PyTorch 2.1.2, Python 3.10 on Ubuntu 22.04, and CUDA 11.8, and I don't know how to adjust it to meet this code's requirements. Can you help me, or share the configuration you used to run it?

lengien commented 1 month ago

Your PyTorch version may be too high. I ran the code successfully in Docker: nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3


hnust-xxq commented 1 month ago


I'm using a cloud server and have only been allocated a Docker container. Can I pull another Docker image and use Docker commands from inside it? I don't think that's possible, right? Do you know how to work with this kind of cloud setup?

lengien commented 1 month ago

Try torch==1.1.0, or try another pipeline library, PiPPy. PiPPy can be run with a newer version of torch.


hnust-xxq commented 1 month ago

Could you please tell me the specs of your GPU server? I only have the server's Docker container and cannot use Docker to create environments, so I need to install all the dependencies myself. Thank you! My English is not very good, so I apologize if I offend you in any way.


hnust-xxq commented 1 month ago

Is PyTorch v1.1.0 feasible? The command nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3 corresponds to PyTorch v1.0.0. Did you encounter any issues when applying the patch?

lengien commented 1 month ago

PyTorch v1.1.0 works in my environment.
