pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Compiled model cannot forward for pytorch 2.0 #90141

Closed sweetice closed 1 year ago

sweetice commented 1 year ago

🐛 Describe the bug

Hello, I downloaded PyTorch 2.0 and tried the toy example:

import torch
import torchvision.models as models

import faulthandler
faulthandler.enable()

model = models.resnet18().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
compiled_model = torch.compile(model)
print("-"*30)
print("Compile Successful")
print("-"*30)
x = torch.randn(16, 3, 224, 224).cuda()
optimizer.zero_grad()
out = compiled_model(x)
out.sum().backward()
optimizer.step()

print("-"*30)
print("Backward")
print("-"*30)

It prints:

(rl) CUDA_VISIBLE_DEVICES=3 python t.py
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator12recordStreamERKNS_7DataPtrENS0_10CUDAStreamE
  warn(f"Failed to load image Python extension: {e}")
------------------------------
Compile Successful
------------------------------
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
Fatal Python error: Segmentation fault

Thread 0x00007fa985aa5700 (most recent call first):
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 302 in wait
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/multiprocessing/queues.py", line 227 in _feed
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 870 in run
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fa9832a4700 (most recent call first):
<no Python frame>

Thread 0x00007fa980aa3700 (most recent call first):
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/selectors.py", line 415 in select
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/multiprocessing/connection.py", line 931 in wait
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/concurrent/futures/process.py", line 362 in _queue_management_worker
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 870 in run
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007faa9c142740 (most recent call first):
  File "<string>", line 4 in launcher
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/_inductor/triton_ops/autotune.py", line 177 in run
  File "/tmp/torchinductor_mzy813/yl/cylr2qlilp4g5atdhd6i7nugnjapj56nekvjqhksr4dfp2nx4hde.py", line 1689 in call
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 203 in run
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 836 in call_func_with_args
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 1455 in forward
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 1551 in compiled_function
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 1687 in debug_compiled_function
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 811 in g
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 2107 in forward
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 209 in _fn
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torchvision/models/resnet.py", line 284 in forward
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 209 in _fn
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 80 in forward
  File "/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1480 in _call_impl
  File "t.py", line 15 in <module>
Segmentation fault

This means the call to torch.compile(model) returns successfully, but the first forward pass of the compiled model crashes with a segmentation fault.

Versions

I compiled GCC myself on the cluster, and PyTorch also runs on the cluster.

GCC

(rl) gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/mnt/qb/work/maghsudi/mzy813/installsoftware/gcc10-4/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/10.4.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-10.4.0/configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.4.0 (GCC)

CUDA:

(rl) nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
[mzy813@slurm-bm-83 ~]$nvidia-smi
Sun Dec  4 20:17:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
|  0%   49C    P2    51W / 250W |    890MiB / 11264MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:3E:00.0 Off |                  N/A |
|  0%   29C    P8    21W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:60:00.0 Off |                  N/A |
|  0%   29C    P8    21W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:61:00.0 Off |                  N/A |
|  0%   29C    P8     1W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
|  0%   24C    P8     1W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:B2:00.0 Off |                  N/A |
|  0%   26C    P8     1W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:DA:00.0 Off |                  N/A |
|  0%   26C    P8    17W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:DB:00.0 Off |                  N/A |
|  0%   25C    P8    16W / 250W |      3MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155958      C   python                            887MiB |
+-----------------------------------------------------------------------------+

Installed packages (pip list):

Package                 Version               Editable project location
----------------------- --------------------- ---------------------------------------------
absl-py                 1.3.0
ale-py                  0.7.5
antlr4-python3-runtime  4.8
asttokens               2.1.0
atari-py                0.2.9
backcall                0.2.0
cachetools              5.2.0
certifi                 2022.9.24
cffi                    1.15.1
charset-normalizer      2.1.1
chex                    0.1.5
click                   8.1.3
cloudpickle             1.6.0
cmake                   3.25.0
commonmark              0.9.1
configparser            5.3.0
contourpy               1.0.6
cycler                  0.11.0
Cython                  0.29.32
DateTime                4.7
decorator               4.4.2
dm-control              1.0.7
dm-env                  1.5
dm-tree                 0.1.7
dmc2gym                 1.0.0                 /mnt/qb/work/maghsudi/mzy813/software/dmc2gym
docker-pycreds          0.4.0
executing               1.2.0
fasteners               0.18
filelock                3.8.0
flax                    0.6.2
fonttools               4.38.0
gitdb                   4.0.9
GitPython               3.1.29
glfw                    2.5.5
glibc                   0.6.1
google-auth             2.12.0
google-auth-oauthlib    0.4.6
graphql-core            3.2.3
grpcio                  1.49.1
gym                     0.19.0
gym-notices             0.0.8
Hydra                   2.5
hydra-core              1.1.0
hydra-submitit-launcher 1.1.5
idna                    3.4
imageio                 2.22.1
imageio-ffmpeg          0.4.7
importlib-metadata      5.0.0
importlib-resources     5.10.0
ipdb                    0.13.9
ipython                 8.6.0
jax                     0.3.25
jaxlib                  0.3.25+cuda11.cudnn82
jedi                    0.18.1
Jinja2                  3.1.2
kiwisolver              1.4.4
kornia                  0.6.8
labmaze                 1.0.5
lxml                    4.9.1
Markdown                3.4.1
MarkupSafe              2.1.1
matplotlib              3.6.2
matplotlib-inline       0.1.6
moviepy                 1.0.3
mpmath                  1.2.1
msgpack                 1.0.4
mujoco                  2.3.0
mujoco-py               2.1.2.14
networkx                3.0rc1
numpy                   1.24.0rc1
oauthlib                3.2.1
omegaconf               2.1.2
opencv-python           4.6.0.66
opt-einsum              3.3.0
optax                   0.1.4
packaging               21.3
pandas                  1.5.0
parso                   0.8.3
patchelf                0.15.0.0
pathtools               0.1.2
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  9.2.0
pip                     22.2.2
proglog                 0.1.10
promise                 2.3
prompt-toolkit          3.0.32
protobuf                3.19.6
psutil                  5.9.4
ptyprocess              0.7.0
pure-eval               0.2.2
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pybullet                3.2.5
pycparser               2.21
Pygments                2.13.0
PyOpenGL                3.1.6
pyparsing               2.4.7
python-dateutil         2.8.2
pytz                    2022.4
PyWavelets              1.4.1
PyYAML                  6.0
requests                2.28.1
requests-oauthlib       1.3.1
rich                    12.6.0
rsa                     4.9
scikit-image            0.19.3
scipy                   1.9.2
sentry-sdk              1.10.1
setuptools              65.5.0
shortuuid               1.0.10
six                     1.16.0
sklearn                 0.0.post1
smmap                   5.0.0
stack-data              0.6.0
submitit                1.4.5
sympy                   1.11.1
tb-nightly              2.12.0a20221110
tensorboard             2.10.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
tensorstore             0.1.28
termcolor               2.0.1
tifffile                2022.10.10
toml                    0.10.2
toolz                   0.12.0
torchtriton             2.0.0+0d7e753227
torchvision             0.13.1
tqdm                    4.64.1
traitlets               5.5.0
tsnecuda                3.0.1
typing_extensions       4.4.0
urllib3                 1.26.12
wandb                   0.11.1
wcwidth                 0.2.5
Werkzeug                2.2.2
wheel                   0.37.1
yapf                    0.31.0
zipp                    3.9.0
zope.interface          5.5.0

Versions

[mzy813@slurm-bm-83 ~]$python collect_env.py
Collecting environment information...
^Z
[1]+  Stopped                 python collect_env.py
[mzy813@slurm-bm-83 ~]$conda activate rl
(rl) python collect_env.py
Collecting environment information...
PyTorch version: 1.14.0.dev20221204+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 10.4.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.17

Python version: 3.8.10 | packaged by conda-forge | (default, Sep 13 2021, 21:46:58)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti
GPU 4: NVIDIA GeForce RTX 2080 Ti
GPU 5: NVIDIA GeForce RTX 2080 Ti
GPU 6: NVIDIA GeForce RTX 2080 Ti
GPU 7: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.0rc1
[pip3] torch==1.14.0.dev20221204+cu117
[pip3] torchtriton==2.0.0+0d7e753227
[pip3] torchvision==0.13.1
[conda] cudatoolkit               11.7.0              hd8887f6_10    conda-forge
[conda] numpy                     1.24.0rc1                pypi_0    pypi
[conda] torch                     1.14.0.dev20221204+cu117          pypi_0    pypi
[conda] torchtriton               2.0.0+0d7e753227          pypi_0    pypi
[conda] torchvision               0.13.1                   pypi_0    pypi

verify_dynamo.py

(rl) python verify_dynamo.py
Python version: 3.8.10
`torch` version: 1.14.0.dev20221204+cu117
CUDA version: 11.7

/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator12recordStreamERKNS_7DataPtrENS0_10CUDAStreamE
  warn(f"Failed to load image Python extension: {e}")
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/mnt/qb/work/maghsudi/mzy813/anaconda3/envs/rl/lib/python3.8/site-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
All required checks passed

cc @ezyang @gchanan @zou3519 @soumith

wconstab commented 1 year ago

I didn't see a segfault on tip, but I am on CUDA 11.6. Does the minifier work correctly when segfaults are involved? If so, it might help to run it, since the stack trace seems to implicate an inductor/triton kernel.
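
One practical wrinkle with segfaults is that they kill the interpreter, so any in-process tooling dies with it. Here is a minimal sketch, assuming a POSIX system, of running a repro in a child process and reporting which signal ended it. The helper name `run_isolated` is hypothetical and not part of PyTorch, and the crashing snippet is only a stand-in for the real script:

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> str:
    """Run `code` in a fresh Python process and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in for the real repro: the child deliberately sends SIGSEGV to itself.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    print(run_isolated(crash))
    print(run_isolated("print('ok')"))
```

Because the parent process survives the crash, a wrapper like this could drive repeated attempts with different settings, which is the kind of loop a minifier needs.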

ezyang commented 1 year ago

I see torchvision is 0.13.1, which is probably the problem. Uninstall it and reinstall using the instructions from https://pytorch.org/get-started/pytorch-2.0/#getting-started
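
A quick way to spot this kind of mismatch is to compare the installed versions side by side. The sketch below uses only `importlib.metadata`. The rule it applies (a `.dev` nightly `torch` paired with a non-nightly companion package is suspicious) is an illustrative heuristic, not an official compatibility check, and the helper names are hypothetical:

```python
from importlib import metadata
from typing import Dict, List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_nightly(version: str) -> bool:
    """Nightly wheels carry a dev segment, e.g. 1.14.0.dev20221204+cu117."""
    return ".dev" in version


def find_mismatches(versions: Dict[str, Optional[str]]) -> List[str]:
    """Flag companion packages whose release channel differs from torch's."""
    torch_version = versions.get("torch")
    if torch_version is None:
        return ["torch is not installed"]
    problems = []
    for name, version in versions.items():
        if name == "torch" or version is None:
            continue
        # Assumed heuristic: nightly torch should be paired with nightly companions.
        if is_nightly(torch_version) != is_nightly(version):
            problems.append(f"{name} {version} does not match torch {torch_version}")
    return problems


if __name__ == "__main__":
    found = {name: installed_version(name) for name in ("torch", "torchvision")}
    for line in find_mismatches(found) or ["no obvious mismatch"]:
        print(line)
```

With the versions reported in this issue, the check flags torchvision 0.13.1 against the torch 1.14.0.dev20221204+cu117 nightly.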

sweetice commented 1 year ago

Installing numpy<1.24 solves this issue. I don't know why.
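
To check whether an environment is still on the affected NumPy series, the version string can be compared against the `numpy<1.24` pin. This is a minimal sketch using only the standard library. `release_tuple` and `satisfies_pin` are hypothetical helpers, and comparing only the major and minor numbers is a simplification of full PEP 440 ordering:

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.24.0rc1' or '1.23.5'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def satisfies_pin(version: str, upper: Tuple[int, int] = (1, 24)) -> bool:
    """True when the version is strictly below the given major.minor bound."""
    return release_tuple(version) < upper


if __name__ == "__main__":
    # 1.24.0rc1 is the version listed in this issue; 1.23.5 is an example below the pin.
    for candidate in ("1.24.0rc1", "1.23.5"):
        print(candidate, satisfies_pin(candidate))
```

The equivalent fix from the shell is `pip install "numpy<1.24"`.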