Open BowenBao opened 2 years ago
I am unable to reproduce on my side. How long have you been facing this? I wonder if this could be related to the rework at https://github.com/pytorch/pytorch/pull/70009. There is an outstanding PR https://github.com/pytorch/vision/pull/5299 to address the changes.
Hi @datumbox, I'm seeing this on the latest master/main branches of both pytorch and vision. I just tried again with pytorch @ 31b348411afa608639a2f7353060974c849829dd and vision @ 435eddf7a8200cc26338036a0a5f7db067ac7b0c. The short script below results in a segfault:
import torch
import torchvision
model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
print(model)
print(model.state_dict())
After removing print(model), I get one of the following results, seemingly at random:
vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
"must run observer before calling calculate_qparams. " +
Traceback (most recent call last):
File "export_pt_mobilenet_v3_quant.py", line 49, in convert_mobilenet
print(model.state_dict())
File "pytorch/torch/nn/parameter.py", line 37, in __repr__
return 'Parameter containing:\n' + super(Parameter, self).__repr__()
File "pytorch/torch/_tensor.py", line 294, in __repr__
return torch._tensor_str._str(self)
File "pytorch/torch/_tensor_str.py", line 434, in _str
return _str_intern(self)
File "pytorch/torch/_tensor_str.py", line 317, in _str_intern
if self.device.type != torch._C._get_default_device()\
RuntimeError: tensor does not have a device
or
vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
"must run observer before calling calculate_qparams. " +
Segmentation fault (core dumped)
or
vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
"must run observer before calling calculate_qparams. " +
Traceback (most recent call last):
File "export_pt_mobilenet_v3_quant.py", line 49, in convert_mobilenet
print(model.state_dict())
KeyError: 'features.6.block.3.0.bias'
from running
import torch
import torchvision
model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
print(model.state_dict())
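One of the outcomes above is a bare "Segmentation fault (core dumped)" with no Python traceback. A minimal sketch of one way to get more information, assuming the standard-library faulthandler module is acceptable here: enabling it makes the interpreter dump the Python stack of every thread when a fatal signal such as SIGSEGV arrives. The helper name enable_crash_tracebacks is hypothetical, and the repro lines are left commented out because they need torch and torchvision installed.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask the interpreter to dump Python tracebacks on fatal signals.

    `stream` must be a real file object with a file descriptor;
    it defaults to sys.stderr. Returns True once the handler is active.
    """
    if stream is None:
        stream = sys.stderr
    # all_threads=True dumps every thread, not only the crashing one.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # Repro from this thread (requires torch and torchvision):
    # import torchvision
    # model = torchvision.models.quantization.mobilenet_v3_large(
    #     pretrained=True, quantize=True)
    # print(model.state_dict())
```

The same effect should be available without editing the script by running it as python -X faulthandler export_pt_mobilenet_v3_quant.py. The dump only shows the Python frames that were active at the crash; it does not identify the native code at fault.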
@BowenBao Thanks for continuing to look into this.
@jerryzh168 this might be related to the changes done at https://github.com/pytorch/pytorch/pull/71956
Unfortunately I still cannot reproduce the problem using the latest TorchVision main and pytorch 1.11.0.dev20220202 py3.9_cuda11.1_cudnn8.0.5_0 pytorch-nightly. Running the above script on a Linux dev server yields the expected results.
The other crazy thing is that with the same nightly on macOS (pytorch 1.11.0.dev20220202 py3.8_0 pytorch-nightly), it seems we are missing the new Quantization API. Here is the error:
method = torch.ao.quantization.fuse_modules_qat if is_qat else torch.ao.quantization.fuse_modules
AttributeError: module 'torch.ao.quantization' has no attribute 'fuse_modules_qat'
Jerry, any thoughts?
Looks like a version mismatch: fuse_modules_qat was introduced in the most recent master, so maybe the nightlies haven't picked it up yet. It is in master right now: https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/__init__.py#L5
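A quick way to tell whether a given install already exposes the new function is to probe for the attribute directly. The sketch below is only an illustration: has_attribute is a hypothetical helper, and it returns None when the module cannot be imported at all, so that case is distinguishable from "installed but too old".

```python
import importlib


def has_attribute(module_name, attr_name):
    """Return True/False if the module imports, or None if it is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # False here would match the AttributeError reported above.
    print(has_attribute("torch.ao.quantization", "fuse_modules_qat"))
```

Printing torch.__version__ alongside the result would show which nightly was actually picked up by the environment.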
@datumbox I created a fresh conda environment and tried a similar nightly build, and I can still repro ...
Could it be due to a different pretrained model checkpoint?
Edit: never mind the checkpoint theory, I can repro with pretrained=False too. But it is very random. If I comment out the image loading part and use a random tensor as input, pretrained=False succeeds in the initial repro code, while pretrained=True still segfaults.
Logs
Downloading: "https://download.pytorch.org/models/quantized/mobilenet_v3_large_qnnpack-5bcacf28.pth" to /home/bowbao/.cache/torch/hub/checkpoints/mobilenet_v3_large_qnnpack-5bcacf28.pth
100.0%
/home/bowbao/anaconda3/envs/torch39/lib/python3.9/site-packages/torch/ao/quantization/utils.py:210: UserWarning: must run observer before calling calculate_qparams. Returning default values.
warnings.warn(
start tracing...
end tracing...
Segmentation fault (core dumped)
Env:
Package            Version
------------------ ------------------------
certifi            2021.10.8
charset-normalizer 2.0.10
idna               3.3
numpy              1.22.1
Pillow             9.0.0
pip                21.2.4
requests           2.27.1
setuptools         58.0.4
torch              1.11.0.dev20220203+cu111
torchvision        0.12.0.dev20220203+cu111
typing_extensions  4.0.1
urllib3            1.26.8
wheel              0.37.1
I don't think the API change would cause a segfault. Could you wait a day to make sure the change is in the nightlies? Or can we check now?
🐛 Describe the bug
Sample code to reproduce:
Output:
The segfault happens randomly. When I tried to debug using pdb, the segfault happened at a different line each time (the original code tried to do other things after tracing; the print code above seems to be the simplest repro I could create), and every crash happens after torch.jit.trace is called.
Versions
Collecting environment information...
PyTorch version: 1.11.0.dev20220127+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.2
Libc version: glibc-2.9
Python version: 3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-27-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080
Nvidia driver version: 510.39.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.11.0.dev20220127+cpu
[pip3] torchvision==0.12.0.dev20220127+cpu
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.21.2 py37hd8d4704_0
[conda] numpy-base 1.21.2 py37h2b8c604_0
[conda] torch 1.11.0.dev20220127+cpu pypi_0 pypi
[conda] torchvision 0.12.0.dev20220127+cpu pypi_0 pypi
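Because the crash is intermittent, a single run says little about whether a given environment is affected. The sketch below is an assumption about how one might quantify that, not part of the original report: it runs a command several times in fresh subprocesses and tallies how each run ended. The helper names describe_exit and tally_runs are hypothetical. It relies on subprocess reporting a negative return code when a child is killed by a signal, which is POSIX behaviour.

```python
import collections
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(argv, runs=10):
    """Run `argv` `runs` times and count how each run ended."""
    counts = collections.Counter()
    for _ in range(runs):
        completed = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[describe_exit(completed.returncode)] += 1
    return counts


if __name__ == "__main__":
    # Script name taken from the tracebacks in this thread.
    print(tally_runs([sys.executable, "export_pt_mobilenet_v3_quant.py"], runs=20))
```

Python-level failures such as the RuntimeError and KeyError seen earlier in the thread would both show up as "exit 1", so telling them apart would still require keeping stderr instead of discarding it.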