pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Segfault after tracing quantized mobilenet v3 using nightly version #5303

Open BowenBao opened 2 years ago

BowenBao commented 2 years ago

🐛 Describe the bug

Sample code to reproduce:

import torch
import torchvision
from torchvision import transforms
from PIL import Image

def download_file(url, filename):
    # urllib.URLopener does not exist at the top level in Python 3;
    # urllib.request.urlretrieve is sufficient here.
    import urllib.request
    urllib.request.urlretrieve(url, filename)

def download_data():
    download_file("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")

def trace_mobilenet():
    model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, progress=True, quantize=True)
    model.eval()

    # validate that model runs
    input_image = Image.open("dog.jpg")
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

    print('start tracing...')
    export_model = torch.jit.trace(model, input_batch)
    print('end tracing...')
    print(model)
    print(export_model)

download_data()
trace_mobilenet()

Output:

/anaconda3/envs/torch37/lib/python3.7/site-packages/torch/ao/quantization/utils.py:175: UserWarning: must run observer before calling calculate_qparams. Returning default values.
  "must run observer before calling calculate_qparams. " +
start tracing...
end tracing...
Segmentation fault (core dumped)

The segfault happens randomly. When I tried to debug with pdb, the crash occurred at a different line each time (the original code does other work after tracing; the print calls above are the simplest repro I could create), but it always happens after torch.jit.trace is called.
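
For completeness, one way to get at least a Python-level stack at crash time is the standard-library faulthandler module; a minimal sketch (not part of the original repro):

import faulthandler
import torch
import torchvision

# Dump the active Python stack to stderr if a fatal signal (e.g. SIGSEGV) arrives.
faulthandler.enable()

model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
model.eval()

# Assumed input shape 1x3x224x224, matching the preprocessing above.
export_model = torch.jit.trace(model, torch.rand(1, 3, 224, 224))
print(export_model)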

Versions

Collecting environment information...
PyTorch version: 1.11.0.dev20220127+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.2
Libc version: glibc-2.9

Python version: 3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-27-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080
Nvidia driver version: 510.39.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.11.0.dev20220127+cpu
[pip3] torchvision==0.12.0.dev20220127+cpu
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.21.2 py37hd8d4704_0
[conda] numpy-base 1.21.2 py37h2b8c604_0
[conda] torch 1.11.0.dev20220127+cpu pypi_0 pypi
[conda] torchvision 0.12.0.dev20220127+cpu pypi_0 pypi

datumbox commented 2 years ago

I am unable to reproduce on my side. How long have you been facing this? I wonder if this could be related to the rework at https://github.com/pytorch/pytorch/pull/70009. There is an outstanding PR https://github.com/pytorch/vision/pull/5299 to address the changes.

BowenBao commented 2 years ago

Hi @datumbox, I'm seeing this with the latest master/main branches of both pytorch and vision. Just now I tried again with pytorch @ 31b348411afa608639a2f7353060974c849829dd and vision @ 435eddf7a8200cc26338036a0a5f7db067ac7b0c. The short script below results in a segfault:

    import torch
    import torchvision
    model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
    print(model)
    print(model.state_dict())

Removing print(model), I get one of the following outcomes at random:

vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
  "must run observer before calling calculate_qparams. " +
Traceback (most recent call last):
  File "export_pt_mobilenet_v3_quant.py", line 49, in convert_mobilenet
    print(model.state_dict())
  File "pytorch/torch/nn/parameter.py", line 37, in __repr__
    return 'Parameter containing:\n' + super(Parameter, self).__repr__()
  File "pytorch/torch/_tensor.py", line 294, in __repr__
    return torch._tensor_str._str(self)
  File "pytorch/torch/_tensor_str.py", line 434, in _str
    return _str_intern(self)
  File "pytorch/torch/_tensor_str.py", line 317, in _str_intern
    if self.device.type != torch._C._get_default_device()\
RuntimeError: tensor does not have a device

or

vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
  "must run observer before calling calculate_qparams. " +
Segmentation fault (core dumped)

or

vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
pytorch/torch/ao/quantization/utils.py:211: UserWarning: must run observer before calling calculate_qparams. Returning default values.
  "must run observer before calling calculate_qparams. " +
Traceback (most recent call last):
  File "export_pt_mobilenet_v3_quant.py", line 49, in convert_mobilenet
    print(model.state_dict())
KeyError: 'features.6.block.3.0.bias'

all from running:

    import torch
    import torchvision
    model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
    print(model.state_dict())

datumbox commented 2 years ago

@BowenBao Thanks for continuing to look into it.

@jerryzh168 this might be related to the changes done at https://github.com/pytorch/pytorch/pull/71956

Unfortunately I still cannot reproduce the problem using the latest TorchVision main and pytorch 1.11.0.dev20220202 py3.9_cuda11.1_cudnn8.0.5_0 pytorch-nightly. Running the above script on a Linux dev server yields the expected results.

The other strange thing is that with the same nightly on macOS (pytorch 1.11.0.dev20220202 py3.8_0 pytorch-nightly), the new quantization API appears to be missing. Here is the error:

    method = torch.ao.quantization.fuse_modules_qat if is_qat else torch.ao.quantization.fuse_modules
AttributeError: module 'torch.ao.quantization' has no attribute 'fuse_modules_qat'

Jerry, any thoughts?

jerryzh168 commented 2 years ago

Looks like a versioning problem: fuse_modules_qat was introduced on the most recent master, so maybe it has not been picked up by the nightlies yet? It is in master right now: https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/__init__.py#L5
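
For code that has to run on nightlies from both before and after that change, a possible guard would be something like the following sketch (not the fix in torchvision's PR):

    import torch.ao.quantization as tq

    # Fall back to fuse_modules on builds that predate fuse_modules_qat.
    fuse_fn = getattr(tq, "fuse_modules_qat", tq.fuse_modules)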

BowenBao commented 2 years ago

@datumbox I created a fresh conda environment and tried a similar nightly build, and I can still repro...

Could it be due to a different pretrained model checkpoint?

Edit: never mind the checkpoint difference, I can repro with pretrained=False too, but it is very random. If I comment out the image-loading part and use a random tensor as input, the initial repro code succeeds with pretrained=False; pretrained=True still segfaults.
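
A sketch of that variation (the image pipeline replaced by a random 1x3x224x224 tensor; not the exact script I ran):

    import torch
    import torchvision

    model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=True)
    model.eval()

    # Random tensor stands in for the preprocessed dog.jpg batch.
    input_batch = torch.rand(1, 3, 224, 224)

    print('start tracing...')
    export_model = torch.jit.trace(model, input_batch)
    print('end tracing...')
    print(export_model)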

Logs

Downloading: "https://download.pytorch.org/models/quantized/mobilenet_v3_large_qnnpack-5bcacf28.pth" to /home/bowbao/.cache/torch/hub/checkpoints/mobilenet_v3_large_qnnpack-5bcacf28.pth
100.0%
/home/bowbao/anaconda3/envs/torch39/lib/python3.9/site-packages/torch/ao/quantization/utils.py:210: UserWarning: must run observer before calling calculate_qparams. Returning default values.
  warnings.warn(
start tracing...
end tracing...
Segmentation fault (core dumped)

Env:

Package            Version
------------------ ------------------------
certifi            2021.10.8
charset-normalizer 2.0.10
idna               3.3
numpy              1.22.1
Pillow             9.0.0
pip                21.2.4
requests           2.27.1
setuptools         58.0.4
torch              1.11.0.dev20220203+cu111
torchvision        0.12.0.dev20220203+cu111
typing_extensions  4.0.1
urllib3            1.26.8
wheel              0.37.1
jerryzh168 commented 2 years ago

I don't think the API change would cause a segfault. Could you wait one day to make sure the change is in the nightlies? Or can we check now?
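
A quick way to check what an installed nightly contains, as a sketch (torch.version.git_version reports the commit the wheel was built from):

    import torch
    import torch.ao.quantization as tq

    print(torch.__version__)                 # e.g. 1.11.0.devYYYYMMDD+cpu
    print(torch.version.git_version)         # commit the build was cut from
    print(hasattr(tq, "fuse_modules_qat"))   # True once the new API is in the nightly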