pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Issue to load vit_h_14 model with pretrained weights (DEFAULT or IMAGENET1K_SWAG_E2E_V1) #6489

Open giangdip2410 opened 1 year ago

giangdip2410 commented 1 year ago

🐛 Describe the bug

I hit the error below when training a vit_h_14 model with pretrained weights. If I do not load pretrained weights, everything works fine.

How to reproduce this bug

import torchvision
model = torchvision.models.get_model('vit_h_14', weights='DEFAULT')

or

model = torchvision.models.get_model('vit_h_14', weights='IMAGENET1K_SWAG_E2E_V1')

Traceback (most recent call last):
  File "train.py", line 545, in <module>
    main(args)
  File "train.py", line 225, in main
    model = torchvision.models.get_model(args.model, weights=args.weights)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 225, in get_model
    return fn(**config)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 764, in vit_h_14
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 335, in _vision_transformer
    model.load_state_dict(weights.get_state_dict(progress=progress))
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 66, in get_state_dict
    return load_state_dict_from_url(self.url, progress=progress)
  File "/usr/local/lib/python3.7/dist-packages/torch/hub.py", line 731, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 726, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 262, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33723) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------

Versions

Collecting environment information...
PyTorch version: 1.13.0.dev20220810+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Dec 9 2021, 17:04:37) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
Nvidia driver version: 470.141.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-lightning==1.7.1
[pip3] pytorch-lightning-bolts==0.3.2.post1
[pip3] torch==1.13.0.dev20220810+cu113
[pip3] torchaudio==0.13.0.dev20220810+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.14.0.dev20220810+cu113
[conda] Could not collect

datumbox commented 1 year ago

@giangdip2410 I can't reproduce the problem. The following works fine for me:

import torchvision
torchvision.models.get_model('vit_h_14', weights='DEFAULT')

Judging from your error message:

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

It seems that the weights were either not downloaded completely or not saved to the expected location. They are typically stored in ~/.cache/torch/hub/checkpoints. Try deleting the existing vit_h_14* files from there to force a fresh download, and make sure that the path is accessible to your script when you run it.
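
As a rough way to spot such files before deleting anything, the sketch below lists cached checkpoints that Python's zipfile module cannot recognise as zip archives, which is the same kind of failure as the "failed finding central directory" message. The function name `find_corrupt_checkpoints` and the `DEFAULT_CHECKPOINT_DIR` constant are hypothetical helpers, not part of torch or torchvision, and the default path assumes the cache location mentioned above (it differs if TORCH_HOME is set). Checkpoints saved in the older non-zip format would also be listed, so review the output before removing files.

```python
import zipfile
from pathlib import Path

# Assumed default cache location (see the comment above); it is different
# when TORCH_HOME or XDG_CACHE_HOME is set.
DEFAULT_CHECKPOINT_DIR = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"


def find_corrupt_checkpoints(checkpoint_dir=DEFAULT_CHECKPOINT_DIR, pattern="*.pth"):
    """Return the files matching `pattern` that are not readable zip archives.

    Hypothetical helper: zipfile.is_zipfile looks for the end-of-central-directory
    record, which a truncated download does not have.
    """
    checkpoint_dir = Path(checkpoint_dir)
    if not checkpoint_dir.is_dir():
        return []
    return sorted(
        path for path in checkpoint_dir.glob(pattern)
        if not zipfile.is_zipfile(path)
    )


if __name__ == "__main__":
    for path in find_corrupt_checkpoints(pattern="vit_h_14*"):
        print(f"not a valid zip archive: {path}")
```

Any file reported here can be deleted by hand (or with `path.unlink()`), after which the next `get_model(..., weights=...)` call should download it again.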

jxguo14 commented 1 year ago

I ran into the same problem when using pretrained weights to train the SSD model. The command I used is:

torchrun --nproc_per_node=8 train.py \
    --dataset coco --model ssd300_vgg16 --epochs 120 \
    --lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4 \
    --weight-decay 0.0005 --data-augmentation ssd --weights-backbone VGG16_Weights.IMAGENET1K_FEATURES

I have already checked the weights in ~/.cache/torch/hub/checkpoints. Can you give me some advice?
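
A file being present in the cache does not show that it is intact. One possible check, sketched below, relies on the torch.hub naming convention in which the filename ends with the first eight or more hex digits of the file's SHA256 hash (for example `name-<sha256 prefix>.pth`). The helper `matches_filename_hash` and the `HASH_SUFFIX` pattern are hypothetical and written only for illustration; they assume the cached file follows that convention.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical pattern: a run of 8+ hex digits between the last "-" and the extension.
HASH_SUFFIX = re.compile(r"-([a-f0-9]{8,})\.[^.]+$")


def matches_filename_hash(path, chunk_size=1 << 20):
    """Compare a file's SHA256 digest with the hash prefix in its filename.

    Returns True or False when the name carries a hash prefix, and None when
    it does not (so nothing can be verified).
    """
    path = Path(path)
    match = HASH_SUFFIX.search(path.name)
    if match is None:
        return None
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest().startswith(match.group(1))


if __name__ == "__main__":
    cache = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"  # assumed location
    for checkpoint in sorted(cache.glob("*.pth")):
        print(checkpoint.name, matches_filename_hash(checkpoint))
```

A result of False suggests the file is incomplete or altered and is worth deleting and downloading again; None only means the name carries no hash to compare against.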