ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

KeyError: 'module_list.85.Conv2d.weight' #657

Closed Samjith888 closed 4 years ago

Samjith888 commented 4 years ago

Got the following error:

$ python train.py --data data/coco.data --cfg cfg/yolov3.cfg
Namespace(accumulate=2, adam=False, arc='default', batch_size=32, bucket='', cache_images=False, cfg='cfg/yolov3.cfg', data='data/coco.data', device='', epochs=273, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=False, var=None, weights='weights/ultralytics49.pt')
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1070', total_memory=8116MB)

Traceback (most recent call last):
  File "train.py", line 444, in <module>
    train()  # train normally
  File "train.py", line 111, in train
    chkpt['model'] = {k: v for k, v in chkpt['model'].items() if model.state_dict()[k].numel() == v.numel()}
  File "train.py", line 111, in <dictcomp>
    chkpt['model'] = {k: v for k, v in chkpt['model'].items() if model.state_dict()[k].numel() == v.numel()}
KeyError: 'module_list.85.Conv2d.weight'
glenn-jocher commented 4 years ago

@Samjith888 your command automatically loads the ultralytics49.pt backbone, which requires yolov3-spp.cfg. You must remove the backbone by using --weights '', or specify a weights-cfg combination that is compatible.

This error is caused by a user supplying incompatible --weights and --cfg arguments. To solve this you must specify no weights (i.e. random initialization of the model) using --weights '' and any --cfg, or use a --cfg that is compatible with your --weights. If none are specified, the defaults are --weights ultralytics49.pt and --cfg cfg/yolov3-spp.cfg.
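The traceback above fails inside the dictionary comprehension because `model.state_dict()[k]` is indexed with a checkpoint key that the model built from the cfg does not contain. Below is a minimal sketch of a more defensive filter that checks key membership before comparing sizes. The name `filter_compatible` and the `FakeTensor` stand-in are hypothetical and exist only so the example runs without PyTorch; the element counts are made up, and this is not the code in train.py.

```python
class FakeTensor:
    """Hypothetical stand-in for a torch.Tensor; only numel() is modeled."""

    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n


def filter_compatible(checkpoint_state, model_state):
    """Keep checkpoint entries whose key exists in the model with the same element count."""
    return {
        k: v
        for k, v in checkpoint_state.items()
        if k in model_state and model_state[k].numel() == v.numel()
    }


# Illustrative data only: the second key exists in the checkpoint but not in the model.
checkpoint_state = {
    "module_list.0.Conv2d.weight": FakeTensor(864),
    "module_list.85.Conv2d.weight": FakeTensor(1024),
}
model_state = {"module_list.0.Conv2d.weight": FakeTensor(864)}

kept = filter_compatible(checkpoint_state, model_state)
print(sorted(kept))
```

Skipping unmatched keys avoids the KeyError, but it would also silently load only part of a mismatched backbone, so choosing a matching --weights and --cfg pair remains the real fix.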

Compatible --weights --cfg combinations:

python3 train.py --weights yolov3.pt --cfg cfg/yolov3.cfg
python3 train.py --weights yolov3.weights --cfg cfg/yolov3.cfg
python3 train.py --weights yolov3-spp.pt --cfg cfg/yolov3-spp.cfg
python3 train.py --weights ultralytics49.pt --cfg cfg/yolov3-spp.cfg
python3 train.py --weights ultralytics68.pt --cfg cfg/yolov3-spp.cfg

To train from scratch (randomly initialized weights), use:

python3 train.py --weights '' --cfg cfg/*.cfg  # any cfg will work here
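The combinations listed above can be written down as a small lookup, which may help when scripting several training runs. This is only a sketch: `COMPATIBLE_CFG` and `is_compatible` are hypothetical names, the table contains only the pairs listed in this comment, and anything not in the list is reported as unknown rather than guessed.

```python
import os

# Pairs copied from the list above: weights file name -> cfg file name.
COMPATIBLE_CFG = {
    "yolov3.pt": "yolov3.cfg",
    "yolov3.weights": "yolov3.cfg",
    "yolov3-spp.pt": "yolov3-spp.cfg",
    "ultralytics49.pt": "yolov3-spp.cfg",
    "ultralytics68.pt": "yolov3-spp.cfg",
}


def is_compatible(weights, cfg):
    """Return True or False for a known --weights/--cfg pair, or None if unknown."""
    if weights == "":
        return True  # random initialization works with any cfg
    expected = COMPATIBLE_CFG.get(os.path.basename(weights))
    if expected is None:
        return None  # weights file is not in the list above
    return os.path.basename(cfg) == expected


print(is_compatible("weights/ultralytics49.pt", "cfg/yolov3.cfg"))
print(is_compatible("", "cfg/yolov3.cfg"))
```

The first call corresponds to the command in the original report and evaluates to False; the second is the train-from-scratch case.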

ultralytics49.pt is currently the highest performing YOLOv3 model (trained from scratch using this repo) available at the default img-size of 416 (see https://github.com/ultralytics/yolov3/issues/310), which is the reason it is used as the default backbone.

hanrui15765510320 commented 4 years ago

If I don't want pretrained weights, what should I do?

okanlv commented 4 years ago

As @glenn-jocher said,

You must remove the backbone by using --weights ''

hanrui15765510320 commented 4 years ago

Thanks, bro.

daddydrac commented 4 years ago

I ran this: python3 train.py --data data/custom.data --cfg cfg/yolov3-spp-r.cfg

And got:

AssertionError: Target classes exceed model classes

What am I missing?
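The assertion message suggests that at least one label carries a class index that is not below the number of classes the model was built for. A hedged sketch of a quick check follows, assuming Darknet-style label lines of the form `class x_center y_center width height` with zero-based class indices. The function name `max_class_index`, the sample lines, and the class count are hypothetical.

```python
def max_class_index(label_lines):
    """Return the largest class index found in Darknet-style label lines, or -1 if none."""
    largest = -1
    for line in label_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        largest = max(largest, int(float(fields[0])))
    return largest


# Made-up label lines; in practice these would be read from the dataset's label files.
sample_labels = ["0 0.5 0.5 0.2 0.3", "2 0.1 0.4 0.05 0.1", ""]
num_classes = 2  # assumed class count that the cfg and .data file were written for

largest = max_class_index(sample_labels)
labels_fit = largest < num_classes
print(largest, labels_fit)
```

In this made-up sample, index 2 is out of range for a 2-class model, so either the labels or the class counts in the .data and cfg files would need to change.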

glenn-jocher commented 4 years ago

I'll close this issue for now as the original issue appears to have been resolved, and/or no activity has been seen for some time. Feel free to comment if this is not the case.

rohan-pradhan commented 4 years ago

Hi guys,

I'm trying to train on a custom CFG (therefore should be using a random initialization of weights). I understand that to do this we should set --weights ''

Unfortunately, even when I do that, it keeps trying to download the weights and I get this error: Exception: '' missing, try downloading from https://drive.google.com/open?id=1LezFG5g3BCW6iYaV89B2i64cqEUZD7e0

This is the full command I am using to train: python train.py --weights '' --cfg cfg/yolov3-custom.cfg --data data/coco1.data

Any help would be great - thanks!

glenn-jocher commented 4 years ago

@rohan-pradhan no space: --weights ''

$ python3 train.py --weights '' --data coco16.data

Namespace(accumulate=4, adam=False, arc='default', batch_size=16, bucket='', cache_images=False, cfg='cfg/yolov3-spp.cfg', data='coco16.data', device='', epochs=273, evolve=False, img_size=[416], multi_scale=False, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, var=None, weights='')
Using CPU

Caching labels (16 found, 0 missing, 0 empty, 0 duplicate, for 16 images): 100%|█████████████████████████████| 16/16 [00:00<00:00, 2515.70it/s]
Caching labels (16 found, 0 missing, 0 empty, 0 duplicate, for 16 images): 100%|█████████████████████████████| 16/16 [00:00<00:00, 5567.35it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Using 8 dataloader workers
Starting training for 273 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/272        0G       7.7      13.3      7.87      28.9       211       416: 100%|██████████████████████████| 1/1 [01:05<00:00, 65.12s/it]
               Class    Images   Targets         P         R   mAP@0.5        F1:   0%|                                  | 0/1 [00:00<?, ?it/s]
rohan-pradhan commented 4 years ago

Thanks for the quick response, Glenn. Unfortunately, even when I copy and paste your command it still gives the same error.

`>python train.py --weights '' --data coco1.data
Namespace(accumulate=4, adam=False, arc='default', batch_size=16, bucket='', cache_images=False, cfg='cfg/yolov3-spp.cfg', data='coco1.data', device='', epochs=273, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=False, var=None, weights="''")
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11264MB)

2020-01-23 11:02:59.119516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Downloading https://pjreddie.com/media/files/''
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
'rm' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "train.py", line 463, in <module>
    train()  # train normally
  File "train.py", line 108, in train
    attempt_download(weights)
  File "C:\Users\Rohan\Documents\Development\Thesis\yolov3\models.py", line 454, in attempt_download
    raise Exception(msg)
Exception: '' missing, try downloading from https://drive.google.com/open?id=1LezFG5g3BCW6iYaV89B2i64cqEUZD7e0`

Not sure why it is treating '' as a string.

rohan-pradhan commented 4 years ago

Figured it out! Changed it to --weights "" and it seemed to work.

Thanks again!

glenn-jocher commented 4 years ago

@rohan-pradhan ah interesting. What's your OS?

rohan-pradhan commented 4 years ago

@glenn-jocher I'm running Windows 10 in a Conda environment (Anaconda Prompt).

glenn-jocher commented 4 years ago

@rohan-pradhan hmm ok. Perhaps it's Windows.
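The Namespace line in the log above shows weights="''", meaning the script received two literal quote characters rather than an empty string. That is consistent with the Windows command prompt not treating single quotes as quoting characters, which is why --weights "" worked. Below is a hedged sketch of how a script could tolerate both spellings. `normalize_weights_arg` is a hypothetical helper, not something train.py provides.

```python
def normalize_weights_arg(value):
    # Treat a literal pair of single or double quotes as an empty string,
    # so that both --weights '' and --weights "" mean "no weights".
    stripped = value.strip()
    if stripped in ("''", '""'):
        return ""
    return stripped


print(repr(normalize_weights_arg("''")))
print(repr(normalize_weights_arg("weights/yolov3.pt")))
```

Normalizing the argument once, right after parsing, keeps the rest of the script free of shell-specific special cases.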

sunset326 commented 4 years ago

Hi guys, when I run any of the following commands:

python train.py --data data/rbc.data --cfg cfg/yolov3.cfg --weights ""
python train.py --data data/rbc.data --cfg cfg/yolov3.cfg --weights ''
python train.py --data data/rbc.data --cfg cfg/yolov3.cfg --weights weights/yolov3.pt
python train.py --data data/rbc.data --cfg cfg/yolov3.cfg --weights weights/yolov3.weights

the same error occurs, as shown below. My PyTorch is 1.5.1 with torchvision 0.6.0.

Traceback (most recent call last):
  File "train.py", line 431, in <module>
    train(hyp)  # train normally
  File "train.py", line 164, in train
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1', verbosity=0)
  File "/home/anaconda2/envs/Maskrcnn_Benchmark/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/frontend.py", line 339, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/anaconda2/envs/Maskrcnn_Benchmark/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 228, in _initialize
    handle = amp_init(loss_scale=properties.loss_scale, verbose=(_amp_state.verbosity == 2))
  File "/home/anaconda2/envs/Maskrcnn_Benchmark/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/amp.py", line 101, in init
    try_caching, verbose)
  File "/home/anaconda2/envs/Maskrcnn_Benchmark/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 33, in cached_cast
    if not utils.has_func(mod, fn):
  File "/home/anaconda2/envs/Maskrcnn_Benchmark/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/utils.py", line 132, in has_func
    if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
AttributeError: module 'torch.nn' has no attribute 'backends'

glenn-jocher commented 4 years ago

@sunset326 update torch to latest version.

sunset326 commented 4 years ago

@sunset326 update torch to latest version.

Thanks, brother, I have solved the problem. requirements.txt says Python >= 3.7; I updated my Python, and the problem no longer occurs.
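A version check at the top of a script can surface this kind of mismatch before any training code runs. Below is a minimal sketch that takes the Python >= 3.7 requirement from the comment above as an assumption. `MIN_PYTHON` and `python_ok` are hypothetical names, and the actual minimum should be confirmed against the repo's requirements file.

```python
import sys

MIN_PYTHON = (3, 7)  # assumed from the comment above; confirm against the repo


def python_ok(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


# The traceback above came from a python3.6 environment, which fails this check.
print(python_ok((3, 6, 9)))
print(python_ok())
```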

glenn-jocher commented 8 months ago

@sunset326 Great to hear that updating Python resolved the issue! If you have any more questions or run into further issues, feel free to ask. Happy training! 🚀