RuntimeError: Error(s) in loading state_dict for DetectionModel

JueXiuHuang commented 1 year ago

Describe the bug: An error occurs when the training process finishes. I ran training with the script below:

sparseml.yolov5.train \
  --cfg yolov5s.yaml \
  --data ./datasets/RDRR/RDRR.yaml \
  --batch-size 64 --patience 50 --imgsz 256 \
  --recipe ./v5sparse/recipe/v5s/pruned85_quant.md \
  --hyp ./yolov5-master/data/hyps/hyp.scratch-low.yaml \
  --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none \
  --gradient-accum-steps 4 --optimizer SGD \
  --project yolov5-deepsparse --name yolov5s-sgd-pruned85-quantized-original

The recipe was downloaded from SparseZoo, and the hyp file is from Neural Magic's yolov5 fork. Training runs fine until ....

Model summary: 214 layers, 7111327 parameters, 7111327 gradients, 16.2 GFLOPs

Traceback (most recent call last):
  File "/home/joshua/v5sparse/bin/sparseml.yolov5.train", line 8, in <module>
    sys.exit(train())
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/sparseml/yolov5/scripts.py", line 41, in train
    train_run(**vars(opt))
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/yolov5/train.py", line 731, in run
    main(opt)
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/yolov5/train.py", line 631, in main
    train(opt.hyp, opt, device, callbacks)
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/yolov5/train.py", line 496, in train
    model = attempt_load(f, device)
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/yolov5/models/experimental.py", line 87, in attempt_load
    else load_sparsified_model(ckpt, device=device or "cpu")
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/yolov5/utils/neuralmagic/utils.py", line 116, in load_sparsified_model
    model.load_state_dict(ckpt["ema"] or ckpt["model"], strict=True)
  File "/home/joshua/v5sparse/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DetectionModel:
        Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale", "model.0.conv.quant.activation_post_process.zero_point", "model.0.conv.quant.activation_post_process.fake_quant_enabled", "model.0.conv.quant.activation_post_process.observer_enabled", "model.0.conv.quant.activation_post_process.scale",

I omitted the rest of the missing keys because the list is too long.
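For readers debugging a similar failure: "Missing key(s)" in a strict `load_state_dict` call means the model being loaded into expects entries that the checkpoint does not contain. Here the listed names are all `quant.activation_post_process` entries. A minimal sketch of how the two key sets could be compared, using plain key names so it runs without PyTorch. The helper `diff_state_dict_keys` and the weight key are hypothetical and only for illustration:

```python
from typing import Dict, Iterable, List


def diff_state_dict_keys(
    model_keys: Iterable[str], ckpt_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Split key names the way a strict load would report them."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    return {
        # Keys the model expects but the checkpoint does not provide.
        "missing": sorted(model_set - ckpt_set),
        # Keys the checkpoint provides but the model does not define.
        "unexpected": sorted(ckpt_set - model_set),
    }


# Illustrative key names only: the two quant keys come from the traceback,
# the weight key is a made-up stand-in for an ordinary parameter.
model_keys = [
    "model.0.conv.weight",
    "model.0.conv.quant.activation_post_process.scale",
    "model.0.conv.quant.activation_post_process.zero_point",
]
ckpt_keys = ["model.0.conv.weight"]

report = diff_state_dict_keys(model_keys, ckpt_keys)
quant_missing = [k for k in report["missing"] if "activation_post_process" in k]
print(len(report["missing"]), len(quant_missing), report["unexpected"])
```

With a real run, the inputs would be `model.state_dict().keys()` and the keys of the state dict handed to `load_state_dict`. If every missing key is a quantization entry, that would suggest the model was built with quantization applied while the saved weights were not, but that is an assumption to verify, not a confirmed diagnosis for this issue.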

Expected behavior: The script should finish training and save the model without errors.

Environment Include all relevant environment information:

OS : Ubuntu 20.04
Python version: 3.8
SparseML version or commit hash: sparseml==1.4.2
ML framework version(s): torch==1.12.0+cu116

Other Python package versions [e.g. SparseZoo, DeepSparse, numpy, ONNX]:

yolov5==6.2.0
torchvision==0.13.0+cu116
onnx==1.12.0
onnxruntime==1.14.1

Errors: I put the full error message in the attached error.txt.

How can I fix this?

KSGulin commented 1 year ago

Hi @JueXiuHuang, I tried your command locally (with the coco128 dataset) and it ran to completion. Can you confirm that the recipe you're pointing to is identical to the "original.md" recipe found at this model card? If it's not, could you paste the full recipe here?
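One way to check whether a local recipe matches a reference copy is a line-by-line diff. The sketch below uses Python's standard `difflib`. The helper name `recipe_diff` and the sample recipe lines are placeholders, not content from the actual recipe:

```python
import difflib
from typing import List


def recipe_diff(local_text: str, reference_text: str) -> List[str]:
    """Return unified-diff lines; an empty list means the recipes match."""
    # Ignore trailing whitespace so editor differences are not reported.
    local_lines = [line.rstrip() for line in local_text.splitlines()]
    ref_lines = [line.rstrip() for line in reference_text.splitlines()]
    return list(
        difflib.unified_diff(
            ref_lines, local_lines, fromfile="reference", tofile="local", lineterm=""
        )
    )


# Placeholder contents; in practice, read both files with open(path).read().
reference = "epochs: 10\nlr: 0.1\n"
local = "epochs: 20\nlr: 0.1\n"

for line in recipe_diff(local, reference):
    print(line)
```

Any line printed with a leading `-` is present only in the reference copy, and any line with a leading `+` is present only in the local copy.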

jeanniefinks commented 1 year ago

Hi @JueXiuHuang, we wanted to follow up and see whether you're able to confirm which recipe you used. Thank you so much! Jeannie / Neural Magic

jeanniefinks commented 1 year ago

As it's been some time with no further comments, @JueXiuHuang, we are going to go ahead and close this issue. If you would like to continue the conversation, please re-open the issue. Thank you! Jeannie / Neural Magic

neuralmagic / sparseml

RuntimeError: Error(s) in loading state_dict for DetectionModel #1444