neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0
2.05k stars, 144 forks

Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale" #2258

Closed by thijsgelton 5 months ago

thijsgelton commented 5 months ago

Describe the bug When running the steps from the ultralytics yolov8 tutorial: https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md#sparse-transfer-learning-with-a-custom-dataset

I cannot get it to work with my own dataset. According to the tutorial it should be straightforward, but instead I get this error: Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale", ...

Expected behavior I expected the fine-tuning to run perfectly.

To Reproduce https://www.kaggle.com/code/thijsgelton/trying-sparseml

The notebook linked above shows the exact three steps I ran.

Errors

  File "/root/.config/Ultralytics/DDP/_temp_8105lcbi134781398669712.py", line 4, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/sparseml/yolov8/trainers.py", line 179, in train
    self._do_train(world_size)
  File "/opt/conda/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 276, in _do_train
    self._setup_train(world_size)
  File "/opt/conda/lib/python3.10/site-packages/sparseml/yolov8/trainers.py", line 295, in _setup_train
    super()._setup_train(world_size)
  File "/opt/conda/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 212, in _setup_train
    ckpt = self.setup_model()
  File "/opt/conda/lib/python3.10/site-packages/sparseml/yolov8/trainers.py", line 224, in setup_model
    self.model.load_state_dict(ckpt["model"])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DetectionModel:
    Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale", "model.0.conv.quant.activation_post_process.zero_point", "model.0.conv.quant.activation_post_process.fake_quant_enabled", "model.0.conv.quant.activation_post_process.observer_enabled", ....
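
PyTorch's load_state_dict raises "Missing key(s)" when the model being built expects entries that the checkpoint does not contain. Every key listed here sits under quant.activation_post_process, so one plausible reading is that the model has quantization observers attached while the checkpoint was saved without them. Below is a minimal sketch for checking that. It assumes the two key lists come from model.state_dict().keys() and ckpt["model"].keys(). The helper name diff_state_dict_keys is hypothetical and is not part of SparseML or PyTorch.

```python
def diff_state_dict_keys(model_keys, ckpt_keys, marker="activation_post_process"):
    """Compare the keys a model expects with the keys a checkpoint provides.

    `marker` is an assumption taken from the key names in the error above;
    it is used only to flag which missing keys look quantization-related.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)

    # Keys the model expects but the checkpoint lacks ("Missing key(s)").
    missing = sorted(model_keys - ckpt_keys)
    # Keys the checkpoint carries but the model does not know about.
    unexpected = sorted(ckpt_keys - model_keys)
    # Subset of missing keys that belong to quantization observers.
    quant_missing = [key for key in missing if marker in key]

    return {
        "missing": missing,
        "unexpected": unexpected,
        "quant_missing": quant_missing,
        # True when something is missing and all of it is quantization state.
        "only_quant_missing": bool(missing) and len(quant_missing) == len(missing),
    }
```

If only_quant_missing comes back True, the mismatch is limited to quantization state, which points at the model and the checkpoint being at different stages of the recipe rather than at a corrupted file.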

thijsgelton commented 5 months ago

Could it be that you cannot run on GPU when training with sparseml.ultralytics.train? When I switch to a CPU Kaggle environment, I am able to train the model.

thijsgelton commented 5 months ago

I got it to work eventually. It seems that running it distributed was the problem. Adding the required launcher prefix in front of the command could probably fix that, but for now running it on a single GPU works.
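
The thread does not show how the single-GPU run was launched. One common way to keep PyTorch from seeing more than one device is to set CUDA_VISIBLE_DEVICES for the child process. The sketch below assumes that approach. The helper names single_gpu_env and build_train_command are hypothetical, and only the --model, --recipe and --data flags that appear in this thread are used.

```python
import os


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one CUDA device."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA-aware processes started with this environment see a single GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def build_train_command(model, recipe, data):
    """Assemble a sparseml.ultralytics.train call from the flags used in this thread."""
    return [
        "sparseml.ultralytics.train",
        "--model", model,
        "--recipe", recipe,
        "--data", data,
    ]
```

The two pieces would be combined as subprocess.run(build_train_command(model, recipe, data), env=single_gpu_env(0), check=True). Whether this avoids the state_dict error in a given setup is not confirmed here.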

imAhmadAsghar commented 4 months ago

@thijsgelton Hi, can you please tell me exactly what you did to load the pruned and quantized model successfully? I have a yolov8n model that I trained with the quantized recipe, but unfortunately I cannot load it.

thijsgelton commented 4 months ago

Eventually I was able to run it with the following command on Kaggle with two T4s (so distributed):

import subprocess

# Launch sparseml.ultralytics.train on 2 GPUs through torch.distributed.run.
# --no_python makes the launcher execute the entry point directly.
subprocess.run([
    "python", "-m", "torch.distributed.run", "--no_python", "--nproc_per_node", "2",
    "sparseml.ultralytics.train",
    "--model", "/kaggle/working/runs/detect/train/weights/best.pt",
    "--recipe", "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80-none",
    "--data", "/kaggle/working/spaceship.yaml",
    "--batch=32", "--lr0", "0.0015", "--lrf=0.1", "--momentum", "0.85",
    "--mosaic", "0.95", "--mixup", "0.15", "--scale", "0.9",
    "--recipe_args", '{"num_epochs": 50}',
    "--resume",
])

imAhmadAsghar commented 4 months ago

@thijsgelton, I trained the model using a quantization recipe. While converting the trained model to ONNX, I get RuntimeError: Error(s) in loading state_dict for DetectionModel: Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale"... I have not figured out the problem yet.