Closed. thijsgelton closed this issue 5 months ago.
Could it be that training with sparseml.ultralytics.train does not work on GPU? When I switch to a CPU Kaggle environment, I am able to train the model.
I got it to work eventually. It seems that running it distributed was what caused the trouble. Prefixing the command with the required launcher statement would probably fix that, but for now running it on a single GPU works.
@thijsgelton Hi, can you please tell me exactly what you did to load the pruned and quantized model successfully? I have a yolov8n model that I trained with the quantized recipe, but unfortunately I cannot load it.
Eventually I was able to run it with the following command on Kaggle with 2 T4s (so distributed):
import subprocess

# Launch sparseml.ultralytics.train on 2 GPUs through torch.distributed.run.
# --no_python tells the launcher to execute the entry point directly.
subprocess.run([
    "python", "-m", "torch.distributed.run", "--no_python", "--nproc_per_node", "2",
    "sparseml.ultralytics.train",
    "--model", "/kaggle/working/runs/detect/train/weights/best.pt",
    "--recipe", "zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80-none",
    "--data", "/kaggle/working/spaceship.yaml",
    "--batch=32", "--lr0", "0.0015", "--lrf=0.1", "--momentum", "0.85",
    "--mosaic", "0.95", "--mixup", "0.15", "--scale", "0.9",
    "--recipe_args", '{"num_epochs": 50}', "--resume",
])
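The same invocation can be assembled programmatically so that the distributed launcher prefix is only added when more than one GPU is requested. This is a minimal sketch, not part of SparseML: `build_train_command` is a hypothetical helper, and it assumes `sparseml.ultralytics.train` is available on PATH as a console script (which the use of `--no_python` in the command above suggests). Only flags that appear in the command above are used.

```python
from typing import List


def build_train_command(model: str, recipe: str, data: str, num_gpus: int = 1) -> List[str]:
    """Return an argv list for sparseml.ultralytics.train (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    train_args = [
        "sparseml.ultralytics.train",
        "--model", model,
        "--recipe", recipe,
        "--data", data,
    ]
    if num_gpus == 1:
        # Single process: call the console script directly (assumed to be on PATH).
        return train_args
    # Several GPUs: prefix with the torch.distributed.run launcher, as in the thread.
    launcher = [
        "python", "-m", "torch.distributed.run",
        "--no_python", "--nproc_per_node", str(num_gpus),
    ]
    return launcher + train_args


cmd = build_train_command(
    model="/kaggle/working/runs/detect/train/weights/best.pt",
    recipe="zoo:cv/detection/yolov8-m/pytorch/ultralytics/voc/pruned80-none",
    data="/kaggle/working/spaceship.yaml",
    num_gpus=2,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`; hyperparameter flags such as `--batch` or `--recipe_args` would be appended to it in the same way as in the original command.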
@thijsgelton, I trained the model using a quantization recipe. While converting the trained model to ONNX, I get RuntimeError: Error(s) in loading state_dict for DetectionModel: Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale"...... I have not been able to figure out the problem yet.
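In PyTorch, "Missing key(s) in state_dict" lists names that the module being loaded into expects but that the supplied state dict does not contain. A quick way to narrow such an error down is to compare the two sets of key names directly. The sketch below does that on plain lists of strings; `diff_state_dict_keys` is a hypothetical helper, and the key names other than the one quoted in the error are made up for illustration. It does not diagnose why the SparseML checkpoint and model disagree.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    expected: Iterable[str], provided: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key names, using PyTorch's terminology."""
    expected_set, provided_set = set(expected), set(provided)
    missing = sorted(expected_set - provided_set)      # module expects, checkpoint lacks
    unexpected = sorted(provided_set - expected_set)   # checkpoint has, module does not
    return missing, unexpected


# Illustrative key names only. In practice they would come from
# model.state_dict().keys() and from the state dict stored in the checkpoint.
model_keys = [
    "model.0.conv.weight",
    "model.0.conv.quant.activation_post_process.scale",
]
checkpoint_keys = ["model.0.conv.weight"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

If every missing key sits under a `quant` submodule, as in the error above, that suggests the module was built with quantization wrappers while the loaded weights were saved without them, or the other way around for unexpected keys.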
Describe the bug When running the steps from the Ultralytics YOLOv8 tutorial (https://github.com/neuralmagic/sparseml/blob/main/integrations/ultralytics-yolov8/tutorials/sparse-transfer-learning.md#sparse-transfer-learning-with-a-custom-dataset), I cannot get it to work with my own dataset. According to the tutorial it should be straightforward, but instead I get: Missing key(s) in state_dict: "model.0.conv.quant.activation_post_process.scale...."
Expected behavior I expected the fine-tuning to run without errors.
Environment Include all relevant environment information:
f7245c8
To Reproduce https://www.kaggle.com/code/thijsgelton/trying-sparseml
The notebook shows the exact 3 steps I followed.