Some questions about the trainer code

wcyjerry commented 8 months ago

Why aren't the teacher model's parameters frozen here?

huangzongmou commented 8 months ago

Hello! Your email has been received and I will reply as soon as possible. Thank you! Huang Zongmou

wcyjerry commented 8 months ago

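Also, I think the 1x1 convolution used for alignment inside Feature_loss should take part in backpropagation. As written, it looks like a randomly initialized convolution that is never updated during training.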
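To make the concern concrete, here is a minimal sketch of a feature loss whose 1x1 alignment convolution is actually trained. It is not the repository's Feature_loss: the class name `AlignedFeatureLoss`, the stand-in student and teacher layers, and the channel sizes are all assumptions chosen for illustration. The point it shows is that the loss module's parameters have to be passed to the optimizer alongside the student's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignedFeatureLoss(nn.Module):
    """MSE between teacher features and 1x1-conv-projected student features.

    Hypothetical illustration, not the repository's Feature_loss.
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # Registered as a submodule, so its weights show up in .parameters().
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # detach() keeps gradients from flowing into the teacher.
        return F.mse_loss(self.align(student_feat), teacher_feat.detach())


student = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # stand-in student layer
teacher = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # stand-in teacher layer
feature_loss = AlignedFeatureLoss(student_channels=8, teacher_channels=16)

# Key step: give the optimizer the loss module's parameters as well.
optimizer = torch.optim.SGD(
    list(student.parameters()) + list(feature_loss.parameters()), lr=0.1
)

x = torch.randn(2, 3, 32, 32)
align_before = feature_loss.align.weight.detach().clone()

with torch.no_grad():
    teacher_feat = teacher(x)

loss = feature_loss(student(x), teacher_feat)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The alignment conv has moved away from its random initialization.
align_changed = not torch.equal(align_before, feature_loss.align.weight.detach())
```

If the optimizer were built from `student.parameters()` alone, `loss.backward()` would still compute a gradient for `align`, but `optimizer.step()` would never apply it, so the projection would stay at its random initialization.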

wcyjerry commented 7 months ago

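Hi, after looking at the distillation code in other repositories, I think there are two bugs here:

The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)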
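The second point can be sketched as a small helper that freezes a teacher before distillation. The function name `freeze_teacher` and the toy teacher network are assumptions for illustration and do not come from the repository.

```python
import torch
import torch.nn as nn


def freeze_teacher(teacher: nn.Module) -> nn.Module:
    """Disable gradients for every teacher parameter and switch to eval mode."""
    for param in teacher.parameters():
        param.requires_grad_(False)
    teacher.eval()  # use BatchNorm running statistics and disable Dropout
    return teacher


# Toy teacher standing in for a real detection model.
teacher = freeze_teacher(
    nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.BatchNorm2d(8))
)

trainable = sum(p.numel() for p in teacher.parameters() if p.requires_grad)
out = teacher(torch.randn(1, 3, 16, 16))  # no autograd graph is recorded
```

With every parameter frozen, a forward pass on plain input tensors records no autograd graph, so no intermediate activations are kept for a backward pass. A later call to `teacher.train()` would switch the mode back, so the helper would need to run after any such call.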

wujianfei5201314 commented 7 months ago

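> Hi, after looking at the distillation code in other repositories, I think there are two bugs here:
>
> The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
>
> The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)

Hello, have you tried distilling a pruned model? I found that during distillation the parameter count and FLOPs are still those of the original YOLOv8 size.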

huangzongmou commented 7 months ago

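> Hi, after looking at the distillation code in other repositories, I think there are two bugs here:
>
> The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
>
> The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)

Please see line 566 of trainer.py:

    with torch.no_grad():
        pred = self.Distillation(batch['img'])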

huangzongmou commented 7 months ago

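> Also, I think the 1x1 convolution used for alignment inside Feature_loss should take part in backpropagation. As written, it looks like a randomly initialized convolution that is never updated during training.

Let me take a look at this first. It has been so long since I wrote this code that I no longer remember the details.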

wcyjerry commented 7 months ago

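> Hi, after looking at the distillation code in other repositories, I think there are two bugs here:
>
> The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
>
> The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)
>
> Please see line 566 of trainer.py: with torch.no_grad(): pred = self.Distillation(batch['img'])

Right, the second point does use no_grad, but I still think there is no need to set requires_grad to True on the teacher's parameters beforehand. The problem of the align module being left out of gradient updates does seem to exist and is probably worth fixing; it may be one reason the results are poor.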

wcyjerry commented 7 months ago

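> Hi, after looking at the distillation code in other repositories, I think there are two bugs here:
>
> The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
>
> The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)
>
> Hello, have you tried distilling a pruned model? I found that during distillation the parameter count and FLOPs are still those of the original YOLOv8 size.

Hello, I have not tried that. I am not very familiar with pruning, so I am not sure what problem you are running into.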

wujianfei5201314 commented 7 months ago

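> what problem you are running into

    from ultralytics import YOLO

    # Train the teacher first, with distillation disabled.
    model_t = YOLO('D:/yolov8/runs/train/yolov8-test6/weights/best.pt')
    data = "D:/yolov8717/yb3/data.yaml"
    model_t.train(data=data, epochs=1, imgsz=640, device='0', Distillation=None)
    model_t.model.model[-1].set_Distillation = True

    # Then train the pruned student, with the teacher passed in for distillation.
    model_s = YOLO('D:/yolov8/Torch-Pruning/weights/best.pt')
    model_s.train(data=data, epochs=1, imgsz=640, Distillation=model_t.model)

The result of this code is that the student model's weights get transferred onto the teacher model, and the student ends up with the teacher's size, parameter count, and FLOPs. I am using this with a newer YOLOv8 version and do not know the exact cause.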

YOLOv8-test6 summary: 328 layers, 10408639 parameters, 10408623 gradients, 27.3 GFLOPs
Transferred 243/471 items from pretrained weights
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs\detect\train42
Starting training for 1 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
0%| | 0/590 [00:00<?, ?it/s]tensor(31.1501, device='cuda:0', grad_fn=) tensor(198.4706, device='cuda:0', grad_fn=)
    1/1      5.79G      2.547      3.398      2.961         30        640: 100%|██████████| 590/590 [04:26<00:00, 2.21it/s]
  Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:20<00:00, 1.40it/s]
    all        899        899      0.187      0.151     0.0818     0.0233

1 epochs completed in 0.082 hours.
Optimizer stripped from runs\detect\train42\weights\last.pt, 21.1MB
Optimizer stripped from runs\detect\train42\weights\best.pt, 21.1MB

Validating runs\detect\train42\weights\best.pt...
Ultralytics YOLOv8.0.114 Python-3.9.17 torch-2.0.0+cu118 CUDA:0 (NVIDIA GeForce RTX 3070 Laptop GPU, 8192MiB)
YOLOv8-test6 summary: 270 layers, 10399391 parameters, 0 gradients, 27.1 GFLOPs
  Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:09<00:00, 2.90it/s]
    all        899        899      0.187       0.15      0.082     0.0234
Speed: 0.3ms preprocess, 4.0ms inference, 0.0ms loss, 1.8ms postprocess per image
Results saved to runs\detect\train42
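The log line "Transferred 243/471 items from pretrained weights" suggests that a network was built and then had checkpoint weights copied into it. One possible explanation, which is an assumption and not something confirmed in this thread, is that the trainer rebuilds the student from a stored model config and so discards the pruned layer shapes. A simple way to check is to compare parameter counts of the module before and after it is handed to the trainer. The helper `count_parameters` and the two toy networks below are hypothetical stand-ins for a full-size and a pruned model.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of parameter elements in the module."""
    return sum(p.numel() for p in model.parameters())


# Toy stand-ins: the "pruned" network has half the channels of the "full" one.
full = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3), nn.Conv2d(16, 32, kernel_size=3))
pruned = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.Conv2d(8, 16, kernel_size=3))

print(count_parameters(full), count_parameters(pruned))
```

If the count of the module that actually gets trained matches the unpruned figure in the summary line instead of the pruned checkpoint's count, the pruned structure was lost before training started.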

huangzongmou commented 7 months ago

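> Hi, after looking at the distillation code in other repositories, I think there are two bugs here:
>
> The parameters inside the loss class are never added to the optimizer, so they are not updated through backpropagation, which makes the results incorrect.
>
> The teacher network's parameters do not have requires_grad set to False, which adds extra GPU memory use. (Alternatively, run the teacher model's forward pass under with torch.no_grad().)
>
> Please see line 566 of trainer.py: with torch.no_grad(): pred = self.Distillation(batch['img'])
>
> Right, the second point does use no_grad, but I still think there is no need to set requires_grad to True on the teacher's parameters beforehand. The problem of the align module being left out of gradient updates does seem to exist and is probably worth fixing; it may be one reason the results are poor.

> But I still think there is no need to set requires_grad to True on the teacher's parameters beforehand

You can try removing it. I forget whether I added it because of some error I ran into. Since the forward pass still runs under no_grad afterward, it does not really matter.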
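The claim that the flag does not matter under no_grad can be checked with a toy module. This is a sketch on a stand-in `nn.Linear`, not the repository's teacher model: even with requires_grad left at True, a forward pass inside `torch.no_grad()` records no autograd graph.

```python
import torch
import torch.nn as nn

teacher = nn.Linear(4, 2)  # stand-in for the teacher network
for param in teacher.parameters():
    param.requires_grad_(True)  # deliberately left trainable

x = torch.randn(3, 4)

with torch.no_grad():
    pred = teacher(x)  # graph recording is disabled inside this block

tracked = teacher(x)  # the same call outside the block is tracked by autograd
```

Here `pred` carries no `grad_fn`, while `tracked` does, so the extra memory for the backward graph only appears when the teacher is called outside the no_grad block.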

huangzongmou / yolov8_Distillation

Some questions about the trainer code #5