microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Unclear usage of training loop and training_steps in ActivationAPoZRankPruner (model compression, pruning) #5405

Open Michelvl92 opened 1 year ago

Michelvl92 commented 1 year ago

Describe the issue: The workings of the ActivationAPoZRankPruner and its example [examples/model_compress/pruning/activation_pruning_torch.py](https://github.com/microsoft/nni/blob/bfacbc28de70fa30d24676e6dc13dae587ab181e/examples/model_compress/pruning/activation_pruning_torch.py) are unclear.

I understand from the reference paper that a pre-training step is required (if needed), and that after pruning (with APoZ) fine-tuning is required to "repair" the introduced error. This is fully clear. What I do not understand is why training steps are required in the APoZ pruner itself, since pre-training and fine-tuning are done separately. From activation-apoz-rank-pruner I understand that these steps are needed to collect the activations (which makes sense), but why are training steps used rather than regular inference steps? Since training is performed, it looks as if the model is actually being trained, and my model's loss and precision change during these steps (as printed by a callback in my training loop).
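For reference, my setup follows the linked example, roughly like the sketch below. This assumes the evaluator-based API (`TorchEvaluator` / `training_steps`); exact class and argument names may differ between nni versions, so check the docs of the version you have installed:

```python
# Minimal sketch, assuming the evaluator-based API of the linked example;
# the model and training_func here are toy placeholders.
import torch
import torch.nn.functional as F
import nni
from nni.compression.pytorch import TorchEvaluator
from nni.compression.pytorch.pruning import ActivationAPoZRankPruner

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(),
                            torch.nn.Conv2d(8, 8, 3), torch.nn.ReLU())

def training_func(model, optimizer, criterion, *args, **kwargs):
    # placeholder training loop: nni runs this only to collect activations
    for _ in range(100):
        x = torch.randn(4, 3, 32, 32)
        optimizer.zero_grad()
        loss = criterion(model(x).sum(), torch.tensor(0.))
        loss.backward()
        optimizer.step()

criterion = F.mse_loss
traced_optimizer = nni.trace(torch.optim.SGD)(model.parameters(), lr=0.01)
evaluator = TorchEvaluator(training_func, traced_optimizer, criterion)

config_list = [{'sparsity_per_layer': 0.5, 'op_types': ['Conv2d']}]
pruner = ActivationAPoZRankPruner(model, config_list, evaluator, training_steps=20)
masked_model, masks = pruner.compress()
```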

Furthermore, during the pruning training loop, the loss and precision of my model (not the masked model) look as expected (similar to before pruning). But when I evaluate the model afterwards (before fine-tuning), the precision is 0, which is strange. Why is there a difference between the training precision during pruning and the evaluation precision afterwards, on the same data set used during training, before calling pruner._unwrap_model()?

Environment:

J-shang commented 1 year ago

Thanks for your issue, it is really meaningful. You are right: the APoZ pruner should use inference steps, not training steps. We are working on the next compression version and will fix this problem.

The difference between the training precision during pruning and afterwards is because the masks are generated and applied to the model only after the last optimizer.step() call (during the pruning training), so fine-tuning is needed to recover the precision after pruning.
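Roughly, the sequence is as follows (a sketch; `evaluate()` and `finetune()` stand in for your own routines):

```python
# Sketch of the sequence described above.
masked_model, masks = pruner.compress()  # masks are applied only after the last
                                         # optimizer.step(), so metrics printed
                                         # during these steps come from the
                                         # still-unmasked model
evaluate(masked_model)                   # precision drops: weights are now masked
                                         # but nothing has been fine-tuned yet
pruner._unwrap_model()                   # remove the nni wrappers
finetune(masked_model)                   # fine-tuning recovers the precision
```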

Michelvl92 commented 1 year ago

Thanks for your reply and explanation.

Could you provide a timeline for the next compression version where this will be fixed? And what would be a good workaround in the meantime, e.g. would changing the training function to perform inference only be enough?

J-shang commented 1 year ago

I think a workaround is to set the lr of the optimizer to zero, or to remove loss.backward(). optimizer.step() still needs to be called at the end of each step, because nni counts the step number by counting how many times optimizer.step() is called.
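Something like this (a rough sketch, not tested; `inference_only_training_func` and `dataloader` are placeholder names):

```python
import torch

def inference_only_training_func(model, optimizer, criterion, dataloader, max_steps=20):
    # Hypothetical helper implementing the workaround above: zero the lr and
    # skip loss.backward() so the weights never move, while still calling
    # optimizer.step() so nni's step counter keeps advancing.
    model.train()
    for group in optimizer.param_groups:
        group['lr'] = 0.0  # even a stray gradient cannot move the weights now
    for step, (data, target) in enumerate(dataloader):
        if step >= max_steps:
            break
        optimizer.zero_grad()
        output = model(data)              # forward pass: this collects activations
        loss = criterion(output, target)  # optional, only useful for logging
        # loss.backward() intentionally omitted: no gradients, no real update
        optimizer.step()                  # still called, because nni counts these calls
```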

The next APoZ version is planned for nni 3.1; if it can be released earlier, I will let you know.

aidevmin commented 1 year ago

@Michelvl92 Were you able to run pruning on yolov7? Please share information with me. I got the error `Comparison exception: The values for attribute 'shape' do not match: torch.Size([]) != torch.Size([1, 1, 40, 40, 2])`.