Not seeing the inference speed up on cuda using the sparse trainer notebook

huggingface / nn_pruning

Prune a model while finetuning or training.

Apache License 2.0

Hi @madlag, I tried a notebook that is very similar to the one you shared in issue #5, but I do not see any speedup once the models are moved to CUDA, although I do see about a 1.3x speedup on CPU. I am running this on an EC2 g4dn.2xlarge instance, which has a T4 GPU.

This is my training code and this is the inference code. I wonder if I am missing something here.

The parameter count shows the reduction, but inference latency is ~9 ms for both the pruned and the non-pruned model.

prunebert_model.num_parameters() / bert_model_original.num_parameters() = 0.6118184376136527
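
One thing worth ruling out when comparing GPU latencies like the ~9 ms figure above is asynchronous kernel launch: CUDA calls return before the work finishes, so a timer without an explicit synchronization can mostly measure launch overhead rather than compute. Below is a minimal sketch of a timing helper under that assumption. The name `time_callable` and its parameters are hypothetical and are not taken from the notebook or from nn_pruning; the helper only uses the Python standard library so it also runs on a machine without a GPU.

```python
import statistics
import time


def time_callable(fn, sync=None, warmup=10, iters=50):
    """Return the median wall-clock latency of fn() in milliseconds.

    sync is an optional zero-argument callable that blocks until pending
    device work has finished (for PyTorch on CUDA: torch.cuda.synchronize).
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()  # drain any work queued during warm-up

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples_ms)


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs without a GPU.
    latency = time_callable(lambda: sum(range(10_000)))
    print(f"{latency:.3f} ms")
```

With PyTorch, the same helper could be called as `time_callable(lambda: model(**inputs), sync=torch.cuda.synchronize)` inside a `torch.no_grad()` block, once for the pruned and once for the original model, where `model` and `inputs` are placeholders for whatever the inference code already builds.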

Thanks for your help and the great work.

Not seeing the inference speed up on cuda using the sparse trainer notebook #27