pmichel31415 / are-16-heads-really-better-than-1

Code for the paper "Are Sixteen Heads Really Better than One?"
MIT License

BERT actually_prune option not working #3

Closed. pglock closed this issue 5 years ago.

pglock commented 5 years ago

Hi,

thanks for your code! The pruning works great using masking. However, when I tried to actually prune the model to see if there's a speedup, it failed.

bash experiments/BERT/heads_pruning.sh SST-2 --actually_prune
  1. The new_layer in prune_linear_layer has to be moved to the correct device (a device-aware sketch of the helper follows the traceback below):
new_layer.to(layer.weight.device)
  2. The forward function fails because the input and output shapes of the previous layer do not seem to match:
13:09:27-INFO: Evaluating following pruning strategy
13:09:27-INFO: 9:3 10:10 11:3,7,8,9,10
13:09:27-INFO: ***** Running evaluation *****
13:09:27-INFO:   Num examples = 872
13:09:27-INFO:   Batch size = 32
Evaluating:   0%|                                                                | 0/28 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "pytorch-pretrained-BERT/examples/run_classifier.py", line 578, in <module>
    main()
  File "pytorch-pretrained-BERT/examples/run_classifier.py", line 514, in main
    scorer=processor.scorer,
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/examples/classifier_eval.py", line 78, in evaluate
    input_ids, segment_ids, input_mask, label_ids)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 1072, in forward
    output_all_encoded_layers=False, return_att=return_att)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 769, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 458, in forward
    hidden_states, attn = layer_module(hidden_states, attention_mask)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 441, in forward
    attention_output, attn = self.attention(hidden_states, attention_mask)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 335, in forward
    self_output, attn = self.self(input_tensor, attention_mask)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/projects/are-16-heads-really-better-than-1/pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py", line 274, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/glock/.pyenv/versions/pruning/lib/python3.7/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [4096 x 768], m2: [576 x 768] at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/generic/THCTensorMathBlas.cu:268
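
For reference, the first problem goes away with a device-aware version of the helper along the lines of the sketch below. This is a minimal reconstruction, not the repo's actual code: the (layer, index, dim) signature and the copy logic are assumptions modeled on common PyTorch pruning helpers, with the device fix from point 1 applied.

import torch
from torch import nn

def prune_linear_layer(layer, index, dim=0):
    # Keep only the rows/columns listed in `index` along `dim`
    # (dim=0 prunes output features, dim=1 prunes input features).
    index = index.to(layer.weight.device)
    W = layer.weight.index_select(dim, index).clone().detach()
    b = None
    if layer.bias is not None:
        b = layer.bias.clone().detach() if dim == 1 else layer.bias[index].clone().detach()
    new_size = list(layer.weight.size())
    new_size[dim] = len(index)
    new_layer = nn.Linear(new_size[1], new_size[0], bias=layer.bias is not None)
    # The fix from point 1: build the replacement layer on the same device as the original.
    new_layer = new_layer.to(layer.weight.device)
    with torch.no_grad():
        new_layer.weight.copy_(W.contiguous())
        if b is not None:
            new_layer.bias.copy_(b.contiguous())
    return new_layer
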
pmichel31415 commented 5 years ago

Oops, thanks for the heads up! I had forgotten to commit and push the fix. Try pulling the branch paul from https://github.com/pmichel31415/pytorch-pretrained-BERT ; the problem should be fixed (feel free to reopen this issue if you run into any other problems).
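
For anyone who wants to verify that actual pruning yields a wall-clock speedup once the fix is in, a minimal timing sketch along these lines should do; time_forward, the warm-up count, and n_runs are ad hoc names for illustration, not part of the repo:

import time
import torch

def time_forward(model, inputs, n_runs=50):
    # Returns the average forward-pass latency in milliseconds.
    model.eval()
    with torch.no_grad():
        for _ in range(5):
            model(*inputs)  # warm-up so one-off CUDA kernel launches don't skew timing
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(*inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1e3

Running this on the same batch before and after --actually_prune shows whether dropping heads translates into measurable gains.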