princeton-nlp / CoFiPruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
MIT License

Questions about some code #8

Closed · pyf98 closed this issue 2 years ago

pyf98 commented 2 years ago

Hi, thanks for the great work! I have some questions about the current code.


First, is the following line intended? https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L667

zs = {key: inputs[key] for key in inputs if "_z" in inputs}

Should it be zs = {key: inputs[key] for key in inputs if "_z" in key} in order to extract zs from inputs?
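To double-check my understanding, here is a minimal sketch (with made-up keys and values, not code from the repo) of the difference: `"_z" in inputs` tests whether the literal string "_z" is a key of the dict, while `"_z" in key` tests whether each key name contains the substring "_z".

```python
# Minimal sketch with made-up keys/values: dict membership checks keys,
# so `"_z" in inputs` is False unless a key is literally named "_z".
inputs = {"input_ids": 1, "attention_mask": 2, "head_z": 3, "intermediate_z": 4}

buggy = {key: inputs[key] for key in inputs if "_z" in inputs}  # always empty here
fixed = {key: inputs[key] for key in inputs if "_z" in key}     # keeps the *_z entries

print(buggy)  # {}
print(fixed)  # {'head_z': 3, 'intermediate_z': 4}
```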


Second, what does the last term self.hidden_size * 4 represent in the following line, which computes the number of parameters of an FFN layer? https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L44

self.params_per_mlp_layer = self.hidden_size * self.intermediate_size * 2 + self.hidden_size + self.hidden_size * 4

I guess it means the bias parameter of the intermediate dense layer, so it is equivalent to self.intermediate_size?
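For what it's worth, a quick sanity check (not repo code) with BERT-base dimensions, where intermediate_size = 4 * hidden_size, gives the same count as the expression above:

```python
# Sanity check of the per-FFN-layer parameter count, assuming BERT-base sizes.
import torch.nn as nn

hidden_size, intermediate_size = 768, 3072  # BERT-base; intermediate_size == 4 * hidden_size

ffn = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),  # W_in + b_in (b_in has intermediate_size entries)
    nn.Linear(intermediate_size, hidden_size),  # W_out + b_out (b_out has hidden_size entries)
)
actual = sum(p.numel() for p in ffn.parameters())

# Expression from l0_module.py: the last term hidden_size * 4 would then be b_in,
# i.e. equal to intermediate_size under this 4x assumption.
formula = hidden_size * intermediate_size * 2 + hidden_size + hidden_size * 4

print(actual, formula, actual == formula)  # 4722432 4722432 True
```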


Third, when initializing the loga params in l0_module, the structured_mlp uses a different mean compared with other components, as shown in the following line: https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L147

It seems the intermediate dimension has an initial sparsity of 0.5, even before any pruning. What is the intuition of setting it this way?
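To illustrate what I mean, here is a rough simulation sketch (not code from the repo) using the standard hard concrete parameterization from Louizos et al., assuming gamma = -0.1, zeta = 1.1, and temperature beta = 2/3: with a loga mean of 0 the expected gate value starts around 0.5, while a large mean keeps the gates essentially fully open.

```python
# Rough simulation sketch (assumed standard hard-concrete parameters, not repo code).
import torch

def sample_hard_concrete(loga, n_samples=100_000, gamma=-0.1, zeta=1.1, beta=2/3):
    """Draw hard-concrete gate samples for one unit with the given log-alpha."""
    u = torch.rand(n_samples).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + loga) / beta)
    s_bar = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)            # clip to [0, 1] -> point masses at 0 and 1

for mean in (0.0, 10.0):
    z = sample_hard_concrete(torch.tensor(mean))
    print(f"loga mean = {mean:>4}: E[z] = {z.mean():.2f}, P(z = 0) = {(z == 0).float().mean():.2f}")

# loga mean =  0.0: E[z] ~ 0.50, P(z = 0) ~ 0.17  -> gates start roughly half open
# loga mean = 10.0: E[z] ~ 1.00, P(z = 0) ~ 0.00  -> gates start fully open
```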

Thank you very much for your time!

xiamengzhou commented 2 years ago

Hi,

Thanks for following our work! To answer your questions:

1) Yes, you are right! It should be if "_z" in key. The current logic creates an empty zs dict. However, this issue only slightly affects layer distillation version 4, where we control the order of the layers. I just fixed it :)

2) Yes, your understanding is correct!

3) We largely follow FLOP for the initialization. As you said, with a mean of 0, a number of the samples drawn from the distribution will already be 0. In practice, this does not affect optimization much, because the L0 loss is able to train loga to meet the sparsity requirement very quickly. We did observe that setting the mean to 0 for the larger units (heads, MLPs) makes the start of pruning unstable, so we set it to a large number to smooth the optimization.

Hope this helps and feel free to ask more questions!

xiamengzhou commented 2 years ago

Hi,

I am closing this issue; feel free to reopen it if you have more questions :)