princeton-nlp / CoFiPruning

[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408

About the diag() and distillation in your paper #25

Closed CaffreyR closed 1 year ago

CaffreyR commented 1 year ago

Hi @xiamengzhou , many thanks for your contribution. I have a couple of small questions about your paper. In the paper you say that

FFN pruning introduces a z_int

And in your paper there is this equation, but what is diag, and why do we have to put z_int into a diagonal matrix? Is diag(z_int) of size d_f × d_f?

[screenshot of the FFN pruning equation from the paper, with the diag(z_int) term]

And you also say that

Coarse-grained and Fine-grained units (§3.1) with a layerwise distillation objective transferring knowledge from unpruned to pruned models (§3.2)

However, distilling intermediate layers during the pruning process is challenging as the model structure changes throughout training. (previous method)

So are we pruning a student model during distillation?

[screenshot of the distillation objective (§3.2) from the paper]

Many thanks!!

xiamengzhou commented 1 year ago

Hi,

Thanks for reaching out!

For your first question, multiplying the representations by diag(z_int) simply multiplies each output dimension of the representations by the corresponding mask entry. We use diag(z_int) purely as matrix notation.
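Here is a minimal PyTorch sketch (not the repository's code; shapes and names are illustrative) showing that multiplying by diag(z_int) is the same as masking each intermediate dimension of gelu(X W_U) before the down-projection W_D:

```python
import torch
import torch.nn.functional as F

d, d_f, n = 8, 32, 4                          # hidden size, FFN size, sequence length
X = torch.randn(n, d)                         # input representations
W_U = torch.randn(d, d_f)                     # up-projection
W_D = torch.randn(d_f, d)                     # down-projection
z_int = torch.randint(0, 2, (d_f,)).float()   # 0/1 mask over intermediate units

# Matrix form used in the paper: gelu(X W_U) · diag(z_int) · W_D
out_matrix = F.gelu(X @ W_U) @ torch.diag(z_int) @ W_D

# Equivalent elementwise form: mask each intermediate dimension, then project down
out_masked = (F.gelu(X @ W_U) * z_int) @ W_D

print(torch.allclose(out_matrix, out_masked))  # True
```

So diag(z_int) is indeed a d_f × d_f diagonal matrix; the diag is only there so the equation type-checks as a matrix product.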

For your second question, yes! CoFi pruning prunes a student model with a distillation objective.
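For intuition, here is a hedged sketch of what a layerwise distillation term can look like. The fixed one-to-one layer matching and the `proj` module are simplifying assumptions for illustration, not CoFi's exact implementation, since in CoFi the student structure changes during pruning:

```python
import torch
import torch.nn as nn

def layerwise_distill_loss(student_hiddens, teacher_hiddens, proj):
    """MSE between projected student hidden states and teacher hidden states.

    student_hiddens / teacher_hiddens: lists of tensors of shape (batch, seq, d)
    proj: nn.Linear mapping the student hidden size to the teacher hidden size
    """
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + nn.functional.mse_loss(proj(h_s), h_t)
    return loss / len(student_hiddens)

# Usage (illustrative): add the distillation term to the task/pruning objective
# total_loss = task_loss + lambda_distill * layerwise_distill_loss(s_hiddens, t_hiddens, proj)
```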

Feel free to reach out again if you have more questions :)

CaffreyR commented 1 year ago

Hi, so diag(z_int) is a diagonal matrix with z_int on its diagonal?

xiamengzhou commented 1 year ago

Yes!

CaffreyR commented 1 year ago

Thanks! So why does it have to be a diagonal matrix? Can a non-diagonal matrix replace it, as long as it represents the corresponding mask?

xiamengzhou commented 1 year ago

Yes, it can! We use diag in the paper so that the notation is mathematically correct.

xiamengzhou commented 1 year ago

Hi, I am closing this issue now! Feel free to reopen it if you have more questions :)