Closed: CaffreyR closed this issue 1 year ago.
z is obtained via the hard-sigmoid function, via (1 - Q(z | θ)), i.e. 1 - cdf_qz(0), which is a sampled score in the forward pass.
Therefore, ŝ is not computed from the inputs but only from the parameters such as loga, etc. (which are updated during training).
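To make this concrete, the L0 paper gives a closed-form expression for the expected number of non-zero gates. Written in that paper's notation (γ, ζ are the stretch limits, β the temperature, log α_j the learnable parameter of gate j, and M the full prunable model size; these names come from the paper, not from CoFi's code), it only involves the gate parameters:

$$
\mathbb{E}\big[\lVert z\rVert_0\big] \;=\; \sum_j \Big(1 - Q(\bar{s}_j \le 0 \mid \log\alpha_j)\Big) \;=\; \sum_j \sigma\!\Big(\log\alpha_j - \beta\,\log\tfrac{-\gamma}{\zeta}\Big),
\qquad
\hat{s} \;=\; 1 - \frac{\mathbb{E}\big[\lVert z\rVert_0\big]}{M}
$$

No inputs or sampled z appear in this expression, which is why the expected sparsity used in the Lagrangian term can be computed from the loga parameters alone.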
Hmm, is that related to the equations in the paper? I mean, the paper does mention ŝ in these two equations. And what is the difference between the cdf_qz and quantile_concrete functions?
Check the paper carefully; the L0 norm is proposed here: Learning Sparse Neural Networks through $L_0$ Regularization (arxiv.org). A detailed description of q(z), and of Q(z), the CDF of q, is derived in that paper. z in CoFi is generated exactly as in that paper; the difference is that CoFi applies structured, grouped parameter-masking strategies.
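For readers comparing the two functions, here is a minimal sketch of what they typically look like, assuming the standard hard-concrete hyper-parameters γ = -0.1, ζ = 1.1, β = 2/3 from the L0 paper. The function names mirror those in l0_module.py, but this is an illustrative re-implementation, not the repo's exact code:

```python
import math
import torch

# Standard hard-concrete hyper-parameters from Louizos et al. (2018) -- assumed defaults
limit_a, limit_b, temperature = -0.1, 1.1, 2.0 / 3.0
epsilon = 1e-6

def cdf_qz(x, loga):
    """Q(s_bar <= x | loga): closed-form CDF of the stretched concrete distribution.
    cdf_qz(0, loga) is the probability a gate is clamped to exactly 0, so
    1 - cdf_qz(0, loga) is the expected L0 contribution of that gate (no noise, no inputs)."""
    xn = (x - limit_a) / (limit_b - limit_a)
    logits = math.log(xn) - math.log(1.0 - xn)
    return torch.sigmoid(logits * temperature - loga).clamp(epsilon, 1.0 - epsilon)

def quantile_concrete(u, loga):
    """Inverse CDF (quantile) of the stretched concrete distribution.
    Maps uniform noise u ~ U(0, 1) to a sample s_bar -- the reparameterization
    used to draw stochastic gates during the forward pass."""
    y = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + loga) / temperature)
    return y * (limit_b - limit_a) + limit_a

def sample_z(loga):
    """Training-time gate: sample s_bar, then hard-sigmoid clamp to [0, 1]."""
    u = torch.rand_like(loga).clamp(epsilon, 1.0 - epsilon)
    return torch.clamp(quantile_concrete(u, loga), 0.0, 1.0)

def expected_sparsity(loga, full_size):
    """s_hat: expected fraction of pruned parameters, computed from loga only."""
    expected_size = (1.0 - cdf_qz(0.0, loga)).sum()  # expected number of surviving gates
    return 1.0 - expected_size / full_size
```

In short, cdf_qz evaluates Q at a fixed point (typically 0) and is what the expected-size / Lagrangian term uses, while quantile_concrete goes the other way and turns uniform noise into an actual gate sample z for the forward pass.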
Thanks @zhangzhenyu13 for the answer!
It's also worth checking out Structured Pruning of Large Language Models, which is the first work to propose adapting L0 regularization to control the sparsity of such models.
Hi! In your code you calculate `Lc` (https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L682), and you use `expected_size` to calculate `expected_sparsity`, but does it match the equation in your paper? https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L267
Actually, you said that

> ŝ is the expected model sparsity calculated from z

but lagrangian_regularization() does not take inputs or z. Many thanks!