[state_of_sparsity] - Knowledge transfer and reconstitution #15

Closed iandanforth closed 5 years ago

iandanforth commented 5 years ago

@sarahooker I really enjoyed the paper. I'd like to engage in a little speculation and I hope you'll indulge me.

Knowledge transfer during iterative sparsification

The lottery ticket result surprises me. I think that, given the same initialization and a sparse mask, you should be able to retrain to an accuracy much closer to the original. However, I speculate that the magnitude pruning method induces a form of knowledge transfer that prevents this.

Because the sparsity-inducing mask changes during the iterative process, you're really dealing with some number of subnets. If they were fully disjoint, you would transfer knowledge using one as the teacher and the other as the student; in the iterative process, you instead get a gradual knowledge transfer. This means that the representations (and the ultimate accuracy of the sparsified network) are no longer a function of the sparse initial weights plus training, but of the full initial weights and the sparsification procedure.

If this is the case, I suspect that if you do a single-step sparsification at the end of training and use that sparse mask along with the same initial weights (lottery-ticket style), you should see much closer accuracies.

(Iterative pruning is still a better way to do pruning of course.)
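
To make the proposed experiment concrete, here's a rough NumPy sketch of computing a one-shot magnitude mask from the final weights and applying it to the saved initial weights before retraining (the function and variable names are just illustrative, not from the paper's code):

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Keep the (1 - sparsity) fraction of largest-magnitude weights; zero the rest."""
    k = int(round(sparsity * weights.size))  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Stand-ins for weights saved before and after a real training run.
rng = np.random.default_rng(0)
initial_weights = rng.standard_normal((256, 256))
final_weights = initial_weights + 0.1 * rng.standard_normal((256, 256))

# One-shot mask computed from the *final* weights...
mask = magnitude_mask(final_weights, sparsity=0.9)

# ...applied to the *initial* weights, lottery-ticket style, before retraining
# the sparse subnetwork from scratch.
lottery_init = initial_weights * mask
```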

Knowledge reconstitution

I'm curious how much work has been done in the area of densifying sparse nets. For example, can you perfectly reverse the accuracy-loss curves by decreasing sparsity and retraining? Does it work better if you do this in one step (go from 90% sparsity to 70% sparsity by initializing a lot of random weights) or iteratively (90 -> 85 -> 80 -> 75 -> 70)?
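
Concretely, here's a toy sketch of what I mean by one-step densification (the function name and the re-initialization scheme are just illustrative guesses, not something from the paper):

```python
import numpy as np

def densify(weights, mask, target_sparsity, init_scale=0.01, seed=0):
    """Revive a random subset of pruned connections with small random weights."""
    rng = np.random.default_rng(seed)
    flat_w, flat_m = weights.ravel().copy(), mask.ravel().copy()
    pruned_idx = np.flatnonzero(flat_m == 0)
    target_active = int(round((1.0 - target_sparsity) * flat_w.size))
    n_revive = max(0, target_active - int(flat_m.sum()))
    revive = rng.choice(pruned_idx, size=n_revive, replace=False)
    flat_m[revive] = 1
    flat_w[revive] = init_scale * rng.standard_normal(n_revive)
    return flat_w.reshape(weights.shape), flat_m.reshape(mask.shape)

# One step: go from 90% to 70% sparsity directly, then retrain...
#   w70, m70 = densify(w90, m90, target_sparsity=0.70)
# ...or iteratively, retraining after every step: 90 -> 85 -> 80 -> 75 -> 70.
```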

Ultimately the question is: do you think a sparse bottleneck + densification + retraining procedure can produce a highly efficient and compressed version of finetuning?

sarahooker commented 5 years ago

Hi Ian,

Wonderful to hear you enjoyed our work! Thanks for these comments; I've put together some thoughts below. I'll tag an owner of this shared research repo to close this issue, but feel free to move this to email if you have additional questions (the author email address for correspondence is listed in our paper).

1) Lottery ticket experiments using one-shot sparsification instead of iterative pruning

I agree, it would be fun to evaluate whether the lottery ticket results hold on these large-scale tasks with “one-shot” sparsification. In fact, one of the variants examined in The Lottery Ticket Hypothesis is whether lottery tickets occur in both one-shot pruned and iteratively pruned networks.

However, for both one-shot and iteratively pruned networks, the authors compare 1) the performance of the sparse substructure trained from scratch (with the same weights as the initial random initialization) to 2) the performance of the original network.

The variant you propose appears to be quite different, because you are comparing the performance of the sparse substructure trained from scratch (again with the same initial random weights) to the one-shot pruned structure at the end of training.

Since both variants would likely perform substantially worse than the original model, it is unclear what information we gain here. That is, you won't be able to tell whether the ability to match accuracy when re-training is a product of your hypothesis or simply a result of the accuracy to match being lower (we suspect it is the latter). It's an interesting question, but I don't see a clear way to disentangle the answer. Still, it's easy to run this variant, and perhaps the results will surprise. :) You can simply run the magnitude pruning for a desired fraction of sparsity once at the end of training (I believe by setting begin_pruning_step and end_pruning_step both equal to one step before the last step of training).

2) Knowledge reconstitution

Hmmm, this I know less about. I believe Erich Elsen, one of my co-authors, worked on a project related to this idea called dense-sparse-dense.

Hope these answers are somewhat helpful. Thanks again Ian for taking the time to put together these thoughts.

ekelsen commented 5 years ago

I think Sara meant to say that you should set begin_pruning_step = final_step - 1 and end_pruning_step = final_step to mimic zero-shot pruning. You'll also need to set the threshold_decay parameter to 0; otherwise the threshold won't immediately jump to the value needed to reach the sparsity level you want.
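
Concretely, the pruning hparams would look something like the sketch below (the sparsity target and total step count are placeholders, and the exact way the hparam string is wired into each training script may differ):

```python
# Sketch of a model_pruning-style hparam string for one-shot pruning at the
# very end of training; final_step and target_sparsity are placeholders.
final_step = 100000

pruning_hparams = ",".join([
    "begin_pruning_step=%d" % (final_step - 1),  # start pruning just before the end
    "end_pruning_step=%d" % final_step,          # ...and stop at the final step
    "threshold_decay=0.0",                       # let the threshold jump immediately
    "target_sparsity=0.9",                       # desired final sparsity (placeholder)
])
print(pruning_hparams)
```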

Based on previous experience I've had with zero-shot pruning (see, for example, the last line of Table 4 in https://arxiv.org/pdf/1704.05119.pdf, where the error rate more than doubles at 90% pruning), I would guess that zero-shot pruning will actually lead to worse accuracies than random fixed sparsity patterns trained from scratch. If you try this, I would love to know the results.