rahulvigneswaran / Lottery-Ticket-Hypothesis-in-Pytorch

This repository contains a PyTorch implementation of the paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" by Jonathan Frankle and Michael Carbin that can be easily adapted to any model/dataset.

Fails in finding tickets if a CNN is used #3

Open paintception opened 4 years ago

paintception commented 4 years ago

Hi, thanks for your nice repo!

Have you tested whether the CNNs are able to find winning tickets on CIFAR-10 and CIFAR-100?

I ran multiple experiments with most of the convolutional architectures you provide, but I'm only able to find "winning tickets" when using an MLP on the MNIST dataset. When a CNN is used (no matter which one), the experiments of the original paper, e.g. on CIFAR-10, cannot be reproduced.

Any idea why this is happening?

ffeng1996 commented 4 years ago

Same question:)

rahulvigneswaran commented 4 years ago

@paintception @ffeng1996 Sorry for the delayed response. The following are the winning-ticket plots for LeNet-5 on MNIST, FashionMNIST, CIFAR-10, and CIFAR-100:

[Plots: combined_lenet5_mnist, combined_lenet5_fashionmnist, combined_lenet5_cifar10, combined_lenet5_cifar100]

Thanks for pointing it out! For some reason, at specific weight percentages, the winning tickets are not being generated. Let me take a look and get back to you soon.

ffeng1996 commented 4 years ago

Thanks for replying. However, when I run the code for AlexNet, DenseNet-121, ResNet-18, and VGG-16, the pruning method cannot find the winning tickets.

Thanks.

ZhangXiao96 commented 4 years ago

Actually, for large models/datasets you may need some tricks, such as learning rate warmup or "late resetting". You can find the details in some papers [1, 2]; a rough sketch of both tricks follows the references. Hope this helps!

[1] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL http://arxiv.org/abs/1803.03635.

[2] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. The lottery ticket hypothesis at scale. March 2019. URL http://arxiv.org/abs/1903.01611.
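For concreteness, here is a minimal sketch of the two tricks, not taken from this repo: a linear learning-rate warmup plus "late resetting", i.e. rewinding the surviving weights to a snapshot from epoch k > 0 instead of to the original initialization. The toy model, hyperparameters, and random data below are illustrative assumptions only.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Linear learning-rate warmup over the first `warmup_steps` optimizer steps.
warmup_steps = 100
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

rewind_epoch, num_epochs = 2, 10  # "late resetting": snapshot at epoch k > 0
rewind_state = None

for epoch in range(num_epochs):
    for _ in range(50):  # dummy training steps on random data
        x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        warmup.step()
    if epoch + 1 == rewind_epoch:
        # Save the epoch-k weights instead of the initialization W_0.
        rewind_state = copy.deepcopy(model.state_dict())

# After computing the pruning mask, reset the surviving weights to the
# epoch-k snapshot (late resetting) rather than to their initial values,
# then apply the mask and retrain as usual.
model.load_state_dict(rewind_state)
```

Rewinding to an early-training checkpoint rather than to the initialization is roughly what [2] describes as "late resetting" for larger networks.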

By the way, nice repo~

ffeng1996 commented 4 years ago

Thanks!

rahulvigneswaran commented 4 years ago

@ZhangXiao96 Thanks for the direction. Sorry that I am not able to reply promptly. I am busy with a few submissions. I will get back within a few days with a solution.

JKDomoguen commented 4 years ago

Hi, very much appreciate your work.

A few clarifications: I've noticed in your code, particularly in the "prune_by_percentile" function (lines 269-292) of main.py, that you don't seem to implement global pruning* for deeper networks, e.g. VGG-19 and ResNet-18, but instead prune each layer at the same rate (i.e. 10%), which was designed for the fully connected layers applied to MNIST; see the comparison sketch after the footnote below. Thank you for your time.

*global pruning is specifically discussed in Chapter 4 of the original paper.
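For reference, here is a small comparison sketch of the two strategies using torch.nn.utils.prune rather than the repo's own prune_by_percentile; the toy model, layer sizes, and 20% rate are illustrative assumptions. Layer-wise pruning removes the same fraction from every layer, while global pruning ranks all weights together, so the resulting per-layer sparsities differ.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a deeper convnet; sizes are illustrative only.
base = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                     nn.Conv2d(8, 64, 3), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

def prunable(model):
    return [(m, "weight") for m in model if isinstance(m, (nn.Conv2d, nn.Linear))]

def report(model, title):
    print(title)
    for m, _ in prunable(model):
        sparsity = float((m.weight == 0).sum()) / m.weight.numel()
        print(f"  {type(m).__name__:8s} sparsity = {sparsity:.1%}")

# Layer-wise pruning: every layer loses the same fraction of its weights.
layerwise = copy.deepcopy(base)
for m, name in prunable(layerwise):
    prune.l1_unstructured(m, name=name, amount=0.2)
report(layerwise, "layer-wise pruning")

# Global pruning: all weights are ranked together, so small layers are not
# forced to give up the same fraction as large ones.
global_model = copy.deepcopy(base)
prune.global_unstructured(prunable(global_model),
                          pruning_method=prune.L1Unstructured, amount=0.2)
report(global_model, "global pruning")
```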

ZhangXiao96 commented 4 years ago

> Hi, very much appreciate your work.
>
> A few clarifications: I've noticed in your code, particularly in the "prune_by_percentile" function (lines 269-292) of main.py, that you don't seem to implement global pruning* for deeper networks, e.g. VGG-19 and ResNet-18, but instead prune each layer at the same rate (i.e. 10%), which was designed for the fully connected layers applied to MNIST. Thank you for your time.
>
> *global pruning is specifically discussed in Chapter 4 of the original paper.

Hello, I think layer-wise pruning is not especially designed for fully connected layers (see [1]). Note that we can keep a DNN functionally equivalent by multiplying the weights of one layer by a number x and the weights of another layer by 1/x; however, this operation changes the result of global pruning, hence I believe layer-wise pruning may be more reasonable (a quick numerical check follows the reference). Just my opinion~

[1] Frankle et al., Stabilizing the Lottery Ticket Hypothesis
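Here is a quick numerical check of that rescaling argument, as a toy sketch not taken from the repo; the two-layer MLP and the scale factor c are arbitrary assumptions. For a ReLU network, multiplying one layer's weights and biases by c > 0 and the next layer's weights by 1/c leaves the outputs unchanged, yet it changes which weights global magnitude pruning keeps.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))
x = torch.randn(4, 10)

# Rescale: layer-1 weights/bias times c, layer-2 weights times 1/c.
# ReLU is positively homogeneous, so the network output is unchanged.
scaled = copy.deepcopy(net)
c = 10.0
with torch.no_grad():
    scaled[0].weight.mul_(c)
    scaled[0].bias.mul_(c)
    scaled[2].weight.mul_(1.0 / c)

print(torch.allclose(net(x), scaled(x), atol=1e-5))  # True: functionally equivalent

# But global magnitude pruning now ranks layer-1 weights far above layer-2
# weights, so the surviving mask changes.
def global_mask(model, amount=0.5):
    params = [(model[0], "weight"), (model[2], "weight")]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                              amount=amount)
    return torch.cat([model[0].weight_mask.flatten(),
                      model[2].weight_mask.flatten()])

m_orig = global_mask(copy.deepcopy(net))
m_scaled = global_mask(copy.deepcopy(scaled))
print((m_orig != m_scaled).float().mean())  # fraction of mask entries that changed
```

Layer-wise pruning is invariant to this rescaling because each layer is ranked only against itself.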

jfrankle commented 4 years ago

In my experience, global pruning works best on deeper convnets. Despite the fact that layers could theoretically rescale in inconvenient ways, that doesn't seem to happen in practice. Meanwhile, the layer sizes are so different in deep networks that layerwise pruning will delete small layers while leaving many extra parameters in big layers.

ZhangXiao96 commented 4 years ago

> In my experience, global pruning works best on deeper convnets. Despite the fact that layers could theoretically rescale in inconvenient ways, that doesn't seem to happen in practice. Meanwhile, the layer sizes are so different in deep networks that layerwise pruning will delete small layers while leaving many extra parameters in big layers.

Thanks for your suggestions!

jamesoneill12 commented 4 years ago

[Plot: combined_resnet18_cifar10]

This is what I got for CIFAR-10 using ResNet-18 with a batch size of 256, pruning 20% per iteration for 20 pruning iterations, with 60 epochs of training each. My conclusion, based on other work, is that to stabilize model compression you have to factor in the optimizer and architecture you are using, how you distribute the pruning percentage (or, more generally, that of whatever compression method you are using) per retraining step, how many retraining steps you use, the capacity of the base model, the number of samples in the dataset, and the input-output dimensionality ratio... basically everything :)