gstoica27 / ZipIt

A framework for merging models solving different tasks with different initializations into one multi-task model without any additional training
MIT License

Low accuracy of merged model #13

Open lejelly opened 1 year ago

lejelly commented 1 year ago

Hi, Thanks for your great work!

I attempted to reproduce the results in Table 1(b) of your paper.

[Screenshot: Table 1(b) from the paper]

From my understanding, this experiment involves the following steps:

  1. Split CIFAR100 into two disjoint sets of 50 classes each.
  2. Train two ResNet20 (width x8) models separately, one per subset.
  3. Merge the two models and evaluate their classification accuracy.

I believe Table 1(b) reports the mean and standard deviation over four repetitions of steps 1-3.
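For reference, my understanding of step 1 looks roughly like the sketch below (this is my own illustration, not the repo's data code; the seed, transforms, and batch size are arbitrary choices):

```python
import numpy as np
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the repo's augmentations will differ
full_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)

rng = np.random.default_rng(0)          # assumed seed
perm = rng.permutation(100)
task_classes = [perm[:50], perm[50:]]   # two disjoint 50-class tasks

targets = np.array(full_train.targets)
task_loaders = []
for cls in task_classes:
    idx = np.where(np.isin(targets, cls))[0]
    task_loaders.append(DataLoader(Subset(full_train, idx), batch_size=128, shuffle=True))
# One ResNet20 (width x8) model is then trained on each of the two task_loaders (step 2).
```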

In an effort to replicate this, I executed the code with minimal modifications. The modifications I made were:

As a result, the accuracy after merging decreased regardless of the merging method. Are there any hyperparameters that need to be taken into account? (The parameters a and b are at their default settings.)

[Screenshot: merged-model accuracies from my runs]

Thanks again for sharing this wonderful research.

lejelly commented 1 year ago

https://github.com/gstoica27/ZipIt/issues/11#issuecomment-1601626339

This question appears to be similar to mine.

shoroi commented 1 year ago

Hey @gstoica27, congrats on the great work and thank you for sharing this code base with us! I'm looking forward to using this repo to advance my own research.

Unfortunately I'm running into the same problem as the one stated above by @lejelly. I tried reproducing the results from Table 1(b) with no success. The base models perform similarly to @lejelly's, but simple model averaging, permutation, and ZipIt! all seem to fail quite drastically (see the image below, where I average the accuracies of 4 runs).

[Image: average accuracies over 4 runs]
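To be clear about what I mean by simple model averaging, it is just the element-wise average of the two checkpoints' parameters, roughly like this (a sketch of my understanding, not the repo's implementation):

```python
import copy
import torch

def average_state_dicts(model_a, model_b):
    """Element-wise average of two models with identical architectures."""
    avg_state = copy.deepcopy(model_a.state_dict())
    state_b = model_b.state_dict()
    for key in avg_state:
        if torch.is_floating_point(avg_state[key]):
            avg_state[key] = (avg_state[key] + state_b[key]) / 2.0
        # integer buffers (e.g. BatchNorm's num_batches_tracked) are kept from model_a
    return avg_state

# merged_model.load_state_dict(average_state_dicts(model_a, model_b))
```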

In addition to the modifications noted by @lejelly, I believe I also:

I also slightly changed the file names in the evaluation scripts, since I was getting an error with the original code (the evaluation scripts were looking for the checkpoints in the wrong, non-existent directory). This change is minor, however, and shouldn't affect the rest of the code.

@lejelly have you figured out what the problem was by any chance? If not, @gstoica27 do you have any idea as to what might be happening?

FYI, I have also tried running the training and evaluation for a ResNet20x16 with logits, and I get the following results, which do seem to make more sense:

[Image: ResNet20x16 results with logits]

Since the code runs without any errors or warnings and the logits version seems to work as expected, could it be a problem with the environment, specifically the clip package? This is just a hypothesis, but I did run into some issues when trying to re-create your coding environment, so it's possible I have slightly different versions of some libraries. Is it just me, or did you also run into some problems with that part, @lejelly?
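For reference, my understanding of the CLIP variant is that the classifier head is built from CLIP text embeddings of the class names instead of a trained logit layer, roughly like the sketch below using OpenAI's clip package (the prompt template and normalization are assumptions on my part, not necessarily what this repo does):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

class_names = ["apple", "aquarium_fish", "baby"]  # ... the 50 class names of one task
prompts = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)

with torch.no_grad():
    text_features = clip_model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The image features produced by the ResNet backbone are then compared against
# text_features to produce per-class scores instead of using a trained logit layer.
```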

Thanks a lot for your work, time and consideration!

lejelly commented 1 year ago

Hi @shoroi !

Set "use_clip = True" (line 27 of cifar_resnet_training.py)

Yes, I also did it.

> @lejelly have you figured out what the problem was by any chance? If not, @gstoica27 do you have any idea as to what might be happening?

I'm sorry but I have no idea.

> Is it just me, or did you also run into some problems with that part, @lejelly?

I don't remember clearly, but I think I encountered an error during the environment setup. However, since I was able to run it without any errors, I didn't pay much attention to it. (I apologize if I'm mistaken).

shoroi commented 1 year ago

Hey @lejelly thanks a lot for the quick reply! Have you tried running the same experiments but with logits instead of CLIP? If so, do you get similar results to the ones I posted above?

Thanks a lot, cheers!

lejelly commented 1 year ago

@shoroi

> Have you tried running the same experiments but with logits instead of CLIP? If so, do you get similar results to the ones I posted above?

I have never tried that. I'm sorry, but I have already given up on reproducing this experiment myself, so please ask @gstoica27.

gstoica27 commented 1 year ago

Hi All,

I'm so sorry for this, and for the length of time I've gone without addressing it. Could you try re-running the experiment with an LR of 0.4 and for 200 epochs?

So specifically, change line 340 in utils.py to:

optimizer = SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)

And set the total number of epochs for training to be 200?

I just verified that this hyperparameter setup yields models that replicate our reported experiments. I'm so sorry again; we had changed the SGD LR parameters and training epochs for internal experimentation and never changed them back.
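Concretely, the training setup should end up roughly like this (the loop body below is only schematic; the actual loss and data loading live in the training scripts):

```python
import torch.nn as nn
from torch.optim import SGD

EPOCHS = 200  # total training epochs

optimizer = SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()  # schematic; the head/loss depends on the CLIP vs. logits setting

for epoch in range(EPOCHS):
    for images, labels in train_loader:  # one 50-class CIFAR100 subset
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```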

Please let me know if this works. I'll push an update to the repo with this as well.

shoroi commented 1 year ago

Hey @gstoica27,

Thanks a lot for the response! I re-ran the experiments with the suggested modifications to the learning rate and number of epochs, and here are my results:

[Image: results with lr=0.4 and 200 epochs]

This makes a lot more sense and seems comparable to the results from the paper! The only notable difference is that my ZipIt! 20/20 results are significantly better than the ones from the paper and closer to the reported ZipIt! 13/20 numbers. Is the ZipIt! evaluation script set to use the multi-head approach by default? That shouldn't be the case, since the ZipIt! .csv file says the stop node is 21. Could another hyperparameter explain this difference, such as the same-model merges budget, which differs between the reported experiments and my own runs with the default code? If not, then perhaps my random seeds were just easier to merge, a hypothesis supported by the fact that the Permute baseline also performs better here than reported in the paper.

A couple of notes before officially closing down this issue:

Again, thanks a lot for the great work and code and for your answer to this issue!

shoroi commented 1 year ago

For anyone that might be interested, here's an update to my previous comment.

I spent some more time looking through the code and realized that the ResNet20 "graph" has many more than 21 nodes. Therefore, stop_node in the ZipIt! hyperparameter-search and evaluation scripts should be set to None to get the full ZipIt! merging. I re-ran those scripts with stop_node=None and used the hyperparameters (the alpha and beta parameters from the paper) with the best joint accuracy from the hyperparameter search to run the evaluation (alpha=0.5 and beta=1 in my case). Here are the updated results:

[Image: updated results with stop_node=None]

My results are now comparable to those reported in the paper (except for the Permute baseline, which is slightly better in my case, but that could just be due to the initialization of these networks).
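In pseudocode, the selection I describe above is just a small grid over (alpha, beta) followed by picking the pair with the best joint accuracy (evaluate_joint_accuracy below is a hypothetical stand-in for the repo's hyperparameter-search and evaluation scripts, and the candidate grids are only examples):

```python
import itertools

alphas = [0.0, 0.3, 0.5, 1.0]  # example candidates; the actual search grid may differ
betas = [0.0, 0.5, 1.0]

best = None
for alpha, beta in itertools.product(alphas, betas):
    # evaluate_joint_accuracy is a hypothetical helper standing in for the
    # repo's zipit hyperparameter-search / evaluation scripts.
    acc = evaluate_joint_accuracy(model_a, model_b, alpha=alpha, beta=beta, stop_node=None)
    if best is None or acc > best[0]:
        best = (acc, alpha, beta)

print(f"Best joint accuracy {best[0]:.2f} at alpha={best[1]}, beta={best[2]}")
```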

jujulili888 commented 7 months ago

Hey @shoroi,

May I ask what parameters you used in the evaluation stage when you ran the training and evaluation for a ResNet20x16 with logits? I still fail to reproduce the logits version on both the ResNet20-CIFAR5 and ResNet20-CIFAR50 experiments. Are they the default settings, like the following:

[Image: default evaluation settings]