gstoica27 / ZipIt

A framework for merging models solving different tasks with different initializations into one multi-task model without any additional training
MIT License

Low accuracy of merged model #13

Open lejelly opened 1 year ago

lejelly commented 1 year ago

Hi, Thanks for your great work!

I attempted to reproduce the results in Table 1(b) of your paper.

[Screenshot: Table 1(b) from the paper]

From my understanding, this experiment involves the following steps:

  1. Split CIFAR100 into two disjoint sets of 50 classes each.
  2. Train two ResNet20 (width x8) models separately, one per subset.
  3. Merge the two models and evaluate their classification accuracy.

I believe Table 1(b) reports the mean and standard deviation over four repetitions of steps 1-3.
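For reference, my understanding of step 1 looks roughly like the sketch below (this is my own illustration, not the repo's data code; the seed, transforms, and batch size are arbitrary choices):

```python
import numpy as np
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the repo's augmentations will differ
full_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)

rng = np.random.default_rng(0)          # assumed seed
perm = rng.permutation(100)
task_classes = [perm[:50], perm[50:]]   # two disjoint 50-class tasks

targets = np.array(full_train.targets)
task_loaders = []
for cls in task_classes:
    idx = np.where(np.isin(targets, cls))[0]
    task_loaders.append(DataLoader(Subset(full_train, idx), batch_size=128, shuffle=True))
# One ResNet20 (width x8) model is then trained on each of the two task_loaders (step 2).
```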

In an effort to replicate this, I executed the code with minimal modifications. The modifications I made were:

As a result, the accuracy after merging decreased regardless of the merging method. Are there any hyperparameters that need to be taken into account? (The parameters a and b are at their default settings.)

[Screenshot: merged-model accuracies from my runs]

Thanks again for sharing this wonderful research.

lejelly commented 1 year ago

https://github.com/gstoica27/ZipIt/issues/11#issuecomment-1601626339

This question appears to be similar to mine.

shoroi commented 1 year ago

Hey @gstoica27, congrats on the great work and thank you for sharing this code base with us! I'm looking forward to using this repo to advance my own research.

Unfortunately I'm running into the same problem as the one stated above by @lejelly. I tried reproducing the results from Table 1(b) with no success. The base models perform similarly to @lejelly's, but simple model averaging, permutation, and ZipIt! all seem to fail quite drastically (see the image below, where I average the accuracies of 4 runs).

[Image: average accuracies over 4 runs]
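To be clear about what I mean by simple model averaging, it is just the element-wise average of the two checkpoints' parameters, roughly like this (a sketch of my understanding, not the repo's implementation):

```python
import copy
import torch

def average_state_dicts(model_a, model_b):
    """Element-wise average of two models with identical architectures."""
    avg_state = copy.deepcopy(model_a.state_dict())
    state_b = model_b.state_dict()
    for key in avg_state:
        if torch.is_floating_point(avg_state[key]):
            avg_state[key] = (avg_state[key] + state_b[key]) / 2.0
        # integer buffers (e.g. BatchNorm's num_batches_tracked) are kept from model_a
    return avg_state

# merged_model.load_state_dict(average_state_dicts(model_a, model_b))
```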

In addition to the modifications noted by @lejelly, I believe I also:

I also slightly changed the file names in the evaluation scripts, since I was getting an error with the original code (the evaluation scripts were looking for the checkpoints in the wrong, non-existent directory). This change is minor, however, and shouldn't affect the rest of the code.

@lejelly have you figured out what the problem was by any chance? If not, @gstoica27 do you have any idea as to what might be happening?

FYI, I have also tried running the training and evaluation for a ResNet20x16 with logits, and I get the following results, which do seem to make more sense:

[Image: ResNet20x16 results with logits]

Since the code runs without any errors or warnings and the logits version seems to work as expected, could it be a problem with the environment, specifically the clip package? This is just a hypothesis, but I did run into some issues when trying to re-create your coding environment, so it's possible I have slightly different versions of some libraries. Is it just me, or did you also run into some problems with that part, @lejelly?
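For reference, my understanding of the CLIP variant is that the classifier head is built from CLIP text embeddings of the class names instead of a trained logit layer, roughly like the sketch below using OpenAI's clip package (the prompt template and normalization are assumptions on my part, not necessarily what this repo does):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

class_names = ["apple", "aquarium_fish", "baby"]  # ... the 50 class names of one task
prompts = clip.tokenize([f"a photo of a {name}" for name in class_names]).to(device)

with torch.no_grad():
    text_features = clip_model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The image features produced by the ResNet backbone are then compared against
# text_features to produce per-class scores instead of using a trained logit layer.
```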

Thanks a lot for your work, time and consideration!

lejelly commented 1 year ago

Hi @shoroi !

Set "use_clip = True" (line 27 of cifar_resnet_training.py)

Yes, I also did it.

> @lejelly have you figured out what the problem was by any chance? If not, @gstoica27 do you have any idea as to what might be happening?

I'm sorry but I have no idea.

> Is it just me, or did you also run into some problems with that part, @lejelly?

I don't remember clearly, but I think I encountered an error during the environment setup. However, since I was able to run it without any errors, I didn't pay much attention to it. (I apologize if I'm mistaken).

shoroi commented 1 year ago

Hey @lejelly thanks a lot for the quick reply! Have you tried running the same experiments but with logits instead of CLIP? If so, do you get similar results to the ones I posted above?

Thanks a lot, cheers!

lejelly commented 1 year ago

@shoroi

> Have you tried running the same experiments but with logits instead of CLIP? If so, do you get similar results to the ones I posted above?

I have never tried that. I'm sorry, but I have already given up on reproducing this experiment myself, so please ask @gstoica27.

gstoica27 commented 1 year ago

Hi All,

I'm so sorry for this, and for the length of time I've gone without addressing it. Could you try re-running the experiment with an LR of 0.4 and for 200 epochs?

So specifically, change line 340 in utils.py to:

optimizer = SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)

And set the total number of epochs for training to be 200?

I just verified that this hyperparameter setup yields models that replicate our reported experiments. I'm so sorry again; we had changed the SGD LR parameters and training epochs for internal experimentation and never changed them back.
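Concretely, the training setup should end up roughly like this (the loop body below is only schematic; the actual loss and data loading live in the training scripts):

```python
import torch.nn as nn
from torch.optim import SGD

EPOCHS = 200  # total training epochs

optimizer = SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()  # schematic; the head/loss depends on the CLIP vs. logits setting

for epoch in range(EPOCHS):
    for images, labels in train_loader:  # one 50-class CIFAR100 subset
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```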

Please let me know if this works. I'll push an update to the repo with this as well.

shoroi commented 1 year ago

Hey @gstoica27,

Thanks a lot for the response! I re-ran the experiments with the suggested modifications to the learning rate and number of epochs, and here are my results:

[Image: results with lr=0.4 and 200 epochs]

This makes a lot more sense and seems comparable to the results from the paper! The only notable difference is that my ZipIt! 20/20 results are significantly better than the ones from the paper and closer to the reported ZipIt! 13/20 numbers. Is the ZipIt! evaluation script set to use the multi-head approach by default? That shouldn't be the case, since the ZipIt! .csv file says the stop node is 21. Could another hyperparameter explain this difference, such as the same-model merges budget, which differs between the reported experiments and my own runs with the default code? If not, then perhaps my random seeds were just easier to merge, a hypothesis supported by the fact that the Permute baseline also performs better here than reported in the paper.

A couple of notes before officially closing down this issue:

Again, thanks a lot for the great work and code and for your answer to this issue!

shoroi commented 1 year ago

For anyone that might be interested, here's an update to my previous comment.

I spent some more time looking through the code and realized that the ResNet20 "graph" has many more than 21 nodes. Therefore, stop_node in the ZipIt! hyperparameter-search and evaluation scripts should be set to None to get the full ZipIt! merging. I re-ran those scripts with stop_node=None and used the hyperparameters (the alpha and beta parameters from the paper) with the best joint accuracy from the hyperparameter search to run the evaluation (alpha=0.5 and beta=1 in my case). Here are the updated results:

[Image: updated results with stop_node=None]

My results are now comparable to those reported in the paper (except for the Permute baseline, which is slightly better in my case, but that could just be due to the initialization of these networks).
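In pseudocode, the selection I describe above is just a small grid over (alpha, beta) followed by picking the pair with the best joint accuracy (evaluate_joint_accuracy below is a hypothetical stand-in for the repo's hyperparameter-search and evaluation scripts, and the candidate grids are only examples):

```python
import itertools

alphas = [0.0, 0.3, 0.5, 1.0]  # example candidates; the actual search grid may differ
betas = [0.0, 0.5, 1.0]

best = None
for alpha, beta in itertools.product(alphas, betas):
    # evaluate_joint_accuracy is a hypothetical helper standing in for the
    # repo's zipit hyperparameter-search / evaluation scripts.
    acc = evaluate_joint_accuracy(model_a, model_b, alpha=alpha, beta=beta, stop_node=None)
    if best is None or acc > best[0]:
        best = (acc, alpha, beta)

print(f"Best joint accuracy {best[0]:.2f} at alpha={best[1]}, beta={best[2]}")
```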

jujulili888 commented 7 months ago

Hey @shoroi,

May I ask what parameters you used in the evaluation stage when you ran the training and evaluation for a ResNet20x16 with logits? I still fail to reproduce the logits version on both the ResNet20-CIFAR5 and ResNet20-CIFAR50 experiments. Are they the default settings, like the following:

[Image: default evaluation settings]