keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

[Meta] discussion about weight management: should we allow ported weights #488

Closed LukeWood closed 2 years ago

LukeWood commented 2 years ago

**Original post by @sayakpaul**

I agree that rescaling to [0, 1] is way simpler and easier to do, but I believe a significant number of models could be supported off-the-shelf with this consideration.

Expanding more on this comment and also summarizing what we (@LukeWood and I) discussed offline.

While supporting a model with a training script to reach SoTA numbers is a great feature to have, I believe it also introduces a significant amount of friction and redundancy. Let me explain.

When the pre-trained parameters of a model are available officially but not in the expected format, I think it makes sense to just port those parameters so that they can be loaded into the Keras implementation. Refer to this repository as an example: https://github.com/sayakpaul/keras-convnext-conversion/. As far as I know, this strategy was followed for a number of models under keras.applications: ResNet-RS, EfficientNetV2, and ConvNeXt, for example. This strategy also allows us to seamlessly convert pre-trained checkpoints for bigger datasets like ImageNet-21k. Repeating the pre-training on such datasets would again be time-consuming and repetitive work, given that the official parameters are available.
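For readers unfamiliar with what such a port looks like in practice, here is a minimal, hypothetical sketch (toy layer names and shapes, not the actual ConvNeXt conversion): the bulk of the work is mapping parameter names and transposing tensors into the Keras layouts.

```python
# Hypothetical sketch of porting a PyTorch-style checkpoint into a Keras model.
# In a real port, `source` would come from the official checkpoint; the names
# and shapes here are toy examples only.
import numpy as np
import tensorflow as tf

source = {
    "stem.conv.weight": np.random.randn(16, 3, 3, 3).astype("float32"),  # (out, in, kh, kw)
    "stem.conv.bias": np.zeros(16, dtype="float32"),
    "head.fc.weight": np.random.randn(10, 16).astype("float32"),         # (out, in)
    "head.fc.bias": np.zeros(10, dtype="float32"),
}

# Keras re-implementation of the same (toy) architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", name="stem_conv", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, name="head_fc"),
])

# Transpose into Keras layouts: conv (kh, kw, in, out), dense (in, out).
model.get_layer("stem_conv").set_weights([
    source["stem.conv.weight"].transpose(2, 3, 1, 0),
    source["stem.conv.bias"],
])
model.get_layer("head_fc").set_weights([
    source["head.fc.weight"].transpose(1, 0),
    source["head.fc.bias"],
])
```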

Furthermore, nowadays, researchers have started pre-training with self-supervision and semi-supervision, and they are often able to surpass what's possible with standard supervised pre-training. If we allow the addition of models populated with pre-trained official parameters, then we can factor this in too. Otherwise, figuring out the nitty-gritty of a particular pre-training technique can be quite challenging. Of course, having the pre-training script (ensuring implementation correctness) should still be welcomed.

IMO, if the models exported in this way are able to match the metrics reported in the official sources (official repositories, corresponding publications, etc.) and we are able to get sufficiently decent performance on downstream tasks, it should suffice for validating the implementation. It also helps the community experiment with these models faster.

Another point to consider is that not all models are trained using one end-all recipe.

Originally posted by @sayakpaul in https://github.com/keras-team/keras-cv/issues/476#issuecomment-1151219108

bhack commented 2 years ago

If we are going in a direction where foundation-like models, or their distilled/pruned versions, are going to be few-shot fine-tuned on the user's task/dataset, it is more important that we have scripts to few-shot/fine-tune and test these on another dataset.

The problem with converting the weights is that it is often quite a collection of hacks. Do you think we could organize some utils to support this activity?

sayakpaul commented 2 years ago

If we are going in a direction where foundation-like models, or their distilled/pruned versions, are going to be few-shot fine-tuned on the user's task/dataset, it is more important that we have scripts to few-shot/fine-tune and test these on another dataset.

It's a bit unclear. Could you elaborate?

The problem with converting the weights is that it is often quite a collection of hacks. Do you think we could organize some utils to support this activity?

Hugging Face has a pipeline in place. See https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_tf_pytorch_utils.py. But it also requires users to write the models in Keras in a certain way so that the PyTorch params can be loaded into them (writing components as a tf.keras.layers.Layer instead of a tf.keras.Model).
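A generic utility in that spirit could look roughly like the sketch below; the renaming rules are hypothetical and would need to be adapted per architecture, but the core idea is name translation plus layout transposes (this is not the transformers implementation).

```python
import numpy as np

def port_by_name(keras_model, source_state_dict):
    """Assign PyTorch-style parameters to Keras variables by translating names (toy rules)."""
    for variable in keras_model.weights:
        # e.g. "block1/conv/kernel:0" -> "block1.conv.weight"
        name = variable.name.split(":")[0].replace("/", ".").replace("kernel", "weight")
        value = np.asarray(source_state_dict[name])
        if name.endswith("weight") and value.ndim == 4:
            value = value.transpose(2, 3, 1, 0)   # conv: (out, in, kh, kw) -> (kh, kw, in, out)
        elif name.endswith("weight") and value.ndim == 2:
            value = value.transpose(1, 0)         # dense: (out, in) -> (in, out)
        variable.assign(value)
```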

ariG23498 commented 2 years ago

I feel like this is a very important feature to welcome.

Some really relatable pointers:

sebastian-sz commented 2 years ago

It is hard to pick a side.

On one hand, requiring the weights to be reproduced is direct proof that the process can be reproduced using keras_cv components. It also helps combat the reproducibility crisis: there is no situation where the weights were trained by someone, somewhere, and exist only on e.g. Google Drive. We are also free to implement our own, standardized preprocessing, which is not the case when we use someone else's weights.

On the other hand, as others have mentioned, reproducing the weights is often quite a challenge. It is often easier to simply create (and maintain) a weight conversion script rather than implement and reuse all the components used for training the models. This also makes the process much quicker and lower cost, an important factor, especially for larger image models and for contributors who do not have access to cloud resources.

My guess is that it's a tradeoff: more models but less reproducible/customizable or fewer models but more reproducible and customizable.
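To illustrate the "standardized preprocessing" point above: when a model is trained from scratch, the rescaling can be baked into the architecture so that every backbone shares the same input contract. A minimal sketch (not an agreed keras_cv convention, just an illustration):

```python
import tensorflow as tf

# Standardized contract: the model takes raw pixels in [0, 255] and rescales
# internally, instead of inheriting the source repo's preprocessing.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Rescaling(1.0 / 255)(inputs)          # [0, 255] -> [0, 1]
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)   # toy backbone stand-in
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1000)(x)
model = tf.keras.Model(inputs, outputs)
```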

bhack commented 2 years ago

If we are going in a direction where foundation-like models, or their distilled/pruned versions, are going to be few-shot fine-tuned on the user's task/dataset, it is more important that we have scripts to few-shot/fine-tune and test these on another dataset.

It's a bit unclear. Could you elaborate?

What I meant here is that if we are going in a direction of having quite general models that are too large to be trained from scratch, even with a GCP or GKE CI training job, it is more important that we have working scripts to few-shot/fine-tune these models on another dataset.

As we cannot know what the users' datasets are, we could just have tests and scripts for a downstream proxy task/dataset.

If we consider Florence:

Florence is designed to be effectively adapted in the open world via few-shot and zero-shot transfer learning, with the ability of efficient deployment by extra training with few epochs (e.g. in retrieval). Our model can be customized for various domains that application developers can use.

https://www.microsoft.com/en-us/research/project/project-florence-vl/

Efficiency: In terms of efficiency, we have developed MiniVLM and DistillVLM that aim to distill knowledge from a large teacher model for model compression. Further, in our recent VL tickets paper, we also investigate the parameter redundancy of these large-scale models via the lens of the lottery ticket hypothesis.

Or the follow-up, UniCL, at CVPR 2022.

Can we have reproducible few-shot CI jobs on these models?

Hugging Face has a pipeline in place. See https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_tf_pytorch_utils.py. But it also requires users to write the models in Keras in a certain way so that the PyTorch params can be loaded into them (writing components as a tf.keras.layers.Layer instead of a tf.keras.Model).

Yes, the point here is: can we do something similar or better here so that not every user has to write their own sparse hacks from scratch?

Probably these are not the fairest stats, but honestly there is a high probability that we are going to port official/reference weights or implementations from another framework to contribute an e2e network arch here.

sayakpaul commented 2 years ago

Can we have reproducible few-shot CI jobs on these models?

Yeah, why not. Setting up a CI/CD pipeline for getting downstream performance should be fairly straightforward, as we know how to do that in separate scripts or notebooks.
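For concreteness, the kind of check such a pipeline could run might look like the sketch below; the backbone, proxy dataset, and acceptance threshold are placeholders, not a decided keras-cv setup.

```python
# Sketch of a downstream CI check: briefly fine-tune a pretrained backbone on a
# small proxy dataset and assert a minimum metric. All choices here are placeholders.
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE, THRESHOLD = 224, 0.80  # hypothetical acceptance threshold

def preprocess(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return tf.keras.applications.resnet50.preprocess_input(image), label

train = tfds.load("tf_flowers", split="train[:80%]", as_supervised=True)
val = tfds.load("tf_flowers", split="train[80%:]", as_supervised=True)
train = train.map(preprocess).batch(32).prefetch(tf.data.AUTOTUNE)
val = val.map(preprocess).batch(32).prefetch(tf.data.AUTOTUNE)

backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False  # cheap, few-epoch transfer learning
model = tf.keras.Sequential([backbone, tf.keras.layers.Dense(5, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train, validation_data=val, epochs=3)

_, accuracy = model.evaluate(val)
assert accuracy >= THRESHOLD, f"Downstream accuracy {accuracy:.3f} below expected {THRESHOLD}"
```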

Probably these are not the fairest stats, but honestly there is a high probability that we are going to port official/reference weights or implementations from another framework to contribute an e2e network arch here.

I don't foresee a problem if someone wanted to refer to the official implementation while contributing a model here. If an official implementation is available then referring to it for ensuring implementation correctness should be encouraged in my opinion.

I probably should have clarified more in my original comment. I am NOT suggesting against the training scripts for reproducibility.

more models but less reproducible/customizable

@sebastian-sz could you elaborate on why adding models populated with pre-trained params is less reproducible/customizable?

bhack commented 2 years ago

I don't foresee a problem if someone wanted to refer to the official implementation while contributing a model here. If an official implementation is available then referring to it for ensuring implementation correctness should be encouraged in my opinion.

What I meant here is that, since we will not have that many Keras reference models as a starting point from the research, at the same time we don't want to require too many sparse hacks from contributors to convert these models' weights.

So the point here is again: what kind of tools or tutorials could we collect here to help and accelerate this recurring activity? Do we think it will remain a purely manual craft? Can we create something to support, partially automate, or speed up the process?

bhack commented 2 years ago

It is hard to pick a side.

On this, my opinion is that it really depends on the model size. So we need to define what kind of model sizes are tractable by our CI jobs.

Yeah, why not. Setting up a CI/CD pipeline for getting downstream performance should be fairly straightforward, as we know how to do that in separate scripts or notebooks.

I also have some doubts that this is always possible for large models. What kind of resources do we, or a user, need to fine-tune or few-shot these models? Do we need to do it on a pruned/distilled version?

sayakpaul commented 2 years ago

Please note that I'm not talking about weight porting here anymore. It's purely about the implementation. Referring to the original model paper, framework documentation, official implementation, etc. -- these are anyway required for someone to implement an architecture. I don't know of other ways for references. Suggestions are welcome.

I also have some doubts that this is always possible for large models. What kind of resources do we, or a user, need to fine-tune or few-shot these models? Do we need to do it on a pruned/distilled version?

If we can set a limit on model size (a 1B-parameter model, for example), then it's easy to estimate the hardware infra needed to run transfer learning or zero-shot evaluation. But since we're starting with classification models (or even other models like segmentation, detection, etc.), I don't think we need a separate pipeline for zero-shot evaluation anyway.
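As a rough illustration of such a size gate (the 1B figure is just the number from the comment above, and the helper is hypothetical):

```python
import tensorflow as tf

PARAM_LIMIT = 1_000_000_000  # hypothetical cap for CI-trainable models

def fits_in_ci(model: tf.keras.Model, limit: int = PARAM_LIMIT) -> bool:
    n_params = model.count_params()
    approx_gb = n_params * 4 / 1e9  # float32 weights only; excludes activations/optimizer state
    print(f"{n_params:,} params, ~{approx_gb:.1f} GB of float32 weights")
    return n_params <= limit

fits_in_ci(tf.keras.applications.ResNet50(weights=None))  # ~25.6M params -> True
```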

bhack commented 2 years ago

Please note that I'm not talking about weight porting here anymore. It's purely about the implementation. Referring to the original model paper, framework documentation, official implementation, etc. -- these are anyway required for someone to implement an architecture. I don't know of other ways for references. Suggestions are welcome.

Instead, I was still talking about the pure weight-porting phase. For rebuilding the arch, I think we could just have a few tutorials/examples or best practices on how to approach the task, especially as I suppose we want to be modular in this repo. So we could also have some basic policy on how to contribute missing components to the library when we are going to re-implement or port a specific network arch.

E.g. in the Keras-cv library:

Having a fine-tuning script is possible, and Hugging Face is already doing this for some vision models, but yes, it really depends on the model size and we need to define some limits for our CI:

https://huggingface.co/blog/fine-tune-vit

sebastian-sz commented 2 years ago

@sebastian-sz could you elaborate on why adding models populated with pre-trained params is less reproducible/customizable?

@sayakpaul Yes, I think that in the case of weight transfer, the reproducibility and preprocessing are both tied to the reproducibility and preprocessing used in the original repository. If the original scripts are not reproducible, I'm not sure we can consider the converted weights reproducible, and the original scripts are not always reproducible.

If they use certain preprocessing, then the same preprocessing must be used in this implementation.

I am not against this approach - I have also done weights transfer a few times. I still think this is a tradeoff.

sayakpaul commented 2 years ago

Right on! I agree.

If the original scripts are not reproducible, I'm not sure we can consider the converted weights reproducible, and the original scripts are not always reproducible.

Do you mean if the original parameters fail to produce the reported numbers? If so, I think a fair thing to do is to vet those parameters first before proceeding. In my experience, luckily, I haven't encountered non-reproducible original parameters.

If they use certain preprocessing, then the same preprocessing must be used in this implementation.

For evaluation (on ImageNet-1k validation set, let's say), yes!

I am not against this approach - I have also done weights transfer a few times. I still think this is a tradeoff.

Oh absolutely. I can cite your work on ResNet-RS (amongst a few others) countless times.

bhack commented 2 years ago

If the original scripts are not reproducible, I'm not sure we can consider the converted weights reproducible, and the original scripts are not always reproducible.

An extra point is: if we have not trained the network from scratch with our own job, can the fine-tune/few-shot job's test on the downstream task/dataset be a good enough proxy for testing reproducibility?

sayakpaul commented 2 years ago

If the models (ported with the original parameters) can

1. match the metrics reported in the official sources, and
2. deliver sufficiently decent performance on downstream tasks,

then I do think that's validation enough for the architecture and also for the ported parameters. If not, I'd love to know about other ways. As far as training is concerned, I think it's a separate component for testing/validation.

I think I mentioned this in my original comment too.

bhack commented 2 years ago

What I mean is that if you have just ported the weights for your first point, you can only use the second point to validate "the learning" part of the model (e.g. preprocessing and all the learning-related hyperparameters).

When the size of the model is too large for "learning reproducibility", we need to rely only on point two.

So my point is: can we unify the learning reproducibility of the ported-weights case (where we could still theoretically run a training job from scratch in the CI) and of the too-large-to-train case using just the downstream task metrics/tests?

sayakpaul commented 2 years ago

Can we unify the learning reproducibility of the ported-weights case (where we could still theoretically run a training job from scratch in the CI) and of the too-large-to-train case using just the downstream task metrics/tests?

If training from scratch can be done then no need to worry about this point in the first place.

What I mean is that if you have just ported the weights for your first point, you can only use the second point to validate "the learning" part of the model (e.g. preprocessing and all the learning-related hyperparameters).

I don't think so. Ensuring the ported model meets the desired number should also be considered.

bhack commented 2 years ago

If training from scratch can be done then no need to worry about this point in the first place.

This requires that we can always expect a training job/script for models <= a threshold of FLOPs/size.

I don't think so. Ensuring the ported model meets the desired number should also be considered.

Here I was talking about what is exclusively related to the learning phase, not inference.

bhack commented 2 years ago

Just to mention another CVPR 2022 few-shot use case: https://github.com/microsoft/GLIP

Can we verify, as a proxy, just the reproducibility of the few-shot task with a CI job?

LukeWood commented 2 years ago

I have been wondering:

Is it mathematically possible to transform weights so that they expect a different input value range and still function? I believe this should be possible, right?
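(For reference, one way this can work when the change is an affine rescaling of the inputs and the first layer is linear: fold the rescaling into the first layer's kernel and bias. A toy sketch, not keras-cv code:)

```python
import tensorflow as tf

# Toy first layer of a ported model that was trained on inputs in [0, 1].
first_conv = tf.keras.layers.Conv2D(8, 3, padding="same")
first_conv.build((None, 32, 32, 3))
kernel, bias = first_conv.get_weights()

# We now want to feed raw [0, 255] pixels: x_old = scale * x_new + offset.
scale, offset = 1.0 / 255.0, 0.0
new_kernel = kernel * scale
# Per-output-channel bias correction for the offset (exact everywhere when
# offset == 0; only approximate at zero-padded borders otherwise).
new_bias = bias + offset * kernel.sum(axis=(0, 1, 2))
first_conv.set_weights([new_kernel, new_bias])
```

Per-channel mean/std normalization is also affine, so it can be folded the same way by scaling each input channel of the kernel separately.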

sayakpaul commented 2 years ago

@LukeWood I guess it's okay to close the loop on this, since work has already started on obtaining good ImageNet-1k checkpoints.

bhack commented 2 years ago

We could remove this from the pinned issues.