thevasudevgupta / gsoc-wav2vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2
https://thevasudevgupta.github.io/gsoc-wav2vec2/assets/final_report
Apache License 2.0

Ideas from the wav2vec2 repo #27

Open sayakpaul opened 2 years ago

sayakpaul commented 2 years ago

Initial action plans

Copying these things from the wav2vec2 repo for safe housekeeping.

Suggesting another important resource here: Knowledge distillation: A good teacher is patient and consistent. The paper introduces simple recipes to get the best possible student model. But the study is based on image classification models. So, it might be a fun exercise to think of ways in which this could be extended here.

A baseline approach to distil Wav2Vec2: Shrinking Bigfoot: Reducing wav2vec 2.0 footprint

Other useful resources

Model Optimization

Efficient Methods and Hardware for Deep Learning by Song Han
Lecture on Quantization by Pete Warden

For non-trivial model conversions in TFLite you can refer to the following repositories

https://github.com/tulasiram58827/ocr_tflite/
https://github.com/tulasiram58827/TTS_TFLite
https://github.com/sayakpaul/Adventures-in-TensorFlow-Lite

thevasudevgupta commented 2 years ago

Questions

Major challenges

Some ideas

Possible Plan

Without pre-training code, the DistilBERT / MobileBERT training strategies are ruled out, and we will have to distil the fine-tuned checkpoint only :(

sayakpaul commented 2 years ago

Does the teacher model need to be trained with MixUp to be able to apply MixUp during the knowledge distillation stage (function matching)?

This is actually a good research question to ask. In the vision literature, we generally start with a good teacher model. Noisy Student, MEAL (v1, v2), etc. all follow that approach.

In this regard, there's work from Google extending Noisy Student to the speech recognition domain: Improved Noisy Student Training for Automatic Speech Recognition.

However, none of these approaches use MixUp for any kind of interpolation.
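To make the MixUp-based function-matching idea concrete, here's a minimal sketch of a single distillation step where raw waveforms are mixed and the student is trained to match the teacher's predictions on the same mixed inputs. The model interfaces, the Beta-sampled mixing coefficient, and the KL objective are illustrative assumptions, not code from this repo:

```python
import numpy as np
import tensorflow as tf

def mixup_distillation_step(student, teacher, speech_batch, alpha=0.2):
    """Hypothetical sketch: mix raw waveforms, then train the student to match
    the teacher's predictions on the *same* mixed inputs (consistent teaching).
    `student` / `teacher` are assumed to map waveforms to per-frame logits."""
    lam = np.random.beta(alpha, alpha)
    indices = tf.random.shuffle(tf.range(tf.shape(speech_batch)[0]))
    mixed = lam * speech_batch + (1.0 - lam) * tf.gather(speech_batch, indices)

    teacher_probs = tf.stop_gradient(tf.nn.softmax(teacher(mixed), axis=-1))
    with tf.GradientTape() as tape:
        student_log_probs = tf.nn.log_softmax(student(mixed), axis=-1)
        # KL(teacher || student), averaged over frames and batch
        loss = tf.reduce_mean(tf.reduce_sum(
            teacher_probs * (tf.math.log(teacher_probs + 1e-8) - student_log_probs),
            axis=-1))
    grads = tape.gradient(loss, student.trainable_variables)
    return loss, grads
```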

It's hard to get code for pre-training Wav2Vec2. HuggingFace has put some effort into this but hasn't succeeded yet (see 1, 2, 3).

Oh is it! I thought FAIR had open-sourced that too.

Spec augmentation will prevent us from doing consistent teaching as suggested in the paper

I see. Could you briefly explain what SpecAugment is doing? I will have a better understanding then.

Can we do knowledge distillation after multiple layers and not just after the final layer? The model would be penalized at the layer level, and the overall objective would be to make every layer of distil-Wav2Vec2 equivalent to every 2 layers of the original Wav2Vec2.

We can definitely do that. Maybe we divide the student and teacher architectures into logical groups. In vision, these groups are often called pre-stem, stem, trunk, pre-final, and final. What we could then do is take the last layer from each of these groups from the teacher and have it matched with that of the student. Matching each and every layer might be computationally challenging. So, how we would do this logical grouping remains to be explored.
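As a rough sketch of this kind of group-wise matching (assuming both models can return per-layer hidden states of the same width, which is our assumption here, with each student layer paired against the last teacher layer of its group of two), the penalty could look like:

```python
import tensorflow as tf

def layer_matching_loss(teacher_hidden_states, student_hidden_states):
    """MSE between student layer i and teacher layer 2i + 1 (the last layer of
    each group of two). Both inputs are lists of (batch, time, width) tensors."""
    loss = 0.0
    for i, student_h in enumerate(student_hidden_states):
        teacher_h = tf.stop_gradient(teacher_hidden_states[2 * i + 1])
        loss += tf.reduce_mean(tf.square(teacher_h - student_h))
    return loss / len(student_hidden_states)
```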

thevasudevgupta commented 2 years ago

This is actually a good research question to ask. In the vision literature, we generally start with a good teacher model. Noisy Student, MEAL (v1, v2), etc. all follow that approach.

I will read the above papers.

However, none of these approaches use MixUp for any kind of interpolation.

We can try this out then.

Oh is it! I thought FAIR had open-sourced that too.

It's open-sourced and HuggingFace also has it, but it has some kind of bug (which hasn't been figured out yet), and many in the HuggingFace community are unable to train a good pre-trained model :(

So should we go with distilling the fine-tuned model for now??

I see. Could you briefly specify what Spec aug is doing? I will have a better understanding by then.

It just masks along the time span / feature span (so the outputs of the convolutional layers are masked at a few contiguous time steps).
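For concreteness, a minimal sketch of this kind of time masking over the feature-extractor outputs could look as follows; the mask lengths and the fact that one mask is shared across the whole batch are simplifications:

```python
import tensorflow as tf

def time_mask(features, max_mask_length=10, num_masks=2):
    """Zero out a few contiguous time steps of the conv feature-extractor outputs
    (shape: batch x time x channels). Parameter values are illustrative."""
    time = tf.shape(features)[1]
    positions = tf.range(time)
    for _ in range(num_masks):
        length = tf.random.uniform([], 1, max_mask_length + 1, dtype=tf.int32)
        start = tf.random.uniform([], 0, tf.maximum(time - length, 1), dtype=tf.int32)
        keep = tf.logical_or(positions < start, positions >= start + length)
        # Same mask for every example in the batch, for brevity.
        features = features * tf.cast(keep, features.dtype)[tf.newaxis, :, tf.newaxis]
    return features
```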

But spec augmentation can still be used with consistent knowledge distillation if we keep the convolutional (feature extractor) layers common between teacher & student. This should be possible as we probably don't need to train the convolutional layers (suggesting this because during the fine-tuning stage too we generally don't train them and keep them frozen, so here also we can possibly keep them frozen).

We can definitely do that. Maybe we divide the student and teacher architectures into logical groups. In vision, these groups are often called pre-stem, stem, trunk, pre-final, and final. What we could then do is take the last layer from each of these groups from the teacher and have it matched with that of the student. Matching each and every layer might be computationally challenging. So, how we would do this logical grouping remains to be explored.

Yeah. Possibly we can try out naive approaches like 1st layer of distilled model == 1st 2 layers of original wav2vec2 and similarly for other layers.

sayakpaul commented 2 years ago

It's open-sourced and HuggingFace also has it, but it has some kind of bug (which hasn't been figured out yet), and many in the HuggingFace community are unable to train a good pre-trained model :(

So the pre-training scripts provided by FAIR are buggy. That's concerning.

But spec augmentation can still be used with consistent knowledge distillation if we keep the convolutional (feature extractor) layers common between teacher & student. This should be possible as we probably don't need to train the convolutional layers (suggesting this because during the fine-tuning stage too we generally don't train them and keep them frozen, so here also we can possibly keep them frozen).

I see. But when we are distilling a teacher into a student, the student needs to be trained from scratch right? With respect to this context, I am not sure if this would still apply - "we probably don't need to train Convolutional layers". Or am I missing out on something?

Yeah. Possibly we can try out naive approaches like 1st layer of distilled model == 1st 2 layers of original wav2vec2 and similarly for other layers.

That makes sense. On top of that, if we could also consider the design choices realized by FRILL (Google Brain), maybe we can have something smaller that performs well enough on mobile platforms as well.

thevasudevgupta commented 2 years ago

I see. But when we are distilling a teacher into a student, the student needs to be trained from scratch right? With respect to this context, I am not sure if this would still apply - "we probably don't need to train Convolutional layers". Or am I missing out on something?

Since the transformer part has most of the parameters, I think we can aim to distil only that (reference: this paper). Now, if we don't need to distil the convolutional layers, then we can initialize the student's convolutional layers with the teacher's and keep them frozen during training.
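A minimal sketch of that initialization, assuming both models expose their conv stack under a `feature_extractor` attribute (a hypothetical name, not necessarily what this repo uses):

```python
def init_and_freeze_feature_extractor(student, teacher):
    """Copy the teacher's conv feature-extractor weights into the student and
    freeze them, so only the (smaller) transformer stack gets distilled."""
    for s_var, t_var in zip(student.feature_extractor.variables,
                            teacher.feature_extractor.variables):
        s_var.assign(t_var)
    # Frozen: these layers are excluded from the student's trainable variables.
    student.feature_extractor.trainable = False
```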

sayakpaul commented 2 years ago

I see, that is interesting. I think these questions will be well answered once we start the experimentation process.

Some immediate ideas that come to mind:

sayakpaul commented 2 years ago

Another relevant work: https://arxiv.org/abs/2108.10197.

thevasudevgupta commented 2 years ago

Hey @sayakpaul, sorry for keeping this project on hold earlier. I will try to put up an initial detailed plan by tomorrow and then we can discuss it further (or possibly start the experimentation process)!!

sayakpaul commented 2 years ago

Totally, Vasudev. I appreciate that.

sayakpaul commented 2 years ago

@vasudevgupta7 here's another (and probably the final) idea I'd provide before the experimentation loop begins:

This is more along the lines of DistilBERT, where the aim is to distill the original pre-trained model into a smaller one. My idea is: what if we apply something similar to compress the original wav2vec2 model? The distillation objective would still be roughly the same, i.e., to match the outputs produced by the teacher model instead of the original pretext task of wav2vec2. I think this would allow us to incorporate the well-known recipes from the knowledge distillation literature.

Moreover, we often only consider the high-confidence predictions produced by the teacher model for training the student model especially when we have unlabeled samples in the loop. We could incorporate such a scheme into our pipeline and see if that's beneficial.

Finally, works like AdaMatch make the thresholding scheme adaptive. Instead of fixating on a manually chosen threshold value, we first start with an initial seed value and then anneal from there.
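As a hedged sketch of how the last two ideas could be combined (confidence filtering plus an annealed threshold): the linear schedule below is purely illustrative and is not AdaMatch's actual relative-threshold rule.

```python
import tensorflow as tf

def confidence_mask(teacher_probs, step, total_steps,
                    seed_threshold=0.5, final_threshold=0.9):
    """Keep only samples where the teacher's max probability clears a threshold
    that is annealed from `seed_threshold` to `final_threshold` over training."""
    progress = tf.cast(step, tf.float32) / float(total_steps)
    threshold = seed_threshold + (final_threshold - seed_threshold) * progress
    max_probs = tf.reduce_max(teacher_probs, axis=-1)  # (batch,) teacher confidences
    return tf.cast(max_probs >= threshold, tf.float32)  # 1.0 = keep, 0.0 = drop
```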

Let me know if anything is unclear.

thevasudevgupta commented 2 years ago

Experiment-1

Distillation of pre-trained model

The original Wav2Vec2 was pre-trained with one head on top (note: we don't have code for that head yet), so if we want to do knowledge distillation we will have to do it without the head. Also, distilling the pre-trained model directly has the following problem: pre-trained BERT can be distilled easily, as the pre-training task & downstream tasks both handle text. But in Wav2Vec2, pre-training involves speech only, so if we distil the pre-trained model and then fine-tune the distilled model for the speech->text mapping, the model may perform badly as its capacity has decreased and it still needs to learn a lot about text.

Experiment-2

Distillation of fine-tuned model

Directly distill Wav2Vec2 and see what performance we get with this simple approach

Experiment-3

Introduce MixUp as suggested in the paper (Knowledge distillation: A good teacher is patient and consistent) and see if it improves performance. Several papers suggest that augmentation is important during student training, so this may help get better performance.

Experiment-4

Try out different student models

Try out an ALBERT-like Wav2Vec2 in which layers have shared parameters (ALBERT has only ~10M parameters while BERT has 110M). So if this works, the Wav2Vec2 size will be reduced by a significant amount (in terms of number of parameters).
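A minimal sketch of what such cross-layer parameter sharing could look like in Keras; `block` here is a hypothetical transformer-block layer, not a class from this repo:

```python
import tensorflow as tf

class SharedLayerEncoder(tf.keras.layers.Layer):
    """ALBERT-style parameter sharing: one transformer block reused N times, so
    depth is preserved but the parameter count is that of a single layer."""

    def __init__(self, block, num_layers):
        super().__init__()
        self.block = block          # single, reused transformer block
        self.num_layers = num_layers

    def call(self, hidden_states, training=False):
        for _ in range(self.num_layers):
            hidden_states = self.block(hidden_states, training=training)
        return hidden_states
```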

Experiment-5

Knowledge distillation after groups of layers. Instead of just trying to match the output distributions of the final layer of the teacher & student, we can try to match the output distributions of some intermediate layers as well.

Experiment-6

Approximate transformer attention with linear attention. We can possibly replace Wav2Vec2 attention with BigBird attention (say). But I doubt whether it's going to work, as our mean sequence length is around 768 and these linearly approximated models are generally built for longer sequences (>1024 at least). But we can try this approach out on some other datasets which possibly have very long sequences (I will search for those datasets when we perform this experiment).

Experiment-7

Check how we can make the convolutional layers faster. All other experiments focus on the transformer part of Wav2Vec2, but we should check whether we can do something with the conv layers. Here we can try out what you suggested on depthwise separable convolutions.
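As a quick illustration of that direction, here's a hedged Keras sketch comparing a standard Conv1D with its depthwise-separable counterpart; the channel and kernel sizes are illustrative and not taken from this repo's config:

```python
import tensorflow as tf

# Swapping a standard convolution for a depthwise-separable one is a one-line change.
standard = tf.keras.layers.Conv1D(512, kernel_size=3, strides=2)
separable = tf.keras.layers.SeparableConv1D(512, kernel_size=3, strides=2)

x = tf.random.normal((1, 1000, 512))  # (batch, time, channels) dummy features
print(standard(x).shape, separable(x).shape)               # same output shape
print(standard.count_params(), separable.count_params())   # ~787K vs ~264K params
```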

thevasudevgupta commented 2 years ago

@sayakpaul, sorry again for all the delay. I described my experimentation plan in the above comment based on our earlier discussion and a few more ideas. I think I should start the experimentation process now and we can discuss further as we go. Do you think that if we are able to get some of the above ideas working, we will be able to get a paper into top conferences like EMNLP or NeurIPS???

Here is the order I am planning to follow:

Feel free to re-arrange the order of the experiments if you think we should perform some experiment earlier.

thevasudevgupta commented 2 years ago

This is more along the lines of DistilBERT, where the aim is to distill the original pre-trained model into a smaller one. My idea is: what if we apply something similar to compress the original wav2vec2 model? The distillation objective would still be roughly the same, i.e., to match the outputs produced by the teacher model instead of the original pretext task of wav2vec2. I think this would allow us to incorporate the well-known recipes from the knowledge distillation literature.

This will be covered in Experiment-1 (mentioned above)

Moreover, we often only consider the high-confidence predictions produced by the teacher model for training the student model especially when we have unlabeled samples in the loop. We could incorporate such a scheme into our pipeline and see if that's beneficial.

@sayakpaul, Can you elaborate more on this?

Finally, works like AdaMatch make the thresholding scheme adaptive. Instead of fixating on a manually chosen threshold value, we first start with an initial seed value and then anneal from there.

I will go through this paper and include it in the experiments.

sayakpaul commented 2 years ago

The original Wav2Vec2 was pre-trained with one head on top (note: we don't have code for that head yet), so if we want to do knowledge distillation we will have to do it without the head. Also, distilling the pre-trained model directly has the following problem: pre-trained BERT can be distilled easily, as the pre-training task & downstream tasks both handle text. But in Wav2Vec2, pre-training involves speech only, so if we distil the pre-trained model and then fine-tune the distilled model for the speech->text mapping, the model may perform badly as its capacity has decreased and it still needs to learn a lot about text.

Makes a lot of sense, @vasudevgupta7. Thanks for clarifying. If this is the case, then it would indeed be challenging to think of doing knowledge distillation in the pre-training phase.

Distillation of fine-tuned model

Fine-tuned on speech recognition?

Experiment-6

How about Linformer? In my view, a longer mean sequence length should not be a mandatory requirement in that case.
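For reference, a rough sketch of the Linformer trick (compressing the key/value length dimension with learned projections so the cost drops from O(n^2) to O(n*k)); all shapes and names below are illustrative:

```python
import tensorflow as tf

def linformer_attention(q, k, v, proj_k, proj_v):
    """q, k, v: (batch, seq_len, d_model); proj_k, proj_v: (seq_len, proj_len)."""
    k = tf.einsum("bnd,nk->bkd", k, proj_k)   # compress keys along the time axis
    v = tf.einsum("bnd,nk->bkd", v, proj_v)   # compress values along the time axis
    scale = tf.sqrt(tf.cast(tf.shape(q)[-1], q.dtype))
    scores = tf.einsum("bnd,bkd->bnk", q, k) / scale
    return tf.einsum("bnk,bkd->bnd", tf.nn.softmax(scores, axis=-1), v)
```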

@sayakpaul, Can you elaborate more on this?

Let's say we have a pre-trained image classification model. Given an image, it produces softmaxed outputs like so: [0.56, 0.04, 0.3, 0.1]. This distribution does not contain any high-confidence predictions if we set the confidence threshold to 0.9, so we could roughly conclude that the model is not that confident about its predictions. It generally happens when the given input image is a difficult one to infer on. That can happen for a number of reasons: the image could be an anomaly, the image could be OOD, and so on. Of course, there are statistical methods to quantify uncertainty estimates, but I hope this gives you an idea. Let me know if anything is still unclear.
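In code, the check described above boils down to something like this (numbers taken from the example):

```python
import numpy as np

probs = np.array([0.56, 0.04, 0.3, 0.1])  # softmaxed teacher outputs from above
threshold = 0.9
keep = probs.max() >= threshold           # False -> treat as a low-confidence sample
```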

Do you think that if we are able to get some of the above ideas working, we will be able to get a paper into top conferences like EMNLP or NeurIPS???

Definitely. EMNLP sounds like the best one to go for, but I wonder: if our work does not include a whole lot of text, would it still be relevant?

sayakpaul commented 2 years ago

@vasudevgupta7 Facebook also launched this: https://ai.facebook.com/blog/textless-nlp-generating-expressive-speech-from-raw-audio. Mentioning it because it might be helpful as a pretext objective in case we want to distill the pre-trained model in the first place.

thevasudevgupta commented 2 years ago

@sayakpaul, so sorry for the very delayed response. There won't be any delays from now on.

Fine-tuned on speech recognition?

Yeah

How about Linformer? In my view, a longer mean sequence length should not be a mandatory requirement in that case.

I have a few comments on this, but I will put them up when I come to Experiment-6.

Definitely. EMNLP sounds like the best one to go for, but I wonder: if our work does not include a whole lot of text, would it still be relevant?

Yeah, I will check out speech conferences in more detail.

sayakpaul commented 2 years ago

https://www.interspeech2021.org/ seems like a good one.