samsucik / knowledge-distil-bert

Master's thesis project in collaboration with Rasa, focusing on knowledge distillation from BERT into different very small networks and analysis of the students' NLP capabilities.

an inquiry about your knowledge-distil-bert talk #3

Open hadywalied opened 3 years ago

hadywalied commented 3 years ago

Hello, I'm Hady, an ECE student at Cairo University's school of engineering. I've been working on a distilled version of a text summarization model called PEGASUS. I found your L3-AI talk on YouTube and I'd be really thankful if you could answer some of my questions.

I was wondering whether the technique you used to distill BERT was applied to a version already fine-tuned on a downstream task, or to the plain BERT_large version. That is, if I have a model that is pre-trained on the C4 dataset, for example, and then fine-tuned on the summarization task by training it on the Gigaword dataset, should I start distilling from the pre-trained version or from the downstream (fine-tuned) version?

Also, if I use a teacher assistant technique, should the assistant have the same architecture as the teacher model, or can I alter the architecture?

I look forward to your kind reply.

samsucik commented 3 years ago

Hi Hady,

First, sorry for the late reply.

In general, you can find a lot more details about my work in the actual thesis report https://github.com/samsucik/knowledge-distil-bert/blob/master/latex/s1513472-minf2.pdf (and, of course, all the code is in that repo as well).

I was using BERT_base, fine-tuning it on downstream tasks, and then distilling the fine-tuned model into student models. To your question (whether to distil the pre-trained or the fine-tuned model), the answer really depends on your situation. Have a look at section 2.3.2 in the report where I mention some advantages and disadvantages of each option :-)
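To be concrete about what the distillation itself involves: at its core, the student is trained to match the teacher's temperature-softened output distribution, usually mixed with the ordinary hard-label loss. A minimal sketch (not the exact code from this repo; the temperature and mixing weight are just illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2

    # Hard-target term: ordinary cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```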

As for teacher assistants, the intermediate model (the assistant) doesn't have to have the same architecture as the teacher. It's the same as with normal knowledge distillation, where the student can have a different architecture from the teacher. However, in some cases there can be advantages to having the same architecture, especially if your distillation tries to match not just the teacher's logits but also its internal representations. Indeed, an LSTM student/assistant cannot learn well the internal representations encoded in a transformer-based teacher's self-attention heads, because there is no such component in an LSTM.
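To illustrate what matching internal representations means in practice: you typically add a loss term that pulls a chosen student layer's hidden states towards a chosen teacher layer's hidden states, projecting between widths if they differ. A rough sketch (the dimensions and the learned projection are placeholders, not code from my repo):

```python
import torch.nn as nn

# Hypothetical widths: BERT-base hidden states are 768-dimensional,
# the student is assumed to be narrower.
TEACHER_DIM, STUDENT_DIM = 768, 300

# A learned projection maps the student's hidden states into the teacher's space.
projection = nn.Linear(STUDENT_DIM, TEACHER_DIM)
mse = nn.MSELoss()

def representation_matching_loss(student_hidden, teacher_hidden):
    # student_hidden: (batch, seq_len, STUDENT_DIM)
    # teacher_hidden: (batch, seq_len, TEACHER_DIM)
    return mse(projection(student_hidden), teacher_hidden)
```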

Let me know if you've got more specific questions, I'll try to reply faster this time.

Best of luck!

Sam


hadywalied commented 3 years ago

Hi Sam, I'm really thankful for your reply. I've read the report and gone through the scripts in the repo. Lately I've been working on the pseudo-labeling, and I've planned to use the same transformer-based architecture as the teacher for both the TA and the student; I also thought of copying layers from the teacher to the student as initial weights. I'll give things a try and see what I get, and if I get stuck I'll be in touch with you. Thank you again, and I hope you have great days ❤️
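For reference, the pseudo-labeling step I have in mind looks roughly like this (just a sketch; the checkpoint name and generation settings are placeholders, not a tested pipeline): the fine-tuned teacher generates summaries for unlabeled documents, and those summaries become training targets for the student.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-gigaword"  # placeholder teacher checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
teacher = PegasusForConditionalGeneration.from_pretrained(model_name)

def pseudo_label(documents):
    # Tokenize a batch of unlabeled documents and let the teacher summarize them.
    inputs = tokenizer(documents, truncation=True, padding=True, return_tensors="pt")
    summary_ids = teacher.generate(**inputs, num_beams=4, max_length=64)
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```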

samsucik commented 3 years ago

Ah, right, I see. Let me know how things go for you.

As for initialising the student from the teacher's weights, there's the obvious downside (the student has to be as "wide" as the teacher), but there's also some potential "magic" involved (to be explored?), especially when one thinks about what the DistilBERT authors did: they took the teacher's weights but left out some layers. If you have trained layers A, B, C and you only take A, C, then they shouldn't really fit on top of each other (in terms of the "neural pathways"), right? But the authors did that anyway, and then trained such an initialised student further, thus maybe repairing the "broken" neural pathways, or maybe just utilising the structure that had been present in each of the teacher's layers individually (as a result of pre-training), or maybe the initialisation wasn't actually needed at all... I'd be very curious to see whether you get some advantage from initialising the students from the teacher (compared to random initialisation).
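Just to make the DistilBERT-style trick concrete, the initialisation is roughly this (a sketch using the Hugging Face transformers API, not code from my repo or from theirs): copy the embeddings and every other encoder layer from a 12-layer teacher into a 6-layer student of the same width, and only then continue training the student with the distillation objective.

```python
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")

# Same width as the teacher, but only half the layers.
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Embeddings are copied as-is.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())

# Teacher layers 0, 2, 4, 6, 8, 10 become student layers 0..5, so consecutive
# student layers were never trained to sit on top of each other ("broken pathways").
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )
```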


hadywalied commented 3 years ago

Got it, I need to investigate the effects of copying the teacher's weights further. The idea is that I'm trying to write a survey paper on distillation for the abstractive summarization task, and I took PEGASUS as a test example, so I still don't know whether it's a good test subject or not. But it's still a matter of trying, right? As for the copying mechanism I intend to use, I mimicked the "copying alternating layers" approach that Hugging Face used in distilling T5, but I haven't used it yet. I'll try it, compare results, and keep you updated.