codetreras opened 1 year ago
Thank you! How does DistilBERT differ from BERT?
You're welcome, the project is awesome. The main differences are the configuration and the layers' identifiers. Architecturally, DistilBERT has no token type embeddings and no pooler. Check this image: the equivalent layers are in blue, the dissimilar ones in orange.
At first I thought about including DistilBERT as a "variation" of BERT, but that would considerably increase the complexity of the code; here some redundancy is necessary to keep maintenance easier. Let me know your thoughts.
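For reference, a minimal sketch of how the two configurations could be modeled in Go. The JSON keys mirror the Hugging Face config.json files (n_layers, n_heads, dim, hidden_dim for DistilBERT vs. num_hidden_layers, num_attention_heads, hidden_size, intermediate_size for BERT); the struct and package names are hypothetical, not the project's actual ones.

```go
package config

// BertConfig mirrors the relevant keys of a Hugging Face BERT config.json.
type BertConfig struct {
	HiddenSize            int `json:"hidden_size"`
	NumHiddenLayers       int `json:"num_hidden_layers"`
	NumAttentionHeads     int `json:"num_attention_heads"`
	IntermediateSize      int `json:"intermediate_size"`
	TypeVocabSize         int `json:"type_vocab_size"` // token type embeddings (absent in DistilBERT)
	MaxPositionEmbeddings int `json:"max_position_embeddings"`
	VocabSize             int `json:"vocab_size"`
}

// DistilBertConfig mirrors the relevant keys of a Hugging Face DistilBERT
// config.json. Note the renamed keys and the absence of type_vocab_size
// (no token type embeddings) and of any pooler-related setting.
type DistilBertConfig struct {
	Dim                   int `json:"dim"`        // hidden size
	NLayers               int `json:"n_layers"`   // number of transformer layers
	NHeads                int `json:"n_heads"`    // number of attention heads
	HiddenDim             int `json:"hidden_dim"` // FFN intermediate size
	MaxPositionEmbeddings int `json:"max_position_embeddings"`
	VocabSize             int `json:"vocab_size"`
}
```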
@marco-nicola what do you think, friend? I'd go for it, but I'm a bit worried about code duplication for just a few differences.
Preferably just use the DistilBERT config and extend the existing BERT code, so there's no need for duplicate code.
Got it. In that case, extending converter/preprocessing.go and converter/mapper.go for BERT would be the proper way to manage the differences in layer identifiers, together with the configuration. Let me know what you think; I can modify the PR so you can review this approach.
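To make the idea concrete, here is a rough, self-contained sketch of the kind of identifier mapping the extended mapper could perform. The Hugging Face parameter names below are the real ones, but the helper function and how it would slot into converter/mapper.go are assumptions, not the project's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// distilBertToBertKey rewrites a DistilBERT parameter name into the
// BERT-style name the existing BERT mapper already understands, so most
// of the BERT conversion code can be reused unchanged.
// Hypothetical helper: the real converter/mapper.go may organize this differently.
func distilBertToBertKey(name string) string {
	replacements := [][2]string{
		{"distilbert.transformer.layer.", "bert.encoder.layer."},
		{"distilbert.embeddings.", "bert.embeddings."},
		{"attention.q_lin.", "attention.self.query."},
		{"attention.k_lin.", "attention.self.key."},
		{"attention.v_lin.", "attention.self.value."},
		{"attention.out_lin.", "attention.output.dense."},
		{"sa_layer_norm.", "attention.output.LayerNorm."},
		{"ffn.lin1.", "intermediate.dense."},
		{"ffn.lin2.", "output.dense."},
		{"output_layer_norm.", "output.LayerNorm."},
	}
	for _, r := range replacements {
		name = strings.ReplaceAll(name, r[0], r[1])
	}
	return name
}

func main() {
	fmt.Println(distilBertToBertKey("distilbert.transformer.layer.0.attention.q_lin.weight"))
	// Output: bert.encoder.layer.0.attention.self.query.weight
}
```

With a mapping like this, the DistilBERT-specific code stays confined to the converter, and the model-side layers can keep using the BERT identifiers.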
I'm looking into supporting flan-t5-*, but so far I'm stuck: there are differences in the positional encoder (a different weight key), so prompting currently fails because some input ends up nil (it seems on the second pass).
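Not a fix, but a sketch of the kind of defensive lookup that could surface the problem earlier: try the known key variants for the positional encoder and fail loudly at load time instead of leaving a nil tensor to be discovered while prompting. The key names, the weights map, and the helper are illustrative assumptions.

```go
package main

import "fmt"

// findWeight returns the first tensor found under any of the candidate keys,
// or an error listing what was tried, so a missing or renamed key (as with
// the flan-t5-* positional encoder) fails during conversion rather than
// producing a nil input at prompt time.
func findWeight(weights map[string][]float32, candidates ...string) ([]float32, error) {
	for _, key := range candidates {
		if w, ok := weights[key]; ok {
			return w, nil
		}
	}
	return nil, fmt.Errorf("positional encoder weight not found; tried %v", candidates)
}

func main() {
	weights := map[string][]float32{
		// Assumed example: T5-style models store the relative position bias here.
		"encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight": {0.1, 0.2},
	}
	w, err := findWeight(weights,
		"shared.positional_encoding.weight", // hypothetical alternative key
		"encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(w))
}
```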
@mooijtech I'm on vacation with my family, so it's a bit difficult for me to follow up on this right now. I'll get back to you next week and we'll figure out together how to proceed with flan-t5-*.
Based on BERT's code for the language modeling and text encoding tasks, these changes add support for the DistilBERT architecture (#7).