helpmefindaname / transformer-smaller-training-vocab

Temporarily remove unused tokens during training to save RAM and speed up training.
https://helpmefindaname.github.io/transformer-smaller-training-vocab/
MIT License

New Tokenizer for mdeberta-v3-base #14

Open zynos opened 4 months ago

zynos commented 4 months ago

Thank you for your repo!

Would it be possible to add a tokenizer for this model? https://huggingface.co/microsoft/mdeberta-v3-base

Thanks in advance :)
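
For context, a quick check (assuming `transformers` and `sentencepiece` are installed) shows that this checkpoint resolves to the sentencepiece-based DebertaV2 tokenizer classes, which is what the discussion below turns on:

```python
from transformers import AutoTokenizer

# microsoft/mdeberta-v3-base ships a sentencepiece model, so AutoTokenizer
# resolves to the DebertaV2 tokenizer classes rather than a plain
# vocab-file-based tokenizer.
tok = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
print(type(tok).__name__)  # e.g. DebertaV2TokenizerFast or DebertaV2Tokenizer
```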

helpmefindaname commented 3 months ago

Hi @zynos, and sorry for the late response. I only now had time to look into this. The problem with the DebertaV2Tokenizer lies in its heavy dependence on sentencepiece, and I am not familiar with a good way to overwrite the vocabulary there.

So I'll leave this issue open in case anyone knows the sentencepiece library better than I do and is willing to solve this problem.
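
For anyone picking this up, here is a rough sketch of the kind of vocabulary surgery involved, assuming one edits the serialized sentencepiece `ModelProto` directly (the helper name is hypothetical and this is not part of the library; it also needs the `protobuf` package):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2


def shrink_sentencepiece_model(model_path: str, used_tokens: set) -> bytes:
    """Serialize a copy of the spm model keeping only `used_tokens`
    plus all special (non-NORMAL) pieces such as <unk> and control tokens."""
    proto = sp_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())

    kept = []
    for piece in proto.pieces:
        if (piece.type != sp_pb2.ModelProto.SentencePiece.NORMAL
                or piece.piece in used_tokens):
            copy = sp_pb2.ModelProto.SentencePiece()
            copy.CopyFrom(piece)
            kept.append(copy)

    del proto.pieces[:]
    proto.pieces.extend(kept)
    return proto.SerializeToString()


# The shrunk model can be loaded again via the public API:
# processor = spm.SentencePieceProcessor()
# processor.LoadFromSerializedProto(shrink_sentencepiece_model("spm.model", used))
```

Note the catch: dropping pieces renumbers every remaining piece ID, so the transformer's embedding matrix would have to be reordered to match the new IDs (and restored afterwards), and that mapping step is exactly the part without a clean solution here.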