megagonlabs / bunkai

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
https://pypi.org/project/bunkai/
Apache License 2.0

update JanomeSubwordsTokenizer for transformers>=4.34 #336

Open mh-northlander opened 7 months ago


Fixes #335.

Since transformers>=4.34, PreTrainedTokenizer.__init__ requires self.vocab to be set. Move the super(...).__init__ call to the end of JanomeSubwordsTokenizer.__init__ (and use the local unk_token argument instead of self.unk_token before that call), following the corresponding changes to BertTokenizer.

Also move the call to self.add_tokens, since it requires super(...).__init__ to have completed.
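The reordering above can be sketched as follows. This is a minimal, self-contained illustration, not the actual bunkai code: a stub class stands in for transformers.PreTrainedTokenizer, and all names other than vocab, unk_token, get_vocab, and add_tokens are hypothetical.

```python
# Sketch of the init-ordering fix: since transformers>=4.34 the base
# tokenizer's __init__ reads the subclass's vocab, so the vocab must be
# built first and super().__init__ called last.

class StubPreTrainedTokenizer:
    """Stand-in for transformers.PreTrainedTokenizer (hypothetical)."""

    def __init__(self, unk_token="[UNK]", **kwargs):
        self.unk_token = unk_token
        # The base __init__ touches the vocab, so self.vocab (via
        # get_vocab) must already exist when this runs.
        self.total_vocab_size = len(self.get_vocab())

    def get_vocab(self):
        raise NotImplementedError

    def add_tokens(self, tokens):
        # Relies on state set up by __init__, so it is only safe to call
        # after super().__init__ has completed.
        self.total_vocab_size += len(tokens)


class JanomeSubwordsTokenizerSketch(StubPreTrainedTokenizer):
    def __init__(self, vocab, unk_token="[UNK]"):
        # 1. Build self.vocab BEFORE calling the base __init__, using the
        #    local unk_token argument (self.unk_token is not set yet).
        self.vocab = dict(vocab)
        if unk_token not in self.vocab:
            self.vocab[unk_token] = len(self.vocab)
        # 2. Call super().__init__ at the END of __init__.
        super().__init__(unk_token=unk_token)
        # 3. add_tokens only after super().__init__ is done.
        self.add_tokens(["<extra>"])

    def get_vocab(self):
        return self.vocab


tokenizer = JanomeSubwordsTokenizerSketch({"a": 0, "b": 1})
print(tokenizer.total_vocab_size)  # 4: a, b, [UNK], <extra>
```

Calling super().__init__ first instead would raise NotImplementedError here (and, in real transformers>=4.34, fail because self.vocab is missing), which is exactly the breakage this PR works around.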