huggingface / transformers


A fast tokenizer for BertJapaneseTokenizer #12381

Open dkawahara opened 3 years ago

dkawahara commented 3 years ago

We would like a fast tokenizer for BertJapaneseTokenizer. The current token classification example script (run_ner.py) requires a fast tokenizer, but BertJapaneseTokenizer does not have one. Because of this, we cannot do token classification for Japanese using cl-tohoku's BERT models.
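
A minimal sketch of the symptom (my own illustration, not from the issue; it assumes fugashi/ipadic are installed and uses the cl-tohoku checkpoint mentioned above):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Even with use_fast=True, AutoTokenizer can only return the slow class here,
# because no fast tokenizer is registered for BertJapaneseTokenizer.
tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese", use_fast=True)
print(type(tok).__name__)                        # BertJapaneseTokenizer
print(isinstance(tok, PreTrainedTokenizerFast))  # False, so run_ner.py's fast-tokenizer check fails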

KoichiYasuoka commented 2 years ago

Hi @dkawahara, I've just written a tentative BertJapaneseTokenizerFast:

from transformers import BertJapaneseTokenizer

class BertJapaneseTokenizerFast(BertJapaneseTokenizer):
  def __call__(self, text, text_pair=None, return_offsets_mapping=False, **kwargs):
    v = super().__call__(text=text, text_pair=text_pair, return_offsets_mapping=False, **kwargs)
    if return_offsets_mapping:
      import tokenizations  # pip install pytokenizations
      # Normalize single-text input to the batched form
      if type(text) == str:
        z = zip([v["input_ids"]], [text], [text_pair] if text_pair else [""])
      else:
        z = zip(v["input_ids"], text, text_pair if text_pair else [""] * len(text))
      w = []
      for a, b, c in z:
        # Align the produced tokens back to the characters of text (+ text_pair)
        a2b, b2a = tokenizations.get_alignments(self.convert_ids_to_tokens(a), b + c)
        x = []
        for i, t in enumerate(a2b):
          if t == []:
            # Unaligned token: special tokens get (0, 0); an [UNK] gets the
            # span between its nearest aligned neighbours
            s = (0, 0)
            if a[i] == self.unk_token_id:
              j = [[-1]] + [t for t in a2b[0:i] if t > []]
              k = [t for t in a2b[i + 1:] if t > []] + [[len(b + c)]]
              s = (j[-1][-1] + 1, k[0][0])
          elif t[-1] < len(b):
            # Token from the first text
            s = (t[0], t[-1] + 1)
          else:
            # Token from text_pair: shift offsets back into its own coordinates
            s = (t[0] - len(b), t[-1] - len(b) + 1)
          x.append(s)
        w.append(list(x))
      v["offset_mapping"] = w[0] if type(text) == str else w
    return v

But it requires the pytokenizations module, and in fact it's not fast. See the details in my diary (written in Japanese); next I will try to implement BatchEncoding.encodings.
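
For reference, a toy illustration (my own example, not from the thread) of what pytokenizations.get_alignments returns, which is what the workaround above builds the offsets from:

import tokenizations  # pip install pytokenizations

tokens = ["[CLS]", "平", "##気", "[SEP]"]  # hypothetical wordpiece output
text = "平気"
a2b, b2a = tokenizations.get_alignments(tokens, text)
# a2b[i] lists the character indices of `text` aligned with tokens[i];
# unalignable tokens such as [CLS], [SEP] or [UNK] get an empty list,
# which is why the __call__ above needs the fallback branch for [UNK].
print(a2b)  # e.g. [[], [0], [1], []]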

Ezekiel25c17 commented 1 year ago

@KoichiYasuoka Thank you very much for providing this workaround. Without the return_offsets_mapping option, Japanese token classification tasks were always a pain.

I would like to point out a small bug when processing text that contains consecutive [UNK] tokens. For example,

text = "𠮟られても平気なの☺ ☺☺"
tokenizer = BertJapaneseTokenizerFast.from_pretrained("cl-tohoku/bert-base-japanese")
d = tokenizer(text, return_offsets_mapping=True)
for offset in d['offset_mapping']:
    print((offset[0], offset[1]), text[offset[0]:offset[1]])

would print results like the following:

(0, 0) 
(0, 1) 𠮟
(1, 3) られ
(3, 4) て
(4, 5) も
(5, 6) 平
(6, 7) 気
(7, 8) な
(8, 9) の
(9, 13) ☺ ☺☺
(9, 13) ☺ ☺☺
(0, 0) 

Because each unaligned [UNK] falls back to the span between its nearest aligned neighbours, consecutive [UNK]s end up sharing the same offsets. I still can't figure out how to improve the mapping for each individual [UNK] token. I am just wondering if you have any ideas on this issue. Many thanks.

KoichiYasuoka commented 1 year ago

Hi @Ezekiel25c17, I've just written a BertMecabTokenizerFast:

from transformers import BertTokenizerFast
from transformers.models.bert_japanese.tokenization_bert_japanese import MecabTokenizer

class MecabPreTokenizer(MecabTokenizer):
  def mecab_split(self, i, normalized_string):
    t = str(normalized_string)
    z = []
    e = 0
    # Locate each MeCab token inside the original string so the offsets stay intact
    for c in self.tokenize(t):
      s = t.find(c, e)
      if s < 0:
        z.append((0, 0))
      else:
        e = s + len(c)
        z.append((s, e))
    return [normalized_string[s:e] for s, e in z if e > 0]
  def pre_tokenize(self, pretok):
    pretok.split(self.mecab_split)

class BertMecabTokenizerFast(BertTokenizerFast):
  def __init__(self, vocab_file, **kwargs):
    from tokenizers.pre_tokenizers import PreTokenizer, BertPreTokenizer, Sequence
    super().__init__(vocab_file=vocab_file, **kwargs)
    d = kwargs["mecab_kwargs"] if "mecab_kwargs" in kwargs else {"mecab_dic": "ipadic"}
    # Run MeCab word segmentation first, then the standard BERT pre-tokenizer
    self._tokenizer.pre_tokenizer = Sequence([PreTokenizer.custom(MecabPreTokenizer(**d)), BertPreTokenizer()])

It is derived from the MecabPreTokenizer of deberta-base-japanese-juman-ud-goeswith. Does it work well?
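
A rough usage sketch (my own, assuming the class above is defined in the current session, fugashi/ipadic are installed, and the cl-tohoku vocabulary loads through BertTokenizerFast.from_pretrained):

tokenizer = BertMecabTokenizerFast.from_pretrained("cl-tohoku/bert-base-japanese")
text = "𠮟られても平気なの☺ ☺☺"
enc = tokenizer(text, return_offsets_mapping=True)
# Print each token next to the span of the original text it came from
for token, (s, e) in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), enc["offset_mapping"]):
    print(token, (s, e), text[s:e])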

Ezekiel25c17 commented 1 year ago

Hi @KoichiYasuoka, thank you for responding so quickly. This worked like a charm!

KoichiYasuoka commented 1 year ago

I've rewritten BertMecabTokenizerFast to disable do_lower_case and tokenize_chinese_chars:

from transformers import BertTokenizerFast
from transformers.models.bert_japanese.tokenization_bert_japanese import MecabTokenizer

class MecabPreTokenizer(MecabTokenizer):
  def mecab_split(self, i, normalized_string):
    t = str(normalized_string)
    e = 0
    z = []
    # Locate each MeCab token inside the original string so the offsets stay intact
    for c in self.tokenize(t):
      s = t.find(c, e)
      e = e if s < 0 else s + len(c)
      z.append((0, 0) if s < 0 else (s, e))
    return [normalized_string[s:e] for s, e in z if e > 0]
  def pre_tokenize(self, pretok):
    pretok.split(self.mecab_split)

class BertMecabTokenizerFast(BertTokenizerFast):
  def __init__(self, vocab_file, do_lower_case=False, tokenize_chinese_chars=False, **kwargs):
    from tokenizers.pre_tokenizers import PreTokenizer, BertPreTokenizer, Sequence
    super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, tokenize_chinese_chars=tokenize_chinese_chars, **kwargs)
    d = kwargs["mecab_kwargs"] if "mecab_kwargs" in kwargs else {"mecab_dic": "ipadic"}
    # MeCab word segmentation first, then the standard BERT pre-tokenizer
    self._tokenizer.pre_tokenizer = Sequence([PreTokenizer.custom(MecabPreTokenizer(**d)), BertPreTokenizer()])

and now BertMecabTokenizerFast tokenizes "平気" into "平" and "##気". See the details in my diary (written in Japanese).
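
For completeness, a quick comparison sketch against the original slow tokenizer (my own check, assuming the same checkpoint and MeCab setup as above):

from transformers import BertJapaneseTokenizer

slow = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
fast = BertMecabTokenizerFast.from_pretrained("cl-tohoku/bert-base-japanese")

text = "平気なの"
print(slow.tokenize(text))  # wordpiece output of the original slow tokenizer
print(fast.tokenize(text))  # should now agree, e.g. ['平', '##気', 'な', 'の']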