huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

It would be better to add a function to train additional tokens for a pre-trained tokenizer, especially for languages like Chinese. #15153

Closed zhangbo2008 closed 2 years ago

zhangbo2008 commented 2 years ago

https://github.com/huggingface/transformers/blob/96881729ce83cfc8e5fa04c903ee4296ad17cfbb/src/transformers/models/bert/tokenization_bert.py#L117

Lately, I have been using BERT to train an NER model for Chinese. I found that many Chinese characters in my data cannot be tokenized by the BERT model; literally, they are tokenized to [PAD] and therefore do not get a definite word embedding vector. So it would be better to add a function to the tokenization class that can train new tokens and extend the old tokenizer. It would be useful. I am trying this.
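A quick way to reproduce the problem (a minimal sketch, assuming the bert-base-chinese checkpoint and a rare character such as 锶 as the example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
char = "锶"  # a rare Chinese character used here as an example
print(char in tokenizer.get_vocab())            # False if the character is out of vocabulary
print(tokenizer.tokenize(f"这是{char}元素"))      # an out-of-vocabulary character shows up as [UNK]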

zhangbo2008 commented 2 years ago

It was generated with the source code for building the BERT tokenizer; I am looking for that source code.

LysandreJik commented 2 years ago

Have you tried with the bert-base-chinese checkpoint?

Also cc @SaulLu :)

zhangbo2008 commented 2 years ago

> Have you tried with the bert-base-chinese checkpoint?
>
> Also cc @SaulLu :)

Sure, I have tried it. For example, 锶 is not in the BERT vocab.

I have used a dummy solution:

from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-chinese-ner")
model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-chinese-ner")


def dummy_way_to_find_all_new_tokens_to_bert_tokenizer(fp, tokenizer):
    """Scan a text file and collect every Chinese character missing from the tokenizer's vocab."""

    def _is_chinese_char(cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like all of the other languages.
        return (
            (0x4E00 <= cp <= 0x9FFF)
            or (0x3400 <= cp <= 0x4DBF)
            or (0x20000 <= cp <= 0x2A6DF)
            or (0x2A700 <= cp <= 0x2B73F)
            or (0x2B740 <= cp <= 0x2B81F)
            or (0x2B820 <= cp <= 0x2CEAF)
            or (0xF900 <= cp <= 0xFAFF)
            or (0x2F800 <= cp <= 0x2FA1F)
        )

    vocab = tokenizer.get_vocab()
    out = []
    with open(fp, encoding="utf-8") as f:
        for line in f:
            for char in line:
                if _is_chinese_char(ord(char)) and char not in vocab and char not in out:
                    out.append(char)
    return out


# This only needs to be run once: add all missing characters in a single add_tokens call,
# then resize the model's embedding matrix to match the enlarged vocabulary.
tokenizer.add_tokens(dummy_way_to_find_all_new_tokens_to_bert_tokenizer("data/train1.txt", tokenizer))
model.resize_token_embeddings(len(tokenizer))
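A quick sanity check after resizing (a minimal sketch; note that the embedding rows of the newly added tokens are freshly initialized and still need to be learned during fine-tuning):

# Assuming 锶 was one of the added characters, it should now map to a single
# real token id instead of the unknown token.
print(tokenizer.tokenize("锶"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("锶")))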
SaulLu commented 2 years ago

@zhangbo2008, your technique of adding tokens to an already-trained tokenizer because you want to fine-tune a model with it seems very good to me.

However, I'm not sure I understand your request / demand / question in this issue. :relaxed:

zhangbo2008 commented 2 years ago

> @zhangbo2008, your technique of adding tokens to an already-trained tokenizer because you want to fine-tune a model with it seems very good to me.
>
> However, I'm not sure I understand your request / demand / question in this issue. ☺️

Yes, you get the point. But I think my method is naive; there should be a better way to solve the problem, because in some languages, such as English, the trained tokens are not single characters. So I think a better way to get a bigger vocab is to adapt the BPE algorithm, but I haven't found BPE algorithm code yet.
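For reference, a minimal sketch of the BPE merge-learning step being discussed (plain Python, not the tokenizers implementation): repeatedly count adjacent symbol pairs in the corpus and merge the most frequent pair into a new vocabulary entry.

from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from {word: frequency}; each word starts as a sequence of characters."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair becomes a new token
        merges.append(best)
        # Apply the merge to every word in the working vocabulary.
        merged_vocab = {}
        for symbols, freq in vocab.items():
            new_symbols, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    new_symbols.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            key = tuple(new_symbols)
            merged_vocab[key] = merged_vocab.get(key, 0) + freq
        vocab = merged_vocab
    return merges

print(learn_bpe_merges({"lower": 5, "lowest": 2, "newer": 6}, num_merges=4))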

SaulLu commented 2 years ago

Thank you, I understand your request a little better!

As you may have seen, transformers does not directly implement the algorithms that allow training a tokenizer (and thus obtaining the vocabulary, merge rules, etc.). A method like train_new_from_iterator actually uses a feature of the tokenizers library (which has the advantage of being much faster, as that library is written in Rust).

I opened an issue in that library to see whether we can implement this kind of feature in tokenizers. It seems quite feasible for tokenization algorithms like BPE (for which an issue was already opened previously) but harder for tokenization algorithms like Unigram. It is therefore probably better to advance on this subject in the tokenizers library. :relaxed:
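For context, training a brand-new tokenizer with the existing API looks roughly like this (a minimal sketch, assuming a fast tokenizer and an in-memory corpus); note that this retrains the whole vocabulary rather than extending the old one, which is what this issue asks for:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

# The training corpus: a generator of batches of texts (here a list with a single small batch).
corpus = [["这是一个例子", "锶是一种化学元素"]]

# Trains a new tokenizer with the same pipeline (normalizer, pre-tokenizer, tokenization
# algorithm) as the old one, but with a vocabulary learned from the new corpus.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=30000)
print(new_tokenizer.tokenize("锶是一种化学元素"))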

hj5167 commented 2 years ago

Hello, I wonder whether we can utilize the unused tokens in the tokenizer, because many tokenizers reserve many unused tokens, but I don't know how to implement it. Could anyone tell me how to do it?
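There is no official API for this as far as I know, but one workaround that is sometimes used with BERT checkpoints is to edit vocab.txt and overwrite [unused##] entries with the new tokens before reloading the tokenizer. A rough sketch (the local path and the new-token list are placeholders):

from transformers import BertTokenizer

# Placeholders: a locally saved copy of the checkpoint and the tokens we want to add.
vocab_path = "local-bert-base-chinese/vocab.txt"
new_tokens = ["锶"]

with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Overwrite [unused##] slots with the new tokens, one token per slot.
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, token in zip(unused_slots, new_tokens):
    vocab[slot] = token

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# Reload from the local directory: the vocabulary size (and hence the embedding
# matrix) is unchanged, so no call to resize_token_embeddings is needed.
tokenizer = BertTokenizer.from_pretrained("local-bert-base-chinese")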

zhangbo2008 commented 2 years ago

I am trying to read and learn from the Rust code in the tokenizers library. Rust is difficult for me to use; it is hard to configure VS Code to run Rust code, and perhaps there is no IDE like PyCharm for Rust.

zhangbo2008 commented 2 years ago

I have already written a solution in Python; you can check it here: https://github.com/zhangbo2008/bpe_algorithm_can_finetune_tokenizer

zhangbo2008 commented 2 years ago

Here is the example. You definitely need to download py_bpe from the URL above; I changed some code from another project.

import tqdm
from py_bpe import BpeTokenizer
from pathlib import Path
savepath = Path("penguin_of_doom.vocab")
corpus = """
    hi every1 im new!!!!!!! *holds up spork* my name is katy but u can call me t3h PeNgU1N oF d00m!!!!!!!! lol…as u can see im very random!!!! thats why i came here, 2 meet random ppl like me ^_^… im 13 years old (im mature 4 my age tho!!) i like 2 watch invader zim w/ my girlfreind (im bi if u dont like it deal w/it) its our favorite tv show!!! bcuz its SOOOO random!!!! shes random 2 of course but i want 2 meet more random ppl =) like they say the more the merrier!!!! lol…neways i hope 2 make alot of freinds here so give me lots of commentses!!!!
    DOOOOOMMMM!!!!!!!!!!!!!!!! <--- me bein random again ^_^ hehe…toodles!!!!!
    love and waffles,
    t3h PeNgU1N oF d00m
"""

learn_bpe_args = dict(
    vocab_size=1000,
    pairable_chars="a-zA-Z0-9",
)

bpet = BpeTokenizer.from_corpus(corpus, savepath, learn_bpe_args=learn_bpe_args)
unk_char = "%"
tokens = bpet.tokenize("t3h PeNgU1N oF d00m"+unk_char)
print(tokens)

finetune_corpus='''hi every1 im new sssdlaj ssdsajlfk ssdsafjkl的斯拉克福建烤老鼠大解放路卡啥的'''
token_before_finetune=bpet.encode(finetune_corpus)
print(token_before_finetune)#[22, 22, 22, 25, 23, 18, 0, 12, 22, 22, 123, 18, 0, 23, 28, 33, 12, 22, 22, 123, 220, 0, 33, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print('we see there are too many zeros, which mean unk')
# ========================== added code for the extension: finetune a new tokenizer
the_factor_of_new_added_token_divided_unk_number = 1.5  # Since we are extending the tokenizer, the new corpus contains many UNKs under the old tokenizer. The number of new tokens to add is set relative to that UNK count via this factor; a higher factor adds more new tokens. The factor must be greater than 1.0.
new_tokenizer = bpet.finetune_tokenizer(finetune_corpus, the_factor_of_new_added_token_divided_unk_number)

token_after_finetune=new_tokenizer.encode(finetune_corpus)
print(token_after_finetune)#[239, 240, 244, 223, 12, 239, 123, 241, 246, 33, 12, 239, 123, 220, 223, 254, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 224]
print("we see we have no unk for the token_after_finetune")
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.