google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Fine-tuning on a Chinese dataset gives poor results #140

Closed voidful closed 5 years ago

voidful commented 5 years ago

I tried to train a machine comprehension (MC) model on a Chinese dataset, in both Traditional and Simplified Chinese, at both word and character level, with both multilingual BERT and Chinese BERT, but none of the experiments gave good results.

Compared to BiDAF (in both speed and accuracy), with BERT I got 0 EM at both word and character level, ~60% F1 at word level, and ~4% F1 at character level.

Here is my config: train_batch_size 12, learning_rate 3e-5, num_train_epochs 1.0, max_seq_length 512, doc_stride 128.

Since sequences in Chinese are longer than in English, could that be why it does not do well on character-level Chinese?

songtaoshi commented 5 years ago

Hello, could you show me your fine-tuning code and your data format? I am also fine-tuning on a Chinese dataset.

voidful commented 5 years ago

I use the PyTorch version for fine-tuning: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_squad.py

The data format is just like SQuAD 1.1: {"title": "梵文", "id": "1147", "paragraphs": [{"context": "在歐洲,梵語的學術研究,由德國學者陸特和漢斯雷頓開創。後來威廉·瓊斯發現印歐語系,也要歸功於對梵語的研究。此外,梵語研究,也對西方文字學及歷史語言學的發展,貢獻不少。1786年2月2日,亞洲協會在加爾各答舉行。會中,威廉·瓊斯發表了下面這段著名的言論:「梵語儘管非常古老,構造卻精妙絕倫:比希臘語還完美,比拉丁語還豐富,精緻之處同時勝過此兩者,但在動詞詞根和語法形式上,又跟此兩者無比相似,不可能是巧合的結果。這三種語言太相似了,使任何同時稽考三者的語文學家都不得不相信三者同出一源,出自一種可能已經消逝的語言。基於相似的原因,儘管缺少同樣有力的證據,我們可以推想哥德語和凱爾特語,雖然混入了迥然不同的語彙,也與梵語有著相同的起源;而古波斯語可能也是這一語系的子裔。」", "id": "1147-5", "qas": [{"id": "1147-5-1", "question": "陸特和漢斯雷頓開創了哪一地區對梵語的學術研究?", "answers": [{"id": "1", "text": "歐洲", "answer_start": 1}, {"id": "2", "text": "歐洲", "answer_start": 1}]},
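
For reference, here is a minimal Python sketch of reading such a file. The file name train_zh.json and the top-level {"data": [...], "version": ...} wrapper are assumptions based on the standard SQuAD 1.1 layout; answer_start is treated as a character offset into the context.

import json

with open("train_zh.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for para in article["paragraphs"]:
        context = para["context"]
        for qa in para["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]
                # The answer text should be recoverable from the context by character offset.
                assert context[start:start + len(ans["text"])] == ans["text"]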

wykdg commented 5 years ago

My experimental results on Chinese have not improved either. Maybe we can set up a discussion group to discuss this in depth.

monanahe commented 5 years ago

I get good results on several Chinese datasets, including industry classification, emotion classification, and multiple-choice question answering. Currently I am trying BERT features + other models.

Please add my WeChat: teardrops123. I want to join your group >0<

voidful commented 5 years ago

I have created a group: https://groups.google.com/d/forum/bert_mc_chinese so that we can discuss in more depth ~

ZizhenWang commented 5 years ago

I fine-tuned the model on a Simplified Chinese MRC dataset and reached SOTA with the default hyper-parameters. Could the worse result come from the Traditional Chinese corpus?

I also find that a bigger batch_size leads to better results.
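
(A side note on batch size: when GPU memory limits the per-step batch size, gradient accumulation can simulate a larger effective batch, which is what the --gradient_accumulation_steps flag in the PyTorch run_squad.py is for. Below is a minimal, self-contained PyTorch sketch of the idea; the toy model and data are placeholders, not code from either repository.)

import torch
from torch import nn

# Toy stand-ins just to make the loop runnable.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
train_dataloader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(16)]

accumulation_steps = 8  # effective batch size = per-step batch size * accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(train_dataloader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient matches one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()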

crapthings commented 5 years ago

Could someone share an example use case? ~

sohuren commented 5 years ago

Which Chinese MRC dataset did you use?


voidful commented 5 years ago

In my dataset there are lots of unknown characters, e.g. '小 金 毛 [UNK] 並' for '小 金 毛 鼴 並'. Moreover, I found this in the multilingual README: "Because Chinese does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece. This means that Chinese is effectively character-tokenized. Note that the CJK Unicode block only includes Chinese-origin characters and does not include Hangul Korean or Katakana/Hiragana Japanese, which are tokenized with whitespace+WordPiece like all other languages." And this issue: https://github.com/google-research/bert/issues/66
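
A minimal sketch of the character handling described in that passage (this is not the official tokenization.py, and only the main CJK Unified Ideographs ranges are checked here):

def is_cjk_char(cp):
    # Main CJK Unified Ideographs ranges; Hangul and kana are deliberately excluded.
    return (0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF) or \
           (0xF900 <= cp <= 0xFAFF) or (0x20000 <= cp <= 0x2A6DF)

def tokenize_chinese_chars(text):
    # Put spaces around every CJK character so a later whitespace split
    # yields one token per character; other scripts are left untouched.
    out = []
    for ch in text:
        out.append(" " + ch + " " if is_cjk_char(ord(ch)) else ch)
    return "".join(out)

print(tokenize_chinese_chars("小金毛鼴並").split())
# ['小', '金', '毛', '鼴', '並'] -- any character missing from vocab.txt then maps to [UNK]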

Maybe that's the reason why. BERT still has some work to do here; it seems ELMo is a better choice for now.

maksna commented 5 years ago

I can't fine-tune for a Chinese NER task. Can somebody help me?

monanahe commented 5 years ago

I can't fine-tune for a Chinese NER task. Can somebody help me?

You can add my WeChat. I have a friend who is fine-tuning an NER task too.

songtaoshi commented 5 years ago

@voidful Hello voidful, I am still confused about why we should add whitespace around the Chinese characters during tokenization. It seems that in the original code, it adds spaces and then strips them. I am really confused by this.

voidful commented 5 years ago

@songtaoshi BERT tokenizes Chinese sentences at the character level; it produces one token per character.
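
Concretely (just an illustration of the behaviour, not the repository code): the added spaces are only a device so that a plain whitespace split produces one token per Chinese character; the split then discards them, which is the "add space, then strip" you are seeing.

spaced = "".join(" " + ch + " " for ch in "小金毛鼴並")
print(spaced)          # ' 小  金  毛  鼴  並 '
print(spaced.split())  # ['小', '金', '毛', '鼴', '並'] -- the spaces never end up in the tokens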

TonTonTWMAN commented 5 years ago

You can try tweaking this parameter (num_train_epochs 1.0); based on my own experiments on some self-built SQuAD-format datasets, increasing this number helps!

voidful commented 5 years ago

You can try tweaking this parameter (num_train_epochs 1.0); based on my own experiments on some self-built SQuAD-format datasets, increasing this number helps!

Not working QQ

python run_squad.py \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train_zh_char_seg.json \
  --predict_file $SQUAD_DIR/dev_zh_char_seg.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 1.0 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --max_query_length 256 \
  --output_dir ../verbose_squad/ \
  --fp16 \
  --gradient_accumulation_steps 8 \
  --verbose_logging \
  --optimize_on_cpu

TonTonTWMAN commented 5 years ago

I haven't run the PyTorch version. I used BERT's TF code with the Chinese bert-base, fine-tuned it directly, and ran it on some Chinese SQuAD-format data I made myself!

python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \  <--- I mean change the number of this parameter, e.g. to 5.0 or something else!
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/

BloodD commented 5 years ago

Why does everyone have to speak English?

ArthurRizar commented 5 years ago

I think the Chinese model does not work.

industry classification

Fine-tuning or fixed features? I got a worse result on a multi-class dataset, but I also got a good result on a binary-class dataset.

voidful commented 5 years ago

I haven't run the PyTorch version. I used BERT's TF code with the Chinese bert-base, fine-tuned it directly, and ran it on some Chinese SQuAD-format data I made myself!

I finally got GPUs to try more epochs. Got no improvement...... I tried up to 50.

qiu-nian commented 5 years ago

I haven't run the PyTorch version. I used BERT's TF code with the Chinese bert-base, fine-tuned it directly, and ran it on some Chinese SQuAD-format data I made myself!

@ideex2 Hello, is it enough to just add the --num_train_epochs parameter when running, or does the code need any other changes? I am also using the TF version.

mashagua commented 5 years ago

I can't fine-tune for a Chinese NER task. Can somebody help me?

You can add my WeChat. I have a friend who is fine-tuning an NER task too.

What is your WeChat?

voidful commented 5 years ago

Now I can get around 80% F1 after 1 epoch~ It seems that the latest update has already fixed that issue.

geekboood commented 5 years ago

@voidful I encountered the same problem. How did you fix it?

voidful commented 5 years ago

@voidful I encountered the same problem. How did you fix it?

  • update to the newest version
  • make sure gradients do not overflow
  • use char level
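
For the gradient-overflow point, here is a minimal generic PyTorch sketch of skipping the optimizer step when fp16 gradients have overflowed to inf/NaN. This is a plain training-loop check, not the exact logic of run_squad.py; model and optimizer are assumed to exist.

import torch

def grads_are_finite(model):
    # True only if every parameter gradient is free of inf/NaN values.
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return False
    return True

# Inside the training loop, after loss.backward():
# if grads_are_finite(model):
#     optimizer.step()
# else:
#     print("gradient overflow, skipping this step")
# optimizer.zero_grad()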

qiu-nian commented 5 years ago

@voidful I encountered the same problem. How did you fix it?

  • update to the newest version
  • make sure gradients do not overflow
  • use char level

@voidful Hello, is the character level better than the word level? Does it make a big difference?

voidful commented 5 years ago

@voidful Hello, is the character level better than the word level? Does it make a big difference?

According to this, BERT uses the character level for Chinese, and intuitively BERT will perform better at the character level: https://github.com/google-research/bert/blob/master/multilingual.md#tokenization

If you want it to be more sensitive to words, or want to train it at the word level, you can take a look at this: https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE

qiu-nian commented 5 years ago

If you want it to be more sensitive to words, or want to train it at the word level, you can take a look at this: https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE

@voidful Thank you for your reply. I had noticed Baidu's work before but gave up because of the framework. Maybe I should look at it again.

WenTingTseng commented 4 years ago

@voidful Where can I download a Chinese QA corpus in a format like SQuAD 1.1?