Closed. htw2012 closed this issue 5 years ago.
create_pretraining_data.py already does.

I have some additional questions.
1. Will the BERT method work on other models, like a text-RNN or something else?
2. Will it help if we pretrain the task on the dataset we use for finetuning? (about 5M sentences)
3. In my short_text_classification task, max_sequence_length is only 20. Can I use BERT Chinese?
@Continue7777 I think the answer to all three of your questions is YES. I also pruned the vocab from 22k+ to 6500 entries, keeping only the most frequently used Chinese characters. You can check my fork.
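For illustration only, here is a minimal sketch of what that kind of vocab pruning might look like. The file names, the 6500-entry budget, and the assumption of a character-level vocab are mine, not taken from the fork mentioned above.

```python
# Hypothetical sketch (not the fork's actual script) of pruning a character-level
# BERT vocab down to the most frequently used Chinese characters.
from collections import Counter

SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

def prune_vocab(vocab_path, corpus_path, out_path, keep=6500):
    # Count how often each character appears in the training corpus.
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip())
    top_chars = {c for c, _ in counts.most_common(keep)}

    # Keep special tokens plus vocab entries that are among the top characters,
    # preserving the original vocab order.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [t.rstrip("\n") for t in f]
    pruned = [t for t in vocab if t in SPECIAL_TOKENS or t in top_chars]

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(pruned) + "\n")
    return pruned

# Note: pruning the vocab of an existing checkpoint also requires slicing the
# corresponding rows out of the word-embedding matrix so the indices still match.
```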
@jacobdevlin-google Is there any special preprocessing for BookCorpus? For example, removing TOCs? Also, is a book treated as a document, or is every chapter treated as a document?
I have some additional questions.

- How long would it take to pre-train 100M sentences (each 1~127 Chinese characters long) from scratch on a Horovod cluster with 8*8 V100 GPUs?
- Is it possible to accelerate training by grouping sentences so that `seq_length` is the same inside a batch but differs between batches, and making the model accept a variable `seq_length`? (A bucketing sketch is shown after this comment.)
- Should I remove the `CLS` token and `segment_ids` if my purpose is just a language model?

Thank you!
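Regarding the bucketing idea above, here is a minimal sketch of what grouping by length could look like, assuming the data is already tokenized. This is my own illustration, not how the released pipeline works; the stock create_pretraining_data.py pads every instance to a fixed max_seq_length instead.

```python
# Minimal sketch of length-bucketed batching: sentences inside a batch share one
# seq_length, while different batches may use different lengths.
from collections import defaultdict

def bucket_batches(tokenized_sentences, batch_size):
    buckets = defaultdict(list)
    for sent in tokenized_sentences:
        buckets[len(sent)].append(sent)      # group sentences by exact length
    for seq_length, sents in sorted(buckets.items()):
        for i in range(0, len(sents), batch_size):
            yield seq_length, sents[i:i + batch_size]

# Usage (hypothetical data):
# for seq_length, batch in bucket_batches(tokenized, batch_size=64):
#     ...  # build a batch tensor of shape [len(batch), seq_length]
```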
Hello, I have the same question. Have you tried to pre-train 100M Chinese sentences from scratch using 8*8 GPUs? Could you tell me about the training time? Thank you!
@yyx911216 Not yet. I've been busy with the acoustic model. I'd be glad to share any info in the future.
@jacobdevlin-google One question about wiki Chinese preprocessing: using the wiki Chinese dump, I got 12.5M lines (sentences) after pre-processing. However, the above post said you got 25M lines. Can you let me know what's wrong with my steps?

What I did:
- download https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
- use WikiExtractor.py to extract the paragraphs. I got 14.7M lines in total.
- use a script to split the paragraphs into sentences, ignoring empty lines in the wiki extracts and adding a new line per wiki article only. I got 12.5M lines in total.
@LiweiPeng What script are you using for sentence segmentation? That might lead to a different number of sentences.
@eric-haibin-lin I used something very similar to https://blog.csdn.net/blmoistawinde/article/details/82379256. I added some extra delimiters, such as ';', as sentence tokens.
I found the reason for my issue: I needed to include both simplified and traditional Chinese versions. That makes 25M lines in total.
@LiweiPeng Hi, did you successfully reproduce the results of Google's BERT? I was trying to do so, but the model I pretrained scores 1-3 points lower than Google's BERT.
I ran pretraining several times with different parameters. The best result I got was 77.0 on XNLI, very close to the published Google result.
@LiweiPeng Thanks for your reply. I would appreciate it if you could tell me which parameters you changed in your experiments. I would like to give it a try.
The parameters I adjusted were batch size and learning rate. The recent paper "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes" has good research on this topic: https://arxiv.org/abs/1904.00962
@LiweiPeng Thank you very much. I have read that paper. May I know the batch size and learning rate of the best model you trained?
The batch size I used was 2304, with learning rate 2.4e-4. I used 16 V100 GPUs and trained for 400k steps.
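For a rough sense of scale, here is my own back-of-the-envelope arithmetic for that setup. It uses the ~25M-line zhwiki figure discussed above as an assumption and ignores that create_pretraining_data.py packs several sentences into each instance and duplicates data with its dupe factor.

```python
# Back-of-the-envelope arithmetic for the setup described above.
batch_size = 2304
train_steps = 400_000
instances_seen = batch_size * train_steps     # 921,600,000 training instances
corpus_lines = 25_000_000                     # assumed ~25M zhwiki lines (see above)
print(instances_seen / corpus_lines)          # roughly 37 passes over the corpus
```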
@LiweiPeng Thank you. BTW, which delimiters did you use to split the wiki text (after WikiExtractor processing) into sentences? I used `re.split('([;|\;|。|!|!|?|\?|;])',line)` but could only get 11.4M lines. I found that the final files contain both simplified and traditional Chinese, so this is not caused by the problem you met before.
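One small note on that pattern: inside a character class, `|` is a literal, so the regex above also splits on `|` and lists some delimiters twice. A cleaned-up sketch of the same splitting idea (the exact delimiter set is my guess at what was intended) could look like this:

```python
import re

# Split a line of Chinese text into sentences on end-of-sentence punctuation.
# Delimiters roughly follow the regex quoted above: 。！？；plus ASCII ! ? ;
_SENT_DELIMS = re.compile(r'([。！？；!?;])')

def split_sentences(line):
    parts = _SENT_DELIMS.split(line.strip())
    sentences = []
    # re.split with a capturing group alternates text and delimiter, so
    # re-attach each delimiter to the text that precedes it.
    for i in range(0, len(parts) - 1, 2):
        sent = (parts[i] + parts[i + 1]).strip()
        if sent:
            sentences.append(sent)
    if parts and parts[-1].strip():
        sentences.append(parts[-1].strip())
    return sentences

# split_sentences("今天天气很好。我们去爬山吧！") -> ["今天天气很好。", "我们去爬山吧！"]
```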
@ItachiUchihaVictor I'm also confused about the number of sentences. Have you figured it out?
> @jacobdevlin-google One question about wiki Chinese preprocessing: using the wiki Chinese dump, I got 12.5M lines (sentences) after pre-processing. However, the above post said you got 25M lines. Can you let me know what's wrong with my steps?
> What I did:
> - download https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
> - use WikiExtractor.py to extract the paragraphs. I got 14.7M lines in total.
> - use a script to split the paragraphs into sentences, ignoring empty lines in the wiki extracts and adding a new line per wiki article only. I got 12.5M lines in total.

> I found the reason for my issue: I needed to include both simplified and traditional Chinese versions. That makes 25M lines in total.
@LiweiPeng Does it mean that the pre-training of the Chinese version of BERT only uses the wiki corpus (not BookCorpus)?
Hi, I have some questions about the details of the Chinese BERT-Base model.
Thank you in advance!