BERT-Base Chinese data details

htw2012 commented 5 years ago

Hi, I have some questions about the detail of Chinese BERT-Base model.

Is the model trained base on entire Chinese wikipedia raw text ?
Are there additional pre-processing steps for raw corpus?
How many (lines) sentences in the pre-training samples ？
How long did it take for you to finish the pre-training process?
In addition, if we have a large domain-specific corpus , we can train the pre-training model as follows: a) train with task-specific corpus. b) train with task-specific corpus and general corpus such as wikipedia. which way is better?

Thank you in advance!

jacobdevlin-google commented 5 years ago

Yes, a processed version of that which only keeps the text portions without formatting. In both traditional and simplified.
Pre-processed to remove tables/images/formatting.
25M sentences.
It was done using Google's parallel processing so only a few minutes. Probably a few hours if done on a single machine.
It depends on how big it is. The best approach will probably be to run pre-training first on Wikipedia and then for more epochs on only your corpus. Or even better, to use the models we released and then to run pre-training for more steps (unless you want to do everything from scratch).

chenjiasheng commented 5 years ago

I have some additional questions.

How long would it take to pre-train 100M sentences (each with length 1~127 Chinese characters) from scratch on a horovod cluster with 8*8 V100 GPUs?
Is it possible to accelerate the training speed, by grouping sentences in the way that seq_length is the same inside a batch while differs between different batches, and making the model accept different seq_length?
Should I remove the CLS token and segment_ids if my purpose is just a language model?

Thank you!

jacobdevlin-google commented 5 years ago

Not sure, I've never trained on GPUs.
I would recommend packing multiple sentences until (approximately) the max sequence length, which is what create_pretraining_data.py already does.
It doesn't hurt to include them in case you might want to use the model for other stuff, but if you only care about predicting missing words then it probably doesn't matter. But keep in mind that BERT doesn't give you a true "language model", it just allows you to predict single missing wordpieces.

Continue7777 commented 5 years ago

I have some additional questions. 1.Wether the bert method will work on other model,just like text-rnn or something else. 2.Will it help,if we pretrain the task on dataset we use to finetune? (sentence about 5M) 3.In my short_text_classification task ,max_sequence_length is only 20,can i use bert chinese?

chenjiasheng commented 5 years ago

@Continue7777 I think the answers to all you three questions is YES. I also pruned the vocab from 22k+ to 6500, with top used Chinese characters only. You can check my fork.

eric-haibin-lin commented 5 years ago

@jacobdevlin-google Is any special preprocessing for BookCorpus? For example, removing TOCs? Also, is a book treated as a document, or is every chapter treated as a document?

y111x commented 5 years ago

I have some additional questions.

How long would it take to pre-train 100M sentences (each with length 1~127 Chinese characters) from scratch on a horovod cluster with 8*8 V100 GPUs?

Is it possible to accelerate the training speed, by grouping sentences in the way that seq_length is the same inside a batch while differs between different batches, and making the model accept different seq_length?

Should I remove the CLS token and segment_ids if my purpose is just a language model?

Thank you!

hello，I have the same question .Have you try to pre-train 100M Chinese sentences from scratch use 8*8 GPUS? Could you tell me about the training time？ Thank you

chenjiasheng commented 5 years ago

@yyx911216 Not yet. Being busy with the acoustic model. Welcome and pleased to share the future info.