Yes, we plan to publish the article in a month or so. However, I don't expect the details of training to be particularly surprising or novel to anyone who follows the research on transformer architectures in English :)
When it comes to language model pre-training, most of our corpus comes from CommonCrawl, but we don't just use the raw text extracted from WARC. To produce a high-quality corpus, we apply heavy pre-processing and cleaning that includes:
In the case of LM fine-tuning, the only non-trivial "trick" is a simple resampling technique used to counter the class imbalance in highly imbalanced datasets (CBD, DYK, PSC). Samples of the minority class in the training set are duplicated and/or some samples of the majority class are randomly discarded (see the resample parameter in the Evaluation section of the README).
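For illustration, a rough sketch of that kind of resampling (the function name and ratio parameters here are hypothetical, not the repo's actual implementation):

```python
import random

def resample(samples, labels, minority_label, oversample_ratio=2, keep_majority_prob=0.5):
    """Hypothetical illustration: duplicate minority-class samples and
    randomly discard a fraction of majority-class samples."""
    resampled = []
    for x, y in zip(samples, labels):
        if y == minority_label:
            # duplicate each minority-class sample `oversample_ratio` times
            resampled.extend([(x, y)] * oversample_ratio)
        elif random.random() < keep_majority_prob:
            # keep only a random subset of the majority class
            resampled.append((x, y))
    random.shuffle(resampled)
    return resampled
```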
Regarding language model pre-training, I assume "batch size" refers to the number of samples and "update steps" to the number of batches. What hardware did you use? Was the 30k batch size achieved with many GPUs/TPUs or with gradient accumulation? How many times (epochs) was the whole dataset used for training?
We used 8 x Nvidia V100; the batch size was achieved with gradient accumulation. In the case of the large model, the training reached 25 epochs. I don't remember the specific number for the base model, but it was close to 200 epochs.
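For anyone curious how a large effective batch is reached this way, here is a minimal illustrative PyTorch sketch of gradient accumulation (the toy model, data, and accumulation settings are placeholders, not the actual training code):

```python
import torch
from torch import nn

# Toy stand-in for the LM; effective batch = n_gpus * micro_batch * accum_steps.
model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 16   # hypothetical number of micro-batches per optimizer update
micro_batch = 8

model.train()
optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, 128)              # dummy micro-batch
    y = torch.randint(0, 2, (micro_batch,))
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()                # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one weight update per effective batch
        optimizer.zero_grad()
```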
Thanks.
Were the same parameters used for fine-tuning the large model?
Do you plan to publish the details of the training process? The results are excellent and it would be very beneficial for the research community to know the details of training.