sdadas / polish-roberta

RoBERTa models for Polish
GNU Lesser General Public License v3.0

Publication with details of training? #1

Closed. apohllo closed this issue 4 years ago.

apohllo commented 4 years ago

Do you plan to publish the details of the training process? The results are excellent and it would be very beneficial for the research community to know the details of training.

sdadas commented 4 years ago

Yes, we plan to publish the article in a month or so. However, I don't expect the details of training to be particularly surprising or novel to anyone who follows the research on transformer architectures for English :)

When it comes to language model pre-training, most of our corpus comes from CommonCrawl, but we don't just use the raw text extracted from the WARC archives. To produce a high-quality corpus, we do some heavy pre-processing and cleaning that includes:

In the case of LM fine-tuning, the only non-trivial "trick" is a simple resampling technique used to counter the class imbalance in the highly imbalanced datasets (CBD, DYK, PSC). Samples of the minority class in the training set are duplicated and/or some samples of the majority class are randomly discarded (see the resample parameter in the Evaluation section of the README).
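The repository's actual resampling code is not shown in this thread, but the idea can be illustrated with a minimal sketch. The function name, defaults, and the 2x undersampling ratio below are illustrative assumptions, not taken from polish-roberta:

```python
import random
from collections import defaultdict

def resample(examples, labels, oversample_minority=True, undersample_majority=False, seed=42):
    """Duplicate minority-class samples and/or randomly drop majority-class ones."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)

    sizes = {y: len(xs) for y, xs in by_label.items()}
    majority = max(sizes, key=sizes.get)
    minority_size = min(sizes.values())

    balanced = []
    for y, xs in by_label.items():
        if y == majority and undersample_majority:
            # randomly discard majority samples, e.g. down to 2x the minority size
            xs = rng.sample(xs, min(len(xs), 2 * minority_size))
        elif y != majority and oversample_minority:
            # duplicate minority samples until the class matches the majority size
            xs = xs + [rng.choice(xs) for _ in range(sizes[majority] - len(xs))]
        balanced.extend((x, y) for x in xs)

    rng.shuffle(balanced)
    return balanced

# usage on a toy imbalanced training set
texts = ["positive example"] * 10 + ["negative example"] * 90
y = [1] * 10 + [0] * 90
train = resample(texts, y)   # roughly 90 samples per class after oversampling
```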

djstrong commented 4 years ago

Regarding language model pre-training, I assume "batch size" refers to the number of samples and "update steps" to the number of batches. What hardware did you use? Was the 30k batch size achieved with many GPUs/TPUs or with gradient accumulation? How many times (epochs) was the whole dataset used for training?

sdadas commented 4 years ago

We used 8 x Nvidia V100; the batch size was achieved with gradient accumulation. In the case of the large model, training reached 25 epochs. I don't remember the specific number for the base model, but it was close to 200 epochs.
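To make the terminology from the question above concrete: one "update step" is one optimizer step over an effective batch of roughly 30k sequences, which gradient accumulation simulates by summing gradients over many small forward/backward passes. Below is a minimal single-process PyTorch sketch; the per-GPU micro-batch size, the number of accumulation steps, and the toy model/data are assumptions for illustration, since the thread only states 8 x V100 and gradient accumulation:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

per_gpu_batch = 16                      # micro-batch per forward pass (assumed)
num_gpus = 8
accum_steps = 234                       # assumed so that 16 * 8 * 234 is roughly 30k
effective_batch = per_gpu_batch * num_gpus * accum_steps   # samples per update step

# toy stand-ins for the language model, optimizer and data
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))
loader = DataLoader(data, batch_size=per_gpu_batch)

# single-process loop; in the real setup the num_gpus factor comes from
# data-parallel training, so each rank only accumulates its own micro-batches
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()     # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()                # one "update step" per effective batch
        optimizer.zero_grad()
```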

apohllo commented 4 years ago

Thanks.

djstrong commented 4 years ago

Were the same parameters used for fine-tuning the large model?