Yes, we plan to publish the article in a month or so. However, I don't expect the details of training to be particularly surprising or novel to anyone who follows the research on transformer architectures in English :)
When it comes to language model pre-training, most of our corpus comes from CommonCrawl, but we don't just use the raw text extracted from WARC. To produce a high-quality corpus, we apply heavy pre-processing and cleaning that includes:
In the case of LM fine-tuning, the only non-trivial "trick" is a simple resampling technique used to counter the class imbalance in highly imbalanced datasets (CBD, DYK, PSC). Samples of the minority class in the training set are duplicated and/or some samples of the majority class are randomly discarded (see the resample parameter in the Evaluation section of the README).
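For illustration, a rough sketch of that kind of resampling (the function name and ratio parameters here are hypothetical, not the repo's actual implementation):

```python
import random

def resample(samples, labels, minority_label, oversample_ratio=2, keep_majority_prob=0.5):
    """Hypothetical illustration: duplicate minority-class samples and
    randomly discard a fraction of majority-class samples."""
    resampled = []
    for x, y in zip(samples, labels):
        if y == minority_label:
            # duplicate each minority-class sample `oversample_ratio` times
            resampled.extend([(x, y)] * oversample_ratio)
        elif random.random() < keep_majority_prob:
            # keep only a random subset of the majority class
            resampled.append((x, y))
    random.shuffle(resampled)
    return resampled
```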
Regarding language model pre-training, I assume "batch size" refers to the number of samples and "update steps" to the number of batches. What hardware did you use? Was the 30k batch size achieved with many GPUs/TPUs or with gradient accumulation? How many times (epochs) was the whole dataset used for training?
We used 8 x Nvidia V100; the batch size was achieved with gradient accumulation. In the case of the large model, the training reached 25 epochs. I don't remember the specific number for the base model, but it was close to 200 epochs.
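For anyone curious how a large effective batch is reached this way, here is a minimal illustrative PyTorch sketch of gradient accumulation (the toy model, data, and accumulation settings are placeholders, not the actual training code):

```python
import torch
from torch import nn

# Toy stand-in for the LM; effective batch = n_gpus * micro_batch * accum_steps.
model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 16   # hypothetical number of micro-batches per optimizer update
micro_batch = 8

model.train()
optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, 128)              # dummy micro-batch
    y = torch.randint(0, 2, (micro_batch,))
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()                # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one weight update per effective batch
        optimizer.zero_grad()
```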
Thanks.
Were the same parameters used for fine-tuning the large model?
Do you plan to publish the details of the training process? The results are excellent and it would be very beneficial for the research community to know the details of training.