Launched the following filtering configurations for all corpora:
# 0
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type None
# 1
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type document-heuristic
# 2
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type basic
# 3
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type basic+char-@+lang-@ --char-filter-threshold 1.0 --lang-filter-threshold 0.8
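For context, a minimal sketch of what the threshold-based character and language filters might do is below; this is not the actual implementation in the filtering scripts, and the character inventory and the use of fastText for language ID are assumptions.

```python
# Minimal sketch of threshold-based character and language filtering (assumptions, not the real code).
import fasttext

CHAR_FILTER_THRESHOLD = 1.0   # mirrors --char-filter-threshold above
LANG_FILTER_THRESHOLD = 0.8   # mirrors --lang-filter-threshold above

# assumed inventory of characters expected in (lower-cased) Irish text
ALLOWED_CHARS = set("aábcdeéfghiíjklmnoópqrstuúvwxyz0123456789.,;:!?'()-")

lid_model = fasttext.load_model("lid.176.bin")  # off-the-shelf fastText language identifier

def keep(sentence: str) -> bool:
    """Keep a sentence only if it passes both the character and the language filter."""
    chars = [c for c in sentence.lower() if not c.isspace()]
    if not chars:
        return False
    # character filter: fraction of characters drawn from the allowed inventory
    char_ratio = sum(c in ALLOWED_CHARS for c in chars) / len(chars)
    if char_ratio < CHAR_FILTER_THRESHOLD:
        return False
    # language filter: fastText language-ID confidence that the sentence is Irish
    labels, probs = lid_model.predict(sentence.replace("\n", " "))
    return labels[0] == "__label__ga" and probs[0] >= LANG_FILTER_THRESHOLD
```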
> use the WordPiece tokenizer directly, as opposed to the SentencePiece
I'd leave it as a separate experiment.
> If anyone can think of other models to run, please add them below!
Further experiments could explore the --random, --minimum and --wrap parameters of ./split_tokenised_text_into_sentences.py.
There is no need to run all possible combinations. However, when picking the baseline setting in each experiment, we need to consider the presentation in the paper. It will be messy to have different baselines. Picking the current best setting as the baseline would make the results heavily depend on the order in which we carry out each experiment and would require us to describe this order in the paper. It is better to decide on a single baseline based on our experiments so far and then stick with it.
You didn't say whether these new experiments are done with BERT or ELECTRA. Given that good choices for corpus selection, filtering, vocab size etc. for ELECTRA should also be good for BERT, I think in the interest of time and CO2 footprint we should carry out any new experiments with ELECTRA, setting the number of steps such that it uses much less GPU time than BERT with the 500k or 1M setting but still performs well in the dependency parsing task.
I should note that my experiments with mBERT for the tri-training paper suggest that dependency parsing doesn't go very deep into the BERT representations and may therefore not be a great measure of the quality of the representation: I get much better results when using the representations from layer 8 or 9 (of 12) than when using the final 12th layer or the average of the last 4 layers as Straka et al. (2019) did.
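For concreteness, extracting representations from a specific layer with the HuggingFace transformers API looks roughly like this (a sketch only; the parser mentioned above does this internally and differs in details):

```python
# Sketch: pull token representations from a chosen mBERT layer, or average the last 4 layers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

inputs = tokenizer("Tá an aimsir go maith inniu.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
layer_9 = hidden_states[9]             # representations from layer 9
last4_avg = torch.stack(hidden_states[-4:]).mean(dim=0)  # average of the last 4 layers (Straka et al. style)
```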
These experiments are being run with BERT rather than ELECTRA. @jbrry I thought that we needed the TPU to use ELECTRA?
Also, an early motivation of the work is to compare gaBERT with mBERT so it makes sense to run these ablation experiments with gaBERT, and then in a separate experiment directly compare gaBERT (our best setting) with ELECTRA for the same setting.
Interesting point about the layer. But this is in the setting where you use the BERT representations as input to the parser and don't fine-tune BERT, no? Are you saying that other fine-tuning tasks tell us more about which model is "better" for Irish?
> I'd leave it as a separate experiment.
I have already updated the pipeline to use the HuggingFace tokenizer directly (https://github.com/jbrry/wiki-bert-pipeline/commit/48e5d8ba3593b4f18c38a065820c7654e15a97d4 and https://github.com/jbrry/wiki-bert-pipeline/commit/ab48bced1c09ce254c9d6cc9167460d0a65512b3), but I can compare it to last week's run with the conll17 bug fix and revert if it is worse. My expectation is that they should not be that different, in which case switching to HuggingFace Tokenizers would be preferred because it is cleaner to describe and should have better interoperability with other pipelines in the future (and it shouldn't produce those few erroneous entries you mentioned).
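For reference, training a WordPiece vocabulary directly with HuggingFace Tokenizers looks roughly like this (a sketch; file names, vocab size and special tokens are placeholders rather than the pipeline's exact settings, see the linked commits for the actual changes):

```python
# Sketch of training a WordPiece vocab with HuggingFace Tokenizers (placeholder settings).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=["filtered_ga_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("vocab_output_dir")  # writes vocab.txt for BERT pre-training
```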
We could work under the assumption that more data is better and include all corpora. The goal of the filters would then be to screen out the noisy data. This would obviate the need to carry out tests on the effect of each corpus. I can run the 4 filtering configurations mentioned above. We could then use this as our "baseline", i.e. find the best filter configuration and then use that setting for all other, more fine-grained experiments.
Alternatively, instead of using the filters to decide which texts to use, we could manually determine which corpora are helpful by adding one corpus per run.
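For illustration, such runs could reuse the gather_external_data.py interface shown above (the corpus order here is arbitrary, just to show the idea):

python external_scripts/gather_external_data.py --datasets NCI --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive oscar --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive oscar paracrawl --input-type processed --filter-type basic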
Considering other combinations may be too time-consuming. If we think that the filters should decide what's included we could skip most of these except our gold-standard NCI corpus (as a side experiment to see how well a clean corpus can do by itself). But yes, seeing how much each corpus adds or detracts would be interesting regardless.
Try sizes: 10k, 20k, 30k etc.
Yes, it would be interesting to see the role of sentence-splitting, e.g. does BERT need to see properly formed sentences or are snippets of text sufficient (for our tasks)?
@fosterjen @jowagner Yes, so far these are going to be run with ga_BERT (the pipeline is not set up to work with ELECTRA just yet, but it wouldn't take too much work to also prepare the pretraining data for ELECTRA). Good point about early motivation to compare with mBERT. @fosterjen I haven't tried ELECTRA on GPU yet but it should not be a problem (it will just use smaller hyperparameters than on the TPU).
> Interesting point about the layer. But this is in the setting where you use the BERT representations as input to the parser and don't fine-tune BERT, no?
Yes, my parser does not fine-tune mBERT but extracts an input representation from the selected layers as is. It's quite possible that fine-tuning BERT for dependency parsing changes the top layers in such a way as to make the information from the lower layers available to the parser and also to outsource some of the processing that normally happens in the Bi-LSTM layers of the parser to the top layers of BERT. A way to test this would be to re-initialise the parameters of the top 3 or so layers randomly, freeze the BERT layers that have not been re-initialised, train the top layers on the fine-tuning task, unfreeze BERT and fine-tune all layers as usual. If this performs just as well as the standard procedure this would mean that the information in the top layers from pre-training does not contribute anything to the final task.
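A rough sketch of that probe with PyTorch and HuggingFace transformers (the layer indices, re-initialisation call and freezing strategy here are illustrative; a real experiment would also need care with the optimiser and learning-rate schedule):

```python
# Sketch: re-initialise the top k encoder layers, freeze the rest, train the new layers
# plus the task head, then unfreeze everything and fine-tune end to end as usual.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-multilingual-cased")
num_layers = model.config.num_hidden_layers  # 12 for mBERT
top_k = 3

# 1) re-initialise the top k transformer layers
for layer in model.encoder.layer[num_layers - top_k:]:
    layer.apply(model._init_weights)  # re-use the model's own weight-initialisation scheme

# 2) freeze the embeddings and the untouched lower layers
model.embeddings.requires_grad_(False)
for layer in model.encoder.layer[:num_layers - top_k]:
    layer.requires_grad_(False)

# ... train the re-initialised layers and the parser head on the fine-tuning task ...

# 3) unfreeze all of BERT and continue fine-tuning as in the standard procedure
for p in model.parameters():
    p.requires_grad = True
```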
> Are you saying that other fine-tuning tasks tell us more about which model is "better" for Irish?
Ideally, we should look at multiple tasks during development and then test the final model on even more tasks. Using just one task in development can be OK if the task is a good proxy for measuring model quality. Dependency parsing LAS sounds like a good proxy to me. Of course, it is also a question of how much work it is and how easy it will be for our audience to understand.
If the top layers of BERT do contain useful information for other tasks but dependency parsing LAS does not capture the quality of this information because it is too low-level, it may be useful to include a higher-level task such as NLI or QA. Do we have test sets for such tasks?
Another possibility is that it is simply normal for the top layers not to be optimal for downstream tasks without fine-tuning, and that nobody has thought of removing those layers before sharing models, which would make the download ~25% smaller, and telling people to add the fresh layers needed for their task(s) themselves.
https://github.com/jbrry/Irish-BERT/issues/62#issuecomment-832077172 suggests that the new vocab throws out many Irish words and subword units to make space for a large set of foreign letters, English ordinal numbers and emoticons. A WordPiece vs SentencePiece experiment with all other parameters fixed will be useful.
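One quick sanity check before a full pre-training run would be to compare the two vocabularies by average subword fertility on held-out Irish text; a sketch (the file names are placeholders):

```python
# Sketch: compare two BERT vocab files by average number of wordpieces per word.
from transformers import BertTokenizer

def fertility(vocab_file: str, sentences: list) -> float:
    tok = BertTokenizer(vocab_file=vocab_file, do_lower_case=False)
    n_words = sum(len(s.split()) for s in sentences)
    n_pieces = sum(len(tok.tokenize(s)) for s in sentences)
    return n_pieces / n_words  # lower = fewer splits per word, i.e. better vocabulary coverage

sentences = open("ga_heldout.txt", encoding="utf-8").read().splitlines()
print("SentencePiece-converted vocab:", fertility("vocab_spm_converted.txt", sentences))
print("WordPiece vocab:", fertility("vocab_wordpiece.txt", sentences))
```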
I made a backup of last week's directory (which includes the conll17 tokeniser fix), deleted all files created after the corpus-filtering step and re-ran the pipeline. This means the vocab file will now be generated by the WordPiece tokenizer and the pre-training data will be created using this vocab file, but the input sentences will be exactly the same. The BERT model for this configuration is at checkpoint 311/500, so I should have the results of this experiment fairly soon.
In today's meeting we decided that the cross-product of all settings would create too many experiments to have them all ready with enough time for analysis of the final model. We will make decisions step by step instead:
Update: Given time constraints and the ability to run jobs in parallel, we combine steps 3 and 4:
If we can get more than 3 GPUs on the night of the 27th, also include:
Then, if time permits, e.g. if using ELECTRA 24h/12h, we can decide what else to include. Here is a list (not discussed in the meeting):
We aim to finish development in 8 days, i.e. on the 1st of June rather than the 21st of May (pushed out for the EUD shared task submission), to start training the final bert-512 model.
Recent e-mails suggest that step 5 has been moved to before the NCI and WordPiece experiments and that there is a further delay in completing the experiment.
In the meeting today, we discussed that a final version of the model should not contain a bug such as the conll17 tokenisation bug (#66). A subsequent experiment was carried out which showed that the bug had only a minor impact on LAS/UPOS/FEATS scores (see results here), but this tokenisation bug may harm the Cloze test experiments.
As such, we are going to do a fresh re-run of many of the models. For experimental purposes, we will run for 500k steps with ga_BERT, as that checkpoint achieves scores similar to the final 1M-step model.
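For reference, assuming the standard google-research/bert run_pretraining.py script is used, a 500k-step run corresponds roughly to an invocation like the following (paths, batch size and the other hyperparameters are placeholders, not the exact settings used):

python run_pretraining.py --input_file=pretraining_data/*.tfrecord --output_dir=ga_bert_500k --do_train=True --bert_config_file=bert_config.json --train_batch_size=128 --max_seq_length=128 --max_predictions_per_seq=20 --num_train_steps=500000 --num_warmup_steps=10000 --learning_rate=1e-4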
The models we are going to run include:
I think the wiki-bert-pipeline should be adjusted to use the WordPiece tokenizer directly, as opposed to the SentencePiece-to-WordPiece conversion. Or else this could be a separate run in itself (e.g. comparing the two tokenizers).
If anyone can think of other models to run, please add them below!