Launched the following filtering configurations for all corpora:
# 0
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type None
# 1
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type document-heuristic
# 2
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type basic
# 3
python external_scripts/gather_external_data.py --datasets conll17 gdrive NCI oscar paracrawl --input-type processed --filter-type basic+char-@+lang-@ --char-filter-threshold 1.0 --lang-filter-threshold 0.8
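For context, a minimal sketch of what the threshold-based character and language filters might do is below; this is not the actual implementation in the filtering scripts, and the character inventory and the use of fastText for language ID are assumptions.

```python
# Minimal sketch of threshold-based character and language filtering (assumptions, not the real code).
import fasttext

CHAR_FILTER_THRESHOLD = 1.0   # mirrors --char-filter-threshold above
LANG_FILTER_THRESHOLD = 0.8   # mirrors --lang-filter-threshold above

# assumed inventory of characters expected in (lower-cased) Irish text
ALLOWED_CHARS = set("aábcdeéfghiíjklmnoópqrstuúvwxyz0123456789.,;:!?'()-")

lid_model = fasttext.load_model("lid.176.bin")  # off-the-shelf fastText language identifier

def keep(sentence: str) -> bool:
    """Keep a sentence only if it passes both the character and the language filter."""
    chars = [c for c in sentence.lower() if not c.isspace()]
    if not chars:
        return False
    # character filter: fraction of characters drawn from the allowed inventory
    char_ratio = sum(c in ALLOWED_CHARS for c in chars) / len(chars)
    if char_ratio < CHAR_FILTER_THRESHOLD:
        return False
    # language filter: fastText language-ID confidence that the sentence is Irish
    labels, probs = lid_model.predict(sentence.replace("\n", " "))
    return labels[0] == "__label__ga" and probs[0] >= LANG_FILTER_THRESHOLD
```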
> use the WordPiece tokenizer directly, as opposed to the SentencePiece
I'd leave it as a separate experiment.
> If anyone can think of other models to run, please add them below!
Further experiments could explore the --random, --minimum and --wrap parameters of ./split_tokenised_text_into_sentences.py.
There is no need to run all possible combinations. However, when picking the baseline setting in each experiment, we need to consider the presentation in the paper. It will be messy to have different baselines. Picking the current best setting as the baseline would make the results heavily depend on the order in which we carry out each experiment and would require us to describe this order in the paper. It is better to decide on a single baseline based on our experiments so far and then stick with it.
You didn't say whether these new experiments are done with BERT or ELECTRA. Given that good choices for corpus selection, filtering, vocab size etc. for ELECTRA should also be good for BERT, I think in the interest of time and CO2 footprint we should carry out any new experiments with ELECTRA, setting the number of steps such that it uses much less GPU time than BERT with the 500k or 1M setting but still performs well in the dependency parsing task.
I should note that my experiments with mBERT for the tri-training paper suggest that dependency parsing doesn't go very deep into the BERT representations and may therefore not be a great measure of the quality of the representation: I get much better results when using the representations from layer 8 or 9 (of 12) than when using the final 12th layer or the average of the last 4 layers as Straka et al. (2019) did.
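For concreteness, extracting representations from a specific layer with the HuggingFace transformers API looks roughly like this (a sketch only; the parser mentioned above does this internally and differs in details):

```python
# Sketch: pull token representations from a chosen mBERT layer, or average the last 4 layers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

inputs = tokenizer("Tá an aimsir go maith inniu.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
layer_9 = hidden_states[9]             # representations from layer 9
last4_avg = torch.stack(hidden_states[-4:]).mean(dim=0)  # average of the last 4 layers (Straka et al. style)
```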
These experiments are being run with BERT rather than ELECTRA. @jbrry I thought that we needed the TPU to use ELECTRA?
Also, an early motivation of the work is to compare gaBERT with mBERT so it makes sense to run these ablation experiments with gaBERT, and then in a separate experiment directly compare gaBERT (our best setting) with ELECTRA for the same setting.
Interesting point about the layer. But this is in the setting where you use the BERT representations as input to the parser and don't fine-tune BERT, no? Are you saying that other fine-tuning tasks tell us more about which model is "better" for Irish?
> I'd leave it as a separate experiment.
I have already updated the pipeline to use the HuggingFace tokenizer directly (https://github.com/jbrry/wiki-bert-pipeline/commit/48e5d8ba3593b4f18c38a065820c7654e15a97d4 and https://github.com/jbrry/wiki-bert-pipeline/commit/ab48bced1c09ce254c9d6cc9167460d0a65512b3), but I can compare it to last week's run with the conll17 bug fix and revert if it is worse. My expectation is that they should not be that different, in which case switching to HuggingFace Tokenizers would be preferred because it is cleaner to describe and should have better interoperability with other pipelines in the future (and it shouldn't produce those few erroneous entries you mentioned).
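For reference, training a WordPiece vocabulary directly with HuggingFace Tokenizers looks roughly like this (a sketch; file names, vocab size and special tokens are placeholders rather than the pipeline's exact settings, see the linked commits for the actual changes):

```python
# Sketch of training a WordPiece vocab with HuggingFace Tokenizers (placeholder settings).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=["filtered_ga_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("vocab_output_dir")  # writes vocab.txt for BERT pre-training
```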
We could work under the assumption that more data is better and include all corpora. The goal of the filters would then be to screen out the noisy data. This would obviate the need to carry out tests on the effect of each corpus. I can run the 4 filtering configurations mentioned above. We could then use this as our "baseline", i.e. find the best filter configuration and then use that setting for all other, more fine-grained experiments.
Alternatively, instead of using the filters to decide which texts to use, we could manually determine which corpora are helpful by adding one corpus per run.
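For illustration, such runs could reuse the gather_external_data.py interface shown above (the corpus order here is arbitrary, just to show the idea):

python external_scripts/gather_external_data.py --datasets NCI --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive oscar --input-type processed --filter-type basic
python external_scripts/gather_external_data.py --datasets NCI conll17 gdrive oscar paracrawl --input-type processed --filter-type basic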
Considering other combinations may be too time-consuming. If we think that the filters should decide what's included we could skip most of these except our gold-standard NCI corpus (as a side experiment to see how well a clean corpus can do by itself). But yes, seeing how much each corpus adds or detracts would be interesting regardless.
Try sizes: 10k, 20k, 30k etc.
Yes, it would be interesting to see the role of sentence-splitting, e.g. does BERT need to see properly formed sentences or are snippets of text sufficient (for our tasks)?
@fosterjen @jowagner Yes, so far these are going to be run with ga_BERT (the pipeline is not set up to work with ELECTRA just yet, but it wouldn't take too much work to also prepare the pretraining data for ELECTRA). Good point about early motivation to compare with mBERT. @fosterjen I haven't tried ELECTRA on GPU yet but it should not be a problem (it will just use smaller hyperparameters than on the TPU).
> Interesting point about the layer. But this is in the setting where you use the BERT representations as input to the parser and don't fine-tune BERT, no?
Yes, my parser does not fine-tune mBERT but extracts an input representation from the selected layers as is. It's quite possible that fine-tuning BERT for dependency parsing changes the top layers in such a way as to make the information from the lower layers available to the parser and also to outsource some of the processing that normally happens in the Bi-LSTM layers of the parser to the top layers of BERT. A way to test this would be to re-initialise the parameters of the top 3 or so layers randomly, freeze the BERT layers that have not been re-initialised, train the top layers on the fine-tuning task, unfreeze BERT and fine-tune all layers as usual. If this performs just as well as the standard procedure this would mean that the information in the top layers from pre-training does not contribute anything to the final task.
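A rough sketch of that probe with PyTorch and HuggingFace transformers (the layer indices, re-initialisation call and freezing strategy here are illustrative; a real experiment would also need care with the optimiser and learning-rate schedule):

```python
# Sketch: re-initialise the top k encoder layers, freeze the rest, train the new layers
# plus the task head, then unfreeze everything and fine-tune end to end as usual.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-multilingual-cased")
num_layers = model.config.num_hidden_layers  # 12 for mBERT
top_k = 3

# 1) re-initialise the top k transformer layers
for layer in model.encoder.layer[num_layers - top_k:]:
    layer.apply(model._init_weights)  # re-use the model's own weight-initialisation scheme

# 2) freeze the embeddings and the untouched lower layers
model.embeddings.requires_grad_(False)
for layer in model.encoder.layer[:num_layers - top_k]:
    layer.requires_grad_(False)

# ... train the re-initialised layers and the parser head on the fine-tuning task ...

# 3) unfreeze all of BERT and continue fine-tuning as in the standard procedure
for p in model.parameters():
    p.requires_grad = True
```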
> Are you saying that other fine-tuning tasks tell us more about which model is "better" for Irish?
Ideally, we should look at multiple tasks during development and then test the final model on even more tasks. Using just one task in development can be OK if the task is a good proxy for measuring model quality. Dependency parsing LAS sounds like a good proxy to me. Of course, it is also a question of how much work it is and how easy it will be for our audience to understand.
If the top layers of BERT do contain useful information for other tasks but dependency parsing LAS does not capture the quality of this information because it is too low-level, it may be useful to include a higher-level task such as NLI or QA. Do we have test sets for such tasks?
Another possibility is that it is simply normal for the top layers not to be optimal for downstream tasks without fine-tuning, and that nobody has thought of removing those layers before sharing models, which would make the download ~25% smaller, and telling people to add the fresh layers needed for their task(s) themselves.
https://github.com/jbrry/Irish-BERT/issues/62#issuecomment-832077172 suggests that the new vocab throws out many Irish words and subword units to make space for a large set of foreign letters, English ordinal numbers and emoticons. A WordPiece vs SentencePiece experiment with all other parameters fixed will be useful.
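One quick sanity check before a full pre-training run would be to compare the two vocabularies by average subword fertility on held-out Irish text; a sketch (the file names are placeholders):

```python
# Sketch: compare two BERT vocab files by average number of wordpieces per word.
from transformers import BertTokenizer

def fertility(vocab_file: str, sentences: list) -> float:
    tok = BertTokenizer(vocab_file=vocab_file, do_lower_case=False)
    n_words = sum(len(s.split()) for s in sentences)
    n_pieces = sum(len(tok.tokenize(s)) for s in sentences)
    return n_pieces / n_words  # lower = fewer splits per word, i.e. better vocabulary coverage

sentences = open("ga_heldout.txt", encoding="utf-8").read().splitlines()
print("SentencePiece-converted vocab:", fertility("vocab_spm_converted.txt", sentences))
print("WordPiece vocab:", fertility("vocab_wordpiece.txt", sentences))
```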
I made a backup of last week's directory (which includes the conll17 tokeniser fix), deleted all files created after the corpus-filtering step and re-ran the pipeline. This means the vocab file will now be generated by the WordPiece tokenizer and the pre-training data will be created using this vocab file, but the input sentences will be exactly the same. The BERT model for this configuration is at checkpoint 311/500, so I should have the results of this experiment fairly soon.
In today's meeting we decided that the cross-product of all settings would create too many experiments to have them all ready with enough time for analysis of the final model. We will make decisions step by step instead:
Update: Given time constraints and the ability to run jobs in parallel, we combine steps 3 and 4:
If we can get more than 3 GPUs on the night of the 27th, also include:
Then, if time permits, e.g. if using ELECTRA 24h/12h, we can decide what else to include. Here is a list (not discussed in the meeting):
We aim to finish development in 8 days, i.e. on the 1st of June rather than the 21st of May (pushed out for the EUD shared task submission), to start training the final bert-512 model.
Recent e-mails suggest that step 5 has been moved to before the NCI and WordPiece experiments and that there is a further delay in completing the experiment.
In the meeting today, we discussed that a final version of the model should not contain a bug such as the conll17 tokenisation bug (#66). A subsequent experiment was carried out which showed that the bug had only a minor impact on LAS/UPOS/FEATS scores (see results here), but this tokenisation bug may harm the Cloze test experiments.
As such, we are going to do a fresh re-run of many of the models. For experimental purposes, we will run for 500k steps with ga_BERT, as that checkpoint achieves scores similar to the final 1M-step model.
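For reference, assuming the standard google-research/bert run_pretraining.py script is used, a 500k-step run corresponds roughly to an invocation like the following (paths, batch size and the other hyperparameters are placeholders, not the exact settings used):

python run_pretraining.py --input_file=pretraining_data/*.tfrecord --output_dir=ga_bert_500k --do_train=True --bert_config_file=bert_config.json --train_batch_size=128 --max_seq_length=128 --max_predictions_per_seq=20 --num_train_steps=500000 --num_warmup_steps=10000 --learning_rate=1e-4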
The models we are going to run include:
I think the wiki-bert-pipeline should be adjusted to use the WordPiece tokenizer directly, as opposed to the SentencePiece-to-WordPiece conversion. Or else this could be a separate run in itself (e.g. comparing the two tokenizers).
If anyone can think of other models to run, please add them below!