Upgrade to Paracrawl v9

jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Upgrade to Paracrawl v9 #77

Open jowagner opened 3 years ago

jowagner commented 3 years ago

Filtering has improved, including a filter that tries to remove MT output (though there is still a lot of MT output that an Irish speaker can easily spot).

jbrry commented 3 years ago

Update: We have decided to carry out the 4 filtering experiments on v7 and will apply the best setting when using v8.

jowagner commented 3 years ago

Do you mean before measuring the effect of leaving out the NCI? Did you discuss this with Jennifer? (Who is "we" in "We have decided" here?) Given the tight time frame, I wouldn't change the agreed plan without top-level approval, especially not if ELECTRA cannot speed up the development phase.

edit: typo opt --> top

jbrry commented 3 years ago

This would be before the leave-NCI-out experiment but would correspond to the 4 filtering experiments we decided to just use regular BERT for.

My response was based on our recent email thread:

I guess this depends on how long it takes to re-run the filtering configurations. If it takes more than a day, my suggestion would be to start pre-training transformer models (bert or electra, see question above), then, while those are running, run the 4 filtering configurations with paracrawl v8, evaluate the 4 transformer models that use v7 and train a v8 model using the best filtering configuration according to v7.

So my understanding was that we run each filtering setting with all corpora (using ParaCrawl v7) and BERT, and then apply the best setting to ParaCrawl v8. At the same time, we would train the 5th model, ELECTRA, and evaluate its 12h/24h performance using one of the filter settings (I'm using OpusFilter-BasicCharLang).

jowagner commented 3 years ago

That's fine up to the point where a v8 transformer model is trained. This must not put the NCI, WordPiece and vocab size experiments at risk for the 1st of June deadline. If ELECTRA can speed up development experiments, we will be comfortable squeezing in a v8 experiment; otherwise, let's hold off on this until the agreed experiments have been completed.

jbrry commented 3 years ago

I agree completely. I think it's better to have our experimental setup consistent, e.g. stick with v7 for all initial experiments:

filtering
with and without NCI
WordPiece vs SentencePiece
vocab sizes

Once we have carried out all of these experiments, we can add the v8 corpus in for our final model. Or squeeze in an all+v7 vs all+v8 experiment to see how much adding v8 helps.

jowagner commented 3 years ago

Another advantage of staying with v7 for the moment is that we know it works with our pipeline.

fosterjen commented 3 years ago

I agree. Let's stick with v7 for now.

jowagner commented 2 years ago

On the data quality of ParaCrawl (and how to evaluate it): Ramírez-Sánchez et al. (2022), "Human evaluation of web-crawled parallel corpora for machine translation".