Open jowagner opened 3 years ago
Update: We have decided to carry out the 4 filtering experiments on v7 and will apply the best setting when using v8.
Do you mean before measuring the effect of leaving out the NCI? Did you discuss this with Jennifer? (Who is "we" in "We have decided" here?) Given the tight time frame I wouldn't change the agreed plan, especially not if electra cannot speed up the development phase, without top-level approval.
edit: typo opt --> top
This would be before the leave-NCI-out experiment but would correspond to the 4 filtering experiments we decided to just use regular BERT for.
My response was based on our recent email thread:
I guess this depends on how long it takes to re-run the filtering configurations. If it takes more than a day, my suggestion would be to start pre-training transformer models (bert or electra, see question above), then, while those are running, run the 4 filtering configurations with ParaCrawl v8, evaluate the 4 transformer models that use v7 and train a v8 model using the best filtering configuration according to v7.
So my understanding was that we run each filtering setting with all corpora (using ParaCrawl v7) and BERT, and that the best of these settings would then be applied to ParaCrawl v8. At the same time, we would train the 5th model, ELECTRA, and evaluate its 12h/24h performance using one of the filter settings (I'm using OpusFilter-BasicCharLang).
That's fine up to the point where a v8 transformer model is trained. This must not put the NCI, WordPiece and vocab size experiments at risk for the 1st of June deadline. If electra can speed up development experiments we will be comfortable squeezing in a v8 experiment, but otherwise let's hold off on this until after the agreed experiments have been completed.
I agree completely. I think it's better to have our experimental setup consistent, e.g. stick with v7 for all initial experiments:
Once we have carried out all of these experiments, we can add the v8 corpus in for our final model. Or squeeze in an all+v7 vs all+v8 experiment to see how much adding v8 helps.
Another advantage of staying with v7 for the moment is that we know it works with our pipeline.
I agree. Let's stick with v7 for now.
On the data quality of ParaCrawl (and how to evaluate it): Ramírez-Sánchez et al. (2022), "Human evaluation of web-crawled parallel corpora for machine translation".
Filtering has improved, including a filter that tries to remove MT output (but there is still a lot of MT output that is easy for an Irish speaker to spot).