hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
https://pypi.org/project/opuscleaner/

[Discussion] Filter that checks for numerical sequences? #69

Closed: XapaJIaMnu closed this issue 1 year ago

XapaJIaMnu commented 1 year ago

Do we want a filter that checks for the presence of numerical sequences on both sides? Looking through CCAligned, there are some places where numbers are present on one side but absent on the other, which suggests that the two sentences are not parallel. There are also legitimate cases where the numbers differ between the two sides (currency conversions, imperial-metric system shenanigans, etc.).

Do we have a rule for that somewhere in bicleaner? Has anyone experimented with that, @jelmervdl @ZJaume?

ZJaume commented 1 year ago

Bicleaner Hardrules already has that rule (disabled by default); we could use it.

jelmervdl commented 1 year ago

I also have a stand-alone filter that does the same thing with some wiggle room.

I also have an attempt at one that fixes the numbers when there's a mismatch, so we don't have to throw the pair away, but that was a bit of a failure. Too hard to get right.
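
For illustration only, the kind of check discussed in this thread could be sketched roughly like this. This is not the Bicleaner Hardrules rule and not the stand-alone filter mentioned above; it is a minimal sketch under my own assumptions. The names `NUMBER_RE`, `extract_numbers`, `numbers_match` and the `max_mismatch_ratio` parameter are hypothetical, and "wiggle room" is interpreted here as a tolerated fraction of numbers that fail to match.

```python
import re
from collections import Counter

# Assumption: a "numerical sequence" is a run of digits, optionally with
# internal '.' or ',' separators (e.g. "1,000" or "3.14").
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def extract_numbers(text: str) -> Counter:
    """Return a multiset of the numbers in text, with separators stripped
    so that "1,000" and "1.000" compare as equal."""
    return Counter(re.sub(r"[.,]", "", num) for num in NUMBER_RE.findall(text))


def numbers_match(src: str, trg: str, max_mismatch_ratio: float = 0.0) -> bool:
    """Return True if the numbers on both sides agree closely enough.

    max_mismatch_ratio is the hypothetical wiggle room: the fraction of
    numbers allowed to appear on one side only.
    """
    src_nums = extract_numbers(src)
    trg_nums = extract_numbers(trg)
    # Size of the multiset union: every distinct occurrence on either side.
    total = sum((src_nums | trg_nums).values())
    if total == 0:
        return True  # no numbers on either side, nothing to compare
    # Numbers present on one side but missing on the other.
    mismatched = sum(((src_nums - trg_nums) + (trg_nums - src_nums)).values())
    return mismatched / total <= max_mismatch_ratio
```

With the default of `0.0` any number missing from one side rejects the pair, so a unit conversion such as "10 km" vs. "6 miles" would be dropped; raising `max_mismatch_ratio` trades some of that strictness for recall.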

XapaJIaMnu commented 1 year ago

OK, so we already have it and I just didn't find it, so I think we can close this.