Whitespace normalization filter

hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

https://pypi.org/project/opuscleaner/


Whitespace normalization filter #128

Closed jindrahelcl closed 11 months ago

jindrahelcl commented 11 months ago

Strips leading and trailing whitespace, and optionally collapses each run of whitespace into a single space.
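As a rough illustration of the behaviour described above (not the code from this pull request), the transform could look like the sketch below. The function name `normalize_whitespace` and the `collapse` parameter are made up for this example.

```python
import re

# Matches any run of one or more whitespace characters.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str, collapse: bool = True) -> str:
    """Strip both ends; optionally collapse inner whitespace runs.

    Hypothetical helper for illustration only, not OpusCleaner's filter.
    """
    text = text.strip()
    if collapse:
        # Replace every remaining whitespace run with a single space.
        text = _WHITESPACE_RUN.sub(" ", text)
    return text
```

Note that a line consisting only of whitespace comes out as an empty string, which is the case the rest of the thread is about.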

XapaJIaMnu commented 11 months ago

So what happens when a line contains only a single whitespace character (or a few)? This filter can then produce an empty line (or an empty field). Ideally, I think every filter should handle the case where they could introduce empty lines. @jelmervdl thoughts?

ZJaume commented 11 months ago

Bifixer handles this and removes empty lines in case normalization produces one.

jindrahelcl commented 11 months ago

Well, in practice it does throw out the empty line, which is weird because it shouldn't.

jindrahelcl commented 11 months ago

(Somewhat related thought) It might make diffing simpler if there were "filters" that guarantee they output the same number of lines as they receive.

jelmervdl commented 11 months ago

At every corner I've been running into the issue of not knowing whether a filter just transforms, just filters, or does both. I wish I had made a clearer distinction between the two, like OpusFilter has done. That would have made the interface, the diff, and writing wrappers for them much easier…

I'm going to open an issue about this and will probably never refactor it, but wishful thinking.

Edit: now written down in #130

jindrahelcl commented 11 months ago

So should I turn this into a filter-transform hybrid or leave it as a pure-transform step?

jindrahelcl commented 11 months ago

So now the filter is rewritten as a monolingual filter, which must preserve the number of lines. I.e. it just strips whitespace from both ends and optionally collapses whitespace in the middle.

jelmervdl commented 11 months ago

> Ideally, I think every filter should handle the case where they could introduce empty lines.

I think that's painful, because then every filter that does a transform can also be a filter that removes lines. Or in terms of #130, every type 1 filter could also be a type 3 filter.

Right now we support monolingual filters that don't have to care about the column parsing, and those are wrapped with col.py to do that bit for them. This only works if the wrapped monolingual filter guarantees to return an equal number of lines, which it can't if it has to drop empty lines.
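To make the line-count constraint concrete, here is a minimal in-process sketch of such a column wrapper. It is not the actual `col.py`; the name `apply_to_column` and the idea of passing the monolingual filter as a Python callable are assumptions made for this example.

```python
from typing import Callable, Iterable, List


def apply_to_column(
    rows: Iterable[str],
    column: int,
    mono_filter: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run a monolingual filter over one tab-separated column.

    Hypothetical stand-in for a column wrapper: the filter only ever
    sees the text of a single column, never the other fields.
    """
    table = [row.rstrip("\n").split("\t") for row in rows]
    before = [fields[column] for fields in table]
    after = mono_filter(before)

    # Results are matched back to rows purely by position, so a filter
    # that drops (or adds) lines would misalign the columns.
    if len(after) != len(before):
        raise ValueError(
            f"filter returned {len(after)} lines for {len(before)} inputs"
        )

    for fields, value in zip(table, after):
        fields[column] = value
    return ["\t".join(fields) for fields in table]
```

With a pure transform such as `lambda lines: [" ".join(l.split()) for l in lines]` the rows stay aligned even when a field becomes empty; a filter that removes empty lines trips the length check instead.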

I can see value in a (default, but optional) option that makes opuscleaner-clean always produce output that has no empty lines and no empty fields. But let's just implement that as a default filter at the end, not something every filter has to take into account.
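A final clean-up step along those lines might be sketched as follows. This is only an illustration under the assumption that rows are tab-separated; `drop_empty_rows` is a hypothetical name, not an existing OpusCleaner filter or option.

```python
from typing import Iterable, Iterator


def drop_empty_rows(rows: Iterable[str]) -> Iterator[str]:
    """Yield only rows in which every tab-separated field has content.

    Hypothetical end-of-pipeline filter: fully empty lines and rows
    with any empty (or whitespace-only) field are dropped.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if all(field.strip() for field in fields):
            yield row
```

Because it runs once at the end, the individual transforms before it can keep their one-line-in, one-line-out guarantee.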

XapaJIaMnu commented 11 months ago

> Bifixer handles this and removes empty lines in case normalization produces one.

But do we want to run everything through Bifixer? I think every filter should try its best to also throw away the resulting empty (or half-empty) lines.

> I can see value in a (default, but optional) option that makes opuscleaner-clean always produce output that has no empty lines and no empty fields. But let's just implement that as a default filter at the end, not something every filter has to take into account.

Yes, basically a hidden default filter that does that is better than relying on Bifixer.

jindrahelcl commented 11 months ago

I think conceptually it is better if filters only do their own thing and don't each copy the logic for removing empty lines. If efficiency is a problem because a filter creates a lot of empty lines, either the user knows about it, or an implicit filter can be prepended to each step that simply won't pass a line with an empty field.