mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0
154 stars 34 forks

Investigate merging document sentences in HPLT #923

Open eu9ene opened 6 days ago

eu9ene commented 6 days ago

We now have an implementation of HPLT 1.2 mono importer that can merge multiple lines from a document until it reaches a threshold of a maximum number of words or characters.
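As a rough illustration of the merging strategy described above (the function and parameter names here are hypothetical, not the importer's actual API), a greedy pass can join consecutive document lines until a character budget is reached:

```python
def merge_lines(lines, max_chars=600):
    """Greedily join consecutive lines into multi-sentence segments
    no longer than max_chars (illustrative sketch, not the real importer)."""
    buffer = []
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # +1 accounts for the joining space between lines
        if buffer and length + len(line) + 1 > max_chars:
            yield " ".join(buffer)
            buffer, length = [], 0
        buffer.append(line)
        length += len(line) + (1 if length else 0)
    if buffer:
        yield " ".join(buffer)
```

The same shape works for a word-count threshold by counting `len(line.split())` instead of characters.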

The idea is to provide the model with multi-sentence training examples so that we can do the same at inference time. This would give the model more context and improve translation quality on web pages.

It will be based on the maximum number of characters and will be disabled by default in this PR.

We should investigate: 1) Is it feasible to achieve? 2) What are the potential drawbacks? 3) Do other datasets have multi-sentence examples, and how many? 4) If this can work, what line length distribution do we want to get from this dataset? Do we need to change the implementation for that?

@gregtatum @ZJaume please add your thoughts on this as well.

eu9ene commented 6 days ago

To clarify, we currently use HPLT for both back-translations and knowledge distillation, in both translation directions.

ZJaume commented 5 days ago

Is it feasible to achieve?

Yes, but I think this opens a whole new area of investigation: paragraph-level and document-level.

What are the potential drawbacks?

If the model that generates the back-translations has learnt that it is acceptable to omit one of the sentences, or to simply stop decoding after the first one (which is likely to happen), then generating back-translations from multi-sentence input will just reinforce that behaviour. So we first have to check that the back-translation model is not doing that.

The way to avoid this behaviour is to add to the training data, or fine-tune with, corpora that have paragraph-level instances. The only one I know of is Europarl (and maybe MaCoCu, but that will require some work). I think it may be difficult to find or create parallel corpora with correctly aligned paragraph-level or document-level examples, but to learn to handle more than one sentence reliably, just a few tens of thousands of translation instances could be enough.

Do other datasets have multi-sentence examples and how many?

All the training corpora from websites for LLMs (RedPajama, CulturaX, FineWeb, HPLT...) have multi-sentence examples because no one processes them with sentence splitters. The only reason a newline would appear in a document is that the HTML parser found the text in different HTML elements.

Note that newscrawl was not made for LLM training, which is why it is sentence-level: the dataset is processed with sentence splitters. There are recent versions for German and Czech that preserve paragraph and document integrity.
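One rough way to answer the "how many?" part of the question is a heuristic pass over a corpus sample, counting lines that contain sentence-final punctuation followed by more text. This is an illustrative sketch, not the project's tooling, and the regex is a deliberately simple assumption:

```python
import re

# Sentence-final punctuation (optionally followed by closing quotes or
# brackets), then whitespace, then more text on the same line.
SENT_END = re.compile(r"[.!?][\"')\]]*\s+\S")

def multi_sentence_ratio(lines):
    """Return the fraction of non-empty lines that look multi-sentence
    (rough heuristic for estimating a corpus sample)."""
    total = multi = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        if SENT_END.search(line):
            multi += 1
    return multi / total if total else 0.0
```

Abbreviations like "e.g. this" will inflate the count, so this only gives a ballpark figure per dataset.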

Do we need to change the implementation for that?

I think the current implementation will need to be done in a different way. If the purpose is paragraph-level translation, I would not change the monolingual datasets, as the sentences that already appear on a single line are the ones most likely to be contiguous. But if the purpose is to go document-level, I would do the same for paragraphs and replace the newlines appearing in the datasets with a special token like __sep__. That way the model would learn the difference between the two levels.
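A minimal sketch of that document-level idea, assuming a `__sep__` separator token and a character budget (both the token and the function name are illustrative, not an agreed-upon convention):

```python
def document_to_training_example(paragraphs, sep_token="__sep__", max_chars=1000):
    """Join a document's paragraphs with sep_token so the model can learn
    paragraph boundaries, truncating at a paragraph boundary once the
    character budget is exceeded (illustrative sketch)."""
    out = []
    length = 0
    for p in paragraphs:
        p = p.strip()
        if not p:
            continue
        # New paragraph plus " sep_token " joiner if the example is non-empty
        extra = len(p) + (len(sep_token) + 2 if out else 0)
        if out and length + extra > max_chars:
            break
        out.append(p)
        length += extra
    return f" {sep_token} ".join(out)
```

For this to work end to end, the separator would have to be a reserved token in the vocabulary so the model passes it through translation intact.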

Maybe we need a new issue to talk in general about doc/paragraph-level :sweat_smile: .

gregtatum commented 4 days ago

Yeah, this is probably a wider discussion about how we want to design the inference engine as well. I would like to move towards a larger paragraph model of translation, where consecutive sentences on the page are translated in the same translation buffer. That way shorter sentences and single words would have more context for what they mean. If we use separator tokens we can also reconstruct things correctly at the page level. This feels like it needs a higher-level design document that takes both the training side and the inference side into consideration.
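To make the reconstruction idea concrete, here is a hypothetical sketch of the inference side: neighbouring text nodes are joined with a separator token, translated as one buffer, and the output is split back so each translation returns to its node. The `translate()` stub and the `__sep__` token are assumptions for illustration, not the engine's API:

```python
import re

SEP = "__sep__"

def translate(text):
    # Stand-in for the real engine: reverses each word but, like a model
    # with a reserved separator token, leaves SEP untouched.
    return " ".join(w if w == SEP else w[::-1] for w in text.split())

def translate_buffer(text_nodes):
    """Translate several text nodes in one buffer, using SEP to
    reconstruct the per-node translations afterwards (sketch)."""
    joined = f" {SEP} ".join(text_nodes)
    translated = translate(joined)
    return re.split(rf"\s*{re.escape(SEP)}\s*", translated)
```

The key assumption is that the separator survives translation unchanged, which in practice means reserving it in the vocabulary.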

On an earlier issue, I received anecdotal feedback from users that our translation quality jumped by a large amount once we correctly sliced up the text sent to the translator. I feel like this could be an area of great quality improvement, and probably speed improvement from sending larger chunks of text for translation.

It doesn't sound particularly trivial though, and would require model training, or at least fine-tuning, to accomplish. I don't think we've reserved extra tokens in our vocabs to take over for this, so it may just be a straight retrain.