sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Aligning Sentences in Freeform Text #250

Open johnml1135 opened 1 year ago

johnml1135 commented 1 year ago

We will have a growing need to break up longer-form text (Word documents, Bloom books, concatenated verses, Bible study notes, book chapters, etc.) that may already contain translated content we want to train on. This is obviously not a new problem in NLP, but we should develop/choose/implement an approach that works across our spectrum of use cases with minimal dependence on other existing text. Here are some papers from a simple Google search:

These may already be implemented in Machine (or not). Their first application may be breaking larger segments into sub-segments - silnlp/issues/182.

johnml1135 commented 1 year ago

Also, for poetry we may be able to translate multiple segments at the same time and then split the results using one of the methods above. This could potentially improve poetry translation - segments there are very short, and more context is often needed.

johnml1135 commented 1 year ago

Two more from Joshua Nemecek:

DCNemesis commented 1 year ago

To expand on use cases: if something works for phrase-to-phrase alignment, it may also work for aligning non-delimited text (like Thai, or in my case phonetic transcripts) to delimited text (and maybe even non-delimited to non-delimited). It's approximately the same problem, just at a different scale. Of course, at the smaller scale (characters vs. words), the scalability of the algorithm becomes more of a concern.

laura-burdick commented 1 year ago

My first thought on this problem is that, for texts and languages where we have paragraph breaks, we may not need something as granular as sentence alignment. Instead, if there's a way to align the paragraphs, for any paragraph that is under 200 tokens, we don't need to do any further alignment. We can segment the texts at the paragraph breaks. For paragraphs that are 200+ tokens, more granular sentence alignment would still be needed.
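The paragraph-first strategy above can be sketched roughly as follows. This is a minimal illustration, not an implementation: `align_sentences` is a hypothetical stand-in for whichever sentence aligner we choose, the positional `zip` assumes paragraphs already correspond 1-1 (the hard part in practice), and the whitespace token count is a crude placeholder.

```python
# Minimal sketch of the paragraph-first strategy: keep short aligned
# paragraphs whole, and only drop to sentence-level alignment for
# paragraph pairs over the token threshold.
MAX_TOKENS = 200  # threshold from the discussion above


def split_paragraphs(text):
    """Split on blank lines; assumes paragraphs are blank-line delimited."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def token_count(paragraph):
    """Crude whitespace token count; a real tokenizer may differ."""
    return len(paragraph.split())


def align_documents(src_text, trg_text, align_sentences):
    """Pair paragraphs positionally (assumes 1-1 paragraph correspondence),
    then only call the sentence aligner for long paragraph pairs."""
    pairs = []
    for src_par, trg_par in zip(split_paragraphs(src_text), split_paragraphs(trg_text)):
        if token_count(src_par) < MAX_TOKENS and token_count(trg_par) < MAX_TOKENS:
            pairs.append((src_par, trg_par))  # paragraph is one training segment
        else:
            pairs.extend(align_sentences(src_par, trg_par))  # finer alignment needed
    return pairs
```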

One of the papers John cited takes this approach: first doing a rough paragraph alignment, and then doing sentence alignment within the aligned paragraphs. Another paper, creating a French-English parallel corpus, does something similar.

I don't have a good sense for (1) how straightforward it will be to do paragraph alignment on our particular documents, and (2) how many paragraphs would be under 200 tokens.

For sentence alignment, here's an interesting paper comparing four sentence alignment techniques on English-Yorùbá parallel texts. Interestingly, they found that, for this particular language pair, some of the more recent approaches (Hunalign, and the embedding-based Vecalign) don't work as well as the earlier Gale-Church method, which relies primarily on comparing sentence lengths. I'm not sure how well this would generalize to other language pairs, and it seems like plenty of folks are also using Vecalign, etc. successfully.
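To make the length-based idea concrete, here is a toy dynamic program in the spirit of Gale-Church. It is only illustrative: real Gale-Church scores length ratios with a Gaussian cost model and allows 2-1/1-2 merges, while this sketch uses a plain length-difference cost and only 1-1, 1-0, and 0-1 moves, and the `skip_cost` value is arbitrary.

```python
# Toy length-based sentence alignment via dynamic programming,
# in the spirit of Gale-Church (simplified cost model and moves).
def align_by_length(src_lens, trg_lens, skip_cost=10):
    n, m = len(src_lens), len(trg_lens)
    INF = float("inf")
    # cost[i][j] = best cost aligning the first i source and j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1: pair the next two sentences
                c = cost[i][j] + abs(src_lens[i] - trg_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:  # 1-0: source sentence with no target match
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:  # 0-1: target sentence with no source match
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # trace back the best path as (previous, current) index pairs
    path, ij = [], (n, m)
    while ij != (0, 0):
        prev = back[ij[0]][ij[1]]
        path.append((prev, ij))
        ij = prev
    return cost[n][m], list(reversed(path))
```

The appeal of this family of methods for our setting is that it needs no dictionaries or embeddings, only sentence lengths, which is why it can hold up surprisingly well for low-resource pairs.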

johnml1135 commented 1 year ago

A few parallel thoughts:

IF we need to do more...

johnml1135 commented 10 months ago

Another paper: https://aclanthology.org/2013.mtsummit-papers.10.pdf Yet Another Fast, Robust and Open Source Sentence Aligner. Time to Reconsider Sentence Alignment?

johnml1135 commented 10 months ago

This is really a few separate things that we could do. Some are separate issues entirely:

For this issue, we can investigate:

Breaking up large segments for pretranslation and then recombining them
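The split-translate-recombine idea could look something like this. Everything here is an assumption for illustration: `translate` is a hypothetical stand-in for the pretranslation engine, the character threshold is arbitrary, and the naive regex splitter ignores the per-language sentence-break problem entirely.

```python
import re

# Sketch of "break up large segments for pretranslation, then recombine".
def split_sentences(segment):
    """Naive split on sentence-final punctuation followed by whitespace.
    A real splitter needs per-language rules and abbreviation handling."""
    return re.split(r"(?<=[.!?])\s+", segment.strip())


def pretranslate_long_segment(segment, translate, max_chars=500):
    """Translate short segments whole; split long ones sentence by
    sentence and rejoin the translations with spaces."""
    if len(segment) <= max_chars:
        return translate(segment)
    return " ".join(translate(s) for s in split_sentences(segment))
```

Rejoining with spaces is the simplest recombination; preserving the original inter-sentence whitespace would require keeping the delimiters from the split.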

johnml1135 commented 10 months ago

@ddaspit - as I am digging into this, I am finding a potentially significant issue in actually determining sentence breaks, since the rules appear to differ by language. Does Paratext have enough info in the project to disambiguate them? We can handle English easily - but can we handle all the other languages?

ddaspit commented 10 months ago

You can start with the LatinSentenceTokenizer in Machine. It attempts to support all languages that use the Latin script. You will need to provide it with a list of common abbreviations, such as "Dr", "Mr", "Mrs", etc. to get the best results.
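To illustrate why the abbreviation list matters, here is a toy splitter showing the behavior such a list is meant to produce: a period after a known abbreviation should not end a sentence. This is NOT the Machine API (its actual interface may differ); the abbreviation set and regexes are purely illustrative.

```python
import re

# Toy abbreviation-aware sentence splitter for Latin-script text.
# Illustrates the behavior a tokenizer's abbreviation list enables;
# not the Machine LatinSentenceTokenizer API.
ABBREVIATIONS = {"Dr", "Mr", "Mrs"}


def split_latin_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Look at the word immediately before the punctuation.
        word_before = re.search(r"(\w+)$", text[: match.start()])
        if word_before and word_before.group(1) in ABBREVIATIONS:
            continue  # "Dr." etc. does not end a sentence
        sentences.append(text[start : match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences
```

Without "Dr" in the set, "Dr. Smith arrived." would be cut in two after "Dr.", which is exactly the failure mode the abbreviation list in Machine is there to prevent.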