sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Aligning Sentences in Freeform Text #250

Open johnml1135 opened 1 year ago

johnml1135 commented 1 year ago

We will have a growing need to break up longer-form text (Word documents, Bloom books, concatenated verses, Bible study notes, book chapters, etc.) that may already contain translated content we want to train on. This is obviously not a new problem in NLP, but we should develop/choose/implement an approach that works across our spectrum of use cases with minimal dependence on other existing text. Here are some papers from a simple Google search:

These may already be implemented in Machine (or not). Their first application may be breaking larger segments into sub-segments - silnlp/issues/182.

johnml1135 commented 1 year ago

Also, for poetry we may be able to translate multiple segments at the same time and then split the results using one of the methods above. This could potentially improve poetry translation - segments there are very short, and more context is often needed.

johnml1135 commented 1 year ago

Two more from Joshua Nemecek:

DCNemesis commented 1 year ago

To expand on use cases: if something works for phrase-to-phrase alignment, it may also work for aligning non-delimited text (like Thai, or in my case phonetic transcripts) to delimited text (and maybe even non-delimited to non-delimited). It's approximately the same problem, just at a different scale. Of course, at the smaller scale (characters vs. words), the scalability of the algorithm becomes more of a concern.

laura-burdick commented 1 year ago

My first thought on this problem is that, for texts and languages where we have paragraph breaks, we may not need something as granular as sentence alignment. Instead, if there's a way to align the paragraphs, for any paragraph that is under 200 tokens, we don't need to do any further alignment. We can segment the texts at the paragraph breaks. For paragraphs that are 200+ tokens, more granular sentence alignment would still be needed.
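The paragraph-first strategy above can be sketched roughly as follows. This is a minimal illustration, not an implementation: `align_sentences` is a hypothetical stand-in for whichever sentence aligner we choose, the positional `zip` assumes paragraphs already correspond 1-1 (the hard part in practice), and the whitespace token count is a crude placeholder.

```python
# Minimal sketch of the paragraph-first strategy: keep short aligned
# paragraphs whole, and only drop to sentence-level alignment for
# paragraph pairs over the token threshold.
MAX_TOKENS = 200  # threshold from the discussion above


def split_paragraphs(text):
    """Split on blank lines; assumes paragraphs are blank-line delimited."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def token_count(paragraph):
    """Crude whitespace token count; a real tokenizer may differ."""
    return len(paragraph.split())


def align_documents(src_text, trg_text, align_sentences):
    """Pair paragraphs positionally (assumes 1-1 paragraph correspondence),
    then only call the sentence aligner for long paragraph pairs."""
    pairs = []
    for src_par, trg_par in zip(split_paragraphs(src_text), split_paragraphs(trg_text)):
        if token_count(src_par) < MAX_TOKENS and token_count(trg_par) < MAX_TOKENS:
            pairs.append((src_par, trg_par))  # paragraph is one training segment
        else:
            pairs.extend(align_sentences(src_par, trg_par))  # finer alignment needed
    return pairs
```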

One of the papers John cited takes this approach: first doing a rough paragraph alignment, and then doing sentence alignment within the aligned paragraphs. Another paper, creating a French-English parallel corpus, does something similar.

I don't have a good sense for (1) how straightforward it will be to do paragraph alignment on our particular documents, and (2) how many paragraphs would be under 200 tokens.

For sentence alignment, here's an interesting paper comparing four sentence alignment techniques on English-Yorùbá parallel texts. Interestingly, they found that, for this particular language pair, some of the more recent approaches (Hunalign, and the embedding-based Vecalign) don't work as well as the earlier Gale-Church method, which relies primarily on comparing sentence lengths. I'm not sure how well this would generalize to other language pairs, and it seems like plenty of folks are also using Vecalign, etc. successfully.
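To make the length-based idea concrete, here is a toy dynamic program in the spirit of Gale-Church. It is only illustrative: real Gale-Church scores length ratios with a Gaussian cost model and allows 2-1/1-2 merges, while this sketch uses a plain length-difference cost and only 1-1, 1-0, and 0-1 moves, and the `skip_cost` value is arbitrary.

```python
# Toy length-based sentence alignment via dynamic programming,
# in the spirit of Gale-Church (simplified cost model and moves).
def align_by_length(src_lens, trg_lens, skip_cost=10):
    n, m = len(src_lens), len(trg_lens)
    INF = float("inf")
    # cost[i][j] = best cost aligning the first i source and j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1: pair the next two sentences
                c = cost[i][j] + abs(src_lens[i] - trg_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:  # 1-0: source sentence with no target match
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:  # 0-1: target sentence with no source match
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # trace back the best path as (previous, current) index pairs
    path, ij = [], (n, m)
    while ij != (0, 0):
        prev = back[ij[0]][ij[1]]
        path.append((prev, ij))
        ij = prev
    return cost[n][m], list(reversed(path))
```

The appeal of this family of methods for our setting is that it needs no dictionaries or embeddings, only sentence lengths, which is why it can hold up surprisingly well for low-resource pairs.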

johnml1135 commented 1 year ago

A few parallel thoughts:

IF we need to do more...

johnml1135 commented 10 months ago

Another paper: https://aclanthology.org/2013.mtsummit-papers.10.pdf Yet Another Fast, Robust and Open Source Sentence Aligner. Time to Reconsider Sentence Alignment?

johnml1135 commented 10 months ago

This is really a few separate things that we could do. Some are separate issues entirely:

For this issue, we can investigate:

Breaking up large segments for pretranslation and then recombining them
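The split-translate-recombine idea could look something like this. Everything here is an assumption for illustration: `translate` is a hypothetical stand-in for the pretranslation engine, the character threshold is arbitrary, and the naive regex splitter ignores the per-language sentence-break problem entirely.

```python
import re

# Sketch of "break up large segments for pretranslation, then recombine".
def split_sentences(segment):
    """Naive split on sentence-final punctuation followed by whitespace.
    A real splitter needs per-language rules and abbreviation handling."""
    return re.split(r"(?<=[.!?])\s+", segment.strip())


def pretranslate_long_segment(segment, translate, max_chars=500):
    """Translate short segments whole; split long ones sentence by
    sentence and rejoin the translations with spaces."""
    if len(segment) <= max_chars:
        return translate(segment)
    return " ".join(translate(s) for s in split_sentences(segment))
```

Rejoining with spaces is the simplest recombination; preserving the original inter-sentence whitespace would require keeping the delimiters from the split.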

johnml1135 commented 10 months ago

@ddaspit - as I am digging into this, I am finding a potentially significant issue in actually determining sentence breaks, since the rules appear to differ by language. Does Paratext have enough info in the project to disambiguate them? We can handle English easily - but can we handle all the other languages?

ddaspit commented 10 months ago

You can start with the LatinSentenceTokenizer in Machine. It attempts to support all languages that use the Latin script. You will need to provide it with a list of common abbreviations, such as "Dr", "Mr", "Mrs", etc. to get the best results.
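To illustrate why the abbreviation list matters, here is a toy splitter showing the behavior such a list is meant to produce: a period after a known abbreviation should not end a sentence. This is NOT the Machine API (its actual interface may differ); the abbreviation set and regexes are purely illustrative.

```python
import re

# Toy abbreviation-aware sentence splitter for Latin-script text.
# Illustrates the behavior a tokenizer's abbreviation list enables;
# not the Machine LatinSentenceTokenizer API.
ABBREVIATIONS = {"Dr", "Mr", "Mrs"}


def split_latin_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Look at the word immediately before the punctuation.
        word_before = re.search(r"(\w+)$", text[: match.start()])
        if word_before and word_before.group(1) in ABBREVIATIONS:
            continue  # "Dr." etc. does not end a sentence
        sentences.append(text[start : match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences
```

Without "Dr" in the set, "Dr. Smith arrived." would be cut in two after "Dr.", which is exactly the failure mode the abbreviation list in Machine is there to prevent.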