sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Auto-sentence breaker #258

Open johnml1135 opened 10 months ago

johnml1135 commented 10 months ago

Is this even possible? Can we, with minimal a priori knowledge, separate sentences in all languages and all scripts well enough that, when combined with a Gale-Church sentence aligner, we get decent training and translation data within our 200-token maximum? It doesn't have to be perfect - splitting sentences up more than necessary may be OK. The main question is: what will we find in different languages in terms of sentence-ending punctuation?

First proposal idea:

ddaspit commented 10 months ago

We don't need a perfect solution. We probably only need a small set of sentence tokenizers to cover the majority of the world's scripts. We could use the script code to choose the default sentence tokenizer and make it configurable for exceptional cases.
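A minimal sketch of that dispatch idea, keyed on ISO 15924 script codes; every tokenizer name here other than LatinSentenceTokenizer (mentioned below) is a placeholder:

```python
# Sketch: choose a default sentence tokenizer by script code, with a
# config override for exceptional cases. Tokenizer names other than
# LatinSentenceTokenizer are hypothetical.
DEFAULT_TOKENIZERS = {
    "Latn": "LatinSentenceTokenizer",
    "Cyrl": "LatinSentenceTokenizer",       # period-based scripts share one tokenizer
    "Thai": "WhitespaceSentenceTokenizer",  # no word breaks; whitespace marks sentences
}

def choose_tokenizer(script_code: str, override: str | None = None) -> str:
    """Return the configured tokenizer, else the script default, else the period-based one."""
    return override or DEFAULT_TOKENIZERS.get(script_code, "LatinSentenceTokenizer")
```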

johnml1135 commented 10 months ago

That is what I am thinking. I am going to try out a little test script on the eBible corpus and see how far we can get there.

johnml1135 commented 10 months ago

Most common endings in the eBible corpus (by number of occurrences):

.    957
|     14
।     11
።      4
。      4
׃      
>      3
॥      2
۔      2
ν      1
น      1
)      1
၊      1
ฯ      1
།      1
-      1
။      1
។      1
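A minimal sketch of the kind of counting script that could produce a table like this, assuming the corpus is a directory of plain-text files with one verse per line (the path and the one-vote-per-file tally are assumptions):

```python
# For each verse-per-line text file, find its most common verse-final
# character, then tally those winners across all files.
from collections import Counter
from pathlib import Path

ending_counts = Counter()
for path in Path("ebible-corpus").glob("*.txt"):  # hypothetical layout
    final_chars = Counter(
        line.strip()[-1]
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    )
    if final_chars:
        # Credit each file once, to its most common verse-final character.
        ending_counts[final_chars.most_common(1)[0][0]] += 1

for char, count in ending_counts.most_common():
    print(f"{char}\t{count}")
```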
ddaspit commented 10 months ago

I would say that if we had a sentence tokenizer for scripts that use a period (LatinSentenceTokenizer) and a sentence tokenizer for scripts without word breaks, then we would have good coverage for most scripts.

johnml1135 commented 10 months ago

Some non-period endings: [image: examples of non-period sentence endings]

To get the last few percent, we can likely do a bit of simple statistical analysis (as above) to get 80%+ of the breaks.

johnml1135 commented 10 months ago

Here is my current understanding:

A straightforward algorithm for breaking up sentences can then be:

The test to perform would be:

johnml1135 commented 10 months ago

Also, for Thai and a few other languages/scripts, there are no word breaks, and sentences are demarcated by whitespace.

mmartin9684-sil commented 10 months ago

The end-of-sentence detection will probably need a little logic around where the sentence actually ends, particularly for dialog. The end of the sentence likely needs to be extended to include the quote marks following the terminating character. For instance, MAT 2:8: And he sent them to Bethlehem, saying, "Go and search diligently for the child, and when you have found him, bring me word, that I too may come and worship him."
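A minimal sketch of handling that, assuming a regex splitter whose terminator set is drawn from the frequency table above and which keeps trailing quotes and brackets with the sentence they close:

```python
import re

# Sentence-final punctuation seen in the corpus survey above (a subset),
# plus Latin ? and !; extend per script as needed.
TERMINATORS = ".!?।॥۔。።။។"
# Closing quotes/brackets that should stay attached to the preceding sentence.
TRAILING = "\"'”’»)]"

SENT_END = re.compile(f"[{re.escape(TERMINATORS)}]+[{re.escape(TRAILING)}]*\\s+")

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for match in SENT_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('He said, "Bring me word." Then they went.'))
# -> ['He said, "Bring me word."', 'Then they went.']
```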

johnml1135 commented 10 months ago

For the initial stab, I'm getting pretty bad error rates: [image: error rates]

ddaspit commented 10 months ago

NLTK has an implementation of the well-known Punkt tokenizer, which is a trainable sentence tokenizer. You can also take a look at the SentenceRecognizer in spaCy, which is also trainable. It looks like a simple neural network. We could try training on the target data for a single language when we perform a build, or train on a multilingual corpus to see if we can create a general-purpose model.
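Punkt is unsupervised, so trying it per language is cheap. A minimal sketch, assuming `target_lang.txt` (a hypothetical file) holds the monolingual target-side text:

```python
# Train NLTK's Punkt on raw target-language text; it learns abbreviations
# and likely sentence starters without any labeled data.
from nltk.tokenize.punkt import PunktSentenceTokenizer

with open("target_lang.txt", encoding="utf-8") as f:  # hypothetical file
    raw_text = f.read()

tokenizer = PunktSentenceTokenizer(train_text=raw_text)
for sentence in tokenizer.tokenize("Dr. Smith arrived. He sat down."):
    print(sentence)
```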

johnml1135 commented 10 months ago

Both of those look pretty good. In looking at the data, even with my very dumb tokenizer, I am seeing the following significant issues:

ddaspit commented 10 months ago

Gale-Church is an old method that isn't really used anymore. The more modern approach is to compute a distance measure between sentence embeddings to align sentences. NLLB used LASER3 encoders to generate sentence embeddings. It performs a global search on a monolingual corpus to create the parallel corpus. Rather than introducing the errors from the sentence alignment algorithm in order to evaluate sentence tokenization, I think it would be better to just create a gold standard dataset. I don't think it would be difficult to create a gold standard. You could run your simple tokenizer on a number of random verses from various translations in the eBible corpus and then manually correct the results.
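For reference, the embedding-based alignment step is conceptually simple. A sketch, where `embed` stands in for any multilingual sentence encoder (a LASER-style model, say) and is an assumption, not a real API:

```python
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    """Hypothetical encoder: one L2-normalized vector per sentence."""
    raise NotImplementedError

def align(src: list[str], tgt: list[str]) -> list[tuple[int, int]]:
    # Cosine similarity reduces to a dot product on normalized vectors.
    sim = embed(src) @ embed(tgt).T
    # Greedy 1-1 matching: each source sentence takes its nearest target.
    return [(i, int(sim[i].argmax())) for i in range(len(src))]
```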

johnml1135 commented 9 months ago

A gold standard with spaCy's SentenceRecognizer seems to be the best plan:

I'll try it out.
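A minimal sketch of that training loop, assuming a gold standard expressed as token-level sentence-start flags (the sample text and annotation here are illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("xx")  # "xx" = spaCy's multi-language pipeline
nlp.add_pipe("senter")   # the SentenceRecognizer component

# Gold standard: flags marking which tokens begin a sentence.
doc = nlp.make_doc("Go and search . Bring me word .")
flags = [True, False, False, False, True, False, False, False]
example = Example.from_dict(doc, {"sent_starts": flags})

optimizer = nlp.initialize(lambda: [example])
for _ in range(20):  # tiny toy loop; a real run needs many gold verses
    nlp.update([example], sgd=optimizer)

for sent in nlp("Go and search . Bring me word .").sents:
    print(sent.text)
```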