johnml1135 opened this issue 11 months ago
We don't need a perfect solution. We probably only need a small set of sentence tokenizers to cover the majority of the world's scripts. We could use the script code to choose the default sentence tokenizer and make it configurable for exceptional cases.
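As a rough sketch of the script-code idea, the dispatch could be a simple lookup table with a per-project override. The `DEFAULT_TOKENIZER_BY_SCRIPT` table, the `resolve_sentence_tokenizer` helper, and the tokenizer names are hypothetical placeholders (LatinSentenceTokenizer is only mentioned later in this thread), not an existing API:

```python
# Hypothetical sketch: pick a default sentence tokenizer from an ISO 15924
# script code, with an override table for exceptional cases.
# The tokenizer names below are placeholders, not a real API.

DEFAULT_TOKENIZER_BY_SCRIPT = {
    "Latn": "LatinSentenceTokenizer",        # period-based scripts
    "Cyrl": "LatinSentenceTokenizer",
    "Deva": "LatinSentenceTokenizer",        # danda '।' treated as a terminator
    "Thai": "WhitespaceSentenceTokenizer",   # no word breaks; spaces end sentences
    "Khmr": "WhitespaceSentenceTokenizer",
}

def resolve_sentence_tokenizer(script_code: str, overrides: dict | None = None) -> str:
    """Return the tokenizer name for a script, allowing per-project overrides."""
    if overrides and script_code in overrides:
        return overrides[script_code]
    return DEFAULT_TOKENIZER_BY_SCRIPT.get(script_code, "LatinSentenceTokenizer")
```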
That is what I am thinking. I am going to try a little test script on the eBible corpus and see how far we can get.
Most common endings in the eBible corpus, by number of occurrences (a sketch of the counting script follows the list):
. 957
| 14
। 11
። 4
。 4
׃
> 3
॥ 2
۔ 2
ν 1
น 1
) 1
၊ 1
ฯ 1
། 1
- 1
။ 1
។ 1
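For reference, a counting script along these lines could look like the sketch below. The corpus layout (a directory of plain-text files with one verse per line) and the choice to tally each translation once by its most common ending character are my assumptions about how the numbers above were produced:

```python
# Sketch of a verse-ending frequency count over the eBible corpus.
# Assumes a directory of plain-text files, one verse per line; the actual
# layout of the corpus download may differ.
from collections import Counter
from pathlib import Path

def count_final_characters(corpus_dir: str) -> Counter:
    endings = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        per_file = Counter()
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                per_file[line[-1]] += 1
        if per_file:
            # Tally each translation once, by its most common verse ending.
            endings[per_file.most_common(1)[0][0]] += 1
    return endings

print(count_final_characters("ebible/corpus").most_common(20))
```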
I would say that if we had a sentence tokenizer for scripts that use a period (LatinSentenceTokenizer) and a sentence tokenizer for scripts without word breaks, then we would have good coverage for most scripts.
Some non-period endings:
To get the last few percent, we can likely do a bit of simple statistical analysis (as above) to recover 80%+ of the remaining breaks.
Here is my current understanding:
. ! ? | । ። 。 」 ॥ ။ ฯ ۔ ׃ ། appear to be sentence markers
」 ¶ - appear to be paragraph markers
A straightforward algorithm for breaking up sentences can then be:
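As a rough illustration only, a splitter along these lines might look like the following sketch. The marker set, the quote handling, and the regex are my assumptions based on the characters listed above, not a settled design:

```python
# Illustrative sketch: split text into sentences on the marker characters
# listed above, keeping trailing closing quotes attached to the sentence.
import re

SENTENCE_MARKERS = ".!?|।።。」॥။ฯ۔׃།"
CLOSING_QUOTES = "\"'”’」』"

_MARKERS = re.escape(SENTENCE_MARKERS)
_QUOTES = re.escape(CLOSING_QUOTES)
# A sentence: text without a terminator, one or more terminators,
# any closing quotes, then trailing whitespace.
_SENT_END = re.compile(rf"[^{_MARKERS}]*[{_MARKERS}]+[{_QUOTES}]*\s*")

def split_sentences(text: str) -> list[str]:
    sentences = []
    last_end = 0
    for m in _SENT_END.finditer(text):
        sentences.append(m.group().strip())
        last_end = m.end()
    tail = text[last_end:].strip()
    if tail:
        sentences.append(tail)  # keep text that has no terminator at all
    return sentences
```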
The test to perform would be:
Also, for Thai and a few other languages/scripts, there are no word breaks and sentences are demarcated by whitespace.
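For those scripts, a minimal splitter could simply break on runs of whitespace, something like this sketch (it will over-split wherever spaces are used for anything other than sentence boundaries):

```python
# Minimal sketch for scripts like Thai, where words are unspaced and
# whitespace itself marks sentence (or clause) boundaries.
import re

def split_on_whitespace(text: str) -> list[str]:
    return [s for s in re.split(r"\s+", text.strip()) if s]
```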
The end-of-sentence detection will probably need a little logic around where the sentence actually ends, particularly for dialog. The end-of-sentence likely needs to be extended to include the quote marks following the terminating character. For instance: MAT 2:8 - And he sent them to Bethlehem, saying, "Go and search diligently for the child, and when you have found him, bring me word, that I too may come and worship him."
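A small helper can push a detected sentence end past any trailing quote marks; the quote set below is an assumption:

```python
# Sketch: extend a detected sentence end past closing quote marks, so the
# final '."' in the MAT 2:8 example stays with its sentence.
CLOSING_QUOTES = "\"'”’»」』"

def extend_past_quotes(text: str, end: int) -> int:
    """`end` is the index just after the terminating character."""
    while end < len(text) and text[end] in CLOSING_QUOTES:
        end += 1
    return end

verse = ('And he sent them to Bethlehem, saying, "Go and search diligently '
         'for the child, and when you have found him, bring me word, '
         'that I too may come and worship him."')
end = verse.index(".", verse.index("worship")) + 1  # just after the final period
assert verse[:extend_past_quotes(verse, end)].endswith('him."')
```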
For the initial stab, I'm getting pretty bad error rates:
NLTK has an implementation of the well-known Punkt tokenizer, which is a trainable sentence tokenizer. You can also take a look at the SentenceRecognizer in spaCy, which is also trainable; it looks like a simple neural network. We could try training on the target data for a single language when we perform a build, or train on a multilingual corpus to see if we can create a general-purpose model.
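For reference, unsupervised Punkt training on raw target-language text could look roughly like this sketch (the corpus path and test sentence are placeholders):

```python
# Sketch: train NLTK's unsupervised Punkt model on raw target-language text,
# then use it to split sentences.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

raw_text = open("target_language_corpus.txt", encoding="utf-8").read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # learn collocations such as abbreviations
trainer.train(raw_text, finalize=False)
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Dr. Smith went to Bethlehem. He searched diligently."))
```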
Both of those look pretty good. Looking at the data, even with my very dumb tokenizer, I am seeing the following significant issues:
Gale-Church is an old method that isn't really used anymore. The more modern approach is to compute a distance measure between sentence embeddings to align sentences. NLLB used LASER3 encoders to generate sentence embeddings and performed a global search over monolingual corpora to mine its parallel corpus. Rather than introducing errors from the sentence alignment algorithm in order to evaluate sentence tokenization, I think it would be better to just create a gold standard dataset. It shouldn't be difficult: you could run your simple tokenizer on a number of random verses from various translations in the eBible corpus and then manually correct the results.
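As a sketch of the embedding-distance idea, scoring candidate sentence pairs could look like the following. LaBSE via sentence-transformers is used here only as a convenient stand-in for the LASER3 encoders, and the example sentences are purely illustrative:

```python
# Sketch of embedding-based sentence alignment scoring.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

src_sents = ["And he sent them to Bethlehem.",
             "Go and search diligently for the child."]
trg_sents = ["Et il les envoya à Bethléem.",
             "Allez, et prenez des informations exactes sur le petit enfant."]

src_emb = model.encode(src_sents, normalize_embeddings=True)
trg_emb = model.encode(trg_sents, normalize_embeddings=True)

# Cosine similarity matrix; a greedy argmax gives a rough 1-1 alignment.
sim = src_emb @ trg_emb.T
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    print(f"{src_sents[i]!r} -> {trg_sents[j]!r} (score {row[j]:.2f})")
```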
Gold standard with spaCy's SentenceRecognizer seems to be the best plan:
I'll try it out.
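Whatever tokenizer we end up with, scoring it against the gold standard could be as simple as comparing break offsets per verse, e.g. the sketch below (it assumes each tokenizer returns its sentences as substrings of the verse, in order):

```python
# Sketch: score any sentence tokenizer against a manually corrected gold
# standard by comparing the character offsets of sentence breaks in each verse.
def break_offsets(verse: str, sentences: list[str]) -> set[int]:
    """End offsets (within the verse) of every sentence except the last."""
    offsets, pos = set(), 0
    for sent in sentences[:-1]:
        pos = verse.index(sent, pos) + len(sent)
        offsets.add(pos)
    return offsets

def score(verses, predicted, gold):
    tp = fp = fn = 0
    for verse, pred_sents, gold_sents in zip(verses, predicted, gold):
        pred = break_offsets(verse, pred_sents)
        ref = break_offsets(verse, gold_sents)
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```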
Is this even possible? Can we, with minimal a priori knowledge, separate sentences in all languages and all scripts well enough that, when combined with a Gale-Church sentence aligner, we can get decent training and translation data within our 200-token maximum? It doesn't have to be perfect, and splitting sentences up more than they should be may be OK. The main issue is: what will we find in different languages in terms of sentence-ending punctuation?
First proposal idea: