sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Auto-sentence breaker #258

Open johnml1135 opened 10 months ago

johnml1135 commented 10 months ago

Is this even possible? Can we, with minimal a priori knowledge, separate sentences in all languages and all scripts well enough that, when combined with a Gale-Church sentence aligner, we get decent training and translation data within our 200-token maximum? It doesn't have to be perfect - splitting sentences up more than necessary may be OK. The main question is: what will we find in different languages in terms of sentence-ending punctuation?

First proposal idea:

ddaspit commented 10 months ago

We don't need a perfect solution. We probably only need a small set of sentence tokenizers to cover the majority of the world's scripts. We could use the script code to choose the default sentence tokenizer and make it configurable for exceptional cases.
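A minimal sketch of that dispatch idea, keyed on ISO 15924 script codes; every tokenizer name here other than LatinSentenceTokenizer (mentioned below) is a placeholder:

```python
# Sketch: choose a default sentence tokenizer by script code, with a
# config override for exceptional cases. Tokenizer names other than
# LatinSentenceTokenizer are hypothetical.
DEFAULT_TOKENIZERS = {
    "Latn": "LatinSentenceTokenizer",
    "Cyrl": "LatinSentenceTokenizer",       # period-based scripts share one tokenizer
    "Thai": "WhitespaceSentenceTokenizer",  # no word breaks; whitespace marks sentences
}

def choose_tokenizer(script_code: str, override: str | None = None) -> str:
    """Return the configured tokenizer, else the script default, else the period-based one."""
    return override or DEFAULT_TOKENIZERS.get(script_code, "LatinSentenceTokenizer")
```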

johnml1135 commented 10 months ago

That is what I am thinking. I am going to try out a little test script on the eBible corpus and see how far we can get there.

johnml1135 commented 10 months ago

Most common endings in the eBible corpus (by number of occurrences):

.    957
|     14
।     11
።      4
。      4
׃      
>      3
॥      2
۔      2
ν      1
น      1
)      1
၊      1
ฯ      1
།      1
-      1
။      1
។      1
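A minimal sketch of the kind of counting script that could produce a table like this, assuming the corpus is a directory of plain-text files with one verse per line (the path and the one-vote-per-file tally are assumptions):

```python
# For each verse-per-line text file, find its most common verse-final
# character, then tally those winners across all files.
from collections import Counter
from pathlib import Path

ending_counts = Counter()
for path in Path("ebible-corpus").glob("*.txt"):  # hypothetical layout
    final_chars = Counter(
        line.strip()[-1]
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    )
    if final_chars:
        # Credit each file once, to its most common verse-final character.
        ending_counts[final_chars.most_common(1)[0][0]] += 1

for char, count in ending_counts.most_common():
    print(f"{char}\t{count}")
```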
ddaspit commented 10 months ago

I would say that if we had a sentence tokenizer for scripts that use a period (LatinSentenceTokenizer) and a sentence tokenizer for scripts without word breaks, then we would have good coverage for most scripts.

johnml1135 commented 10 months ago

Some non-period endings: [image: examples of non-period sentence endings]

To get the last few percent, we can likely do a bit of simple statistical analysis (as above) to get 80%+ of the breaks.

johnml1135 commented 10 months ago

Here is my current understanding:

A straightforward algorithm for breaking up sentences can then be:

The test to perform would be:

johnml1135 commented 10 months ago

Also, for Thai and a few other languages/scripts, there are no word breaks, and sentences are demarcated by whitespace.

mmartin9684-sil commented 10 months ago

The end-of-sentence detection will probably need a little logic around where the sentence actually ends, particularly for dialog. The end of the sentence likely needs to be extended to include the quote marks following the terminating character. For instance, MAT 2:8: And he sent them to Bethlehem, saying, "Go and search diligently for the child, and when you have found him, bring me word, that I too may come and worship him."
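A minimal sketch of handling that, assuming a regex splitter whose terminator set is drawn from the frequency table above and which keeps trailing quotes and brackets with the sentence they close:

```python
import re

# Sentence-final punctuation seen in the corpus survey above (a subset),
# plus Latin ? and !; extend per script as needed.
TERMINATORS = ".!?।॥۔。።။។"
# Closing quotes/brackets that should stay attached to the preceding sentence.
TRAILING = "\"'”’»)]"

SENT_END = re.compile(f"[{re.escape(TERMINATORS)}]+[{re.escape(TRAILING)}]*\\s+")

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for match in SENT_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('He said, "Bring me word." Then they went.'))
# -> ['He said, "Bring me word."', 'Then they went.']
```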

johnml1135 commented 10 months ago

For the initial stab, I'm getting pretty bad error rates: [image: error rates]

ddaspit commented 10 months ago

NLTK has an implementation of the well-known Punkt tokenizer, which is a trainable sentence tokenizer. You can also take a look at the SentenceRecognizer in spaCy, which is also trainable. It looks like a simple neural network. We could try training on the target data for a single language when we perform a build, or train on a multilingual corpus to see if we can create a general-purpose model.
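Punkt is unsupervised, so trying it per language is cheap. A minimal sketch, assuming `target_lang.txt` (a hypothetical file) holds the monolingual target-side text:

```python
# Train NLTK's Punkt on raw target-language text; it learns abbreviations
# and likely sentence starters without any labeled data.
from nltk.tokenize.punkt import PunktSentenceTokenizer

with open("target_lang.txt", encoding="utf-8") as f:  # hypothetical file
    raw_text = f.read()

tokenizer = PunktSentenceTokenizer(train_text=raw_text)
for sentence in tokenizer.tokenize("Dr. Smith arrived. He sat down."):
    print(sentence)
```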

johnml1135 commented 10 months ago

Both of those look pretty good. In looking at the data, even with my very dumb tokenizer, I am seeing the following significant issues:

ddaspit commented 10 months ago

Gale-Church is an old method that isn't really used anymore. The more modern approach is to compute a distance measure between sentence embeddings to align sentences. NLLB used LASER3 encoders to generate sentence embeddings. It performs a global search on a monolingual corpus to create the parallel corpus. Rather than introducing the errors from the sentence alignment algorithm in order to evaluate sentence tokenization, I think it would be better to just create a gold standard dataset. I don't think it would be difficult to create a gold standard. You could run your simple tokenizer on a number of random verses from various translations in the eBible corpus and then manually correct the results.
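For reference, the embedding-based alignment step is conceptually simple. A sketch, where `embed` stands in for any multilingual sentence encoder (a LASER-style model, say) and is an assumption, not a real API:

```python
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    """Hypothetical encoder: one L2-normalized vector per sentence."""
    raise NotImplementedError

def align(src: list[str], tgt: list[str]) -> list[tuple[int, int]]:
    # Cosine similarity reduces to a dot product on normalized vectors.
    sim = embed(src) @ embed(tgt).T
    # Greedy 1-1 matching: each source sentence takes its nearest target.
    return [(i, int(sim[i].argmax())) for i in range(len(src))]
```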

johnml1135 commented 9 months ago

A gold standard with spaCy's SentenceRecognizer seems to be the best plan:

I'll try it out.
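A minimal sketch of that training loop, assuming a gold standard expressed as token-level sentence-start flags (the sample text and annotation here are illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("xx")  # "xx" = spaCy's multi-language pipeline
nlp.add_pipe("senter")   # the SentenceRecognizer component

# Gold standard: flags marking which tokens begin a sentence.
doc = nlp.make_doc("Go and search . Bring me word .")
flags = [True, False, False, False, True, False, False, False]
example = Example.from_dict(doc, {"sent_starts": flags})

optimizer = nlp.initialize(lambda: [example])
for _ in range(20):  # tiny toy loop; a real run needs many gold verses
    nlp.update([example], sgd=optimizer)

for sent in nlp("Go and search . Bring me word .").sents:
    print(sent.text)
```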