ulaval-rs / trombone

GNU General Public License v3.0

How to handle acronyms when counting sentences? #13

Closed gacou54 closed 2 years ago

gacou54 commented 3 years ago

Some algorithms count the number of sentences. When acronyms are encountered, they can be wrongly counted as sentence ends. This is because the sentence counter treats a dot followed by a space (". ") as the end of a sentence.

For example, the following sentence is counted as two: "The U.S. Office is here."
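
To make the problem concrete, here is a minimal sketch of a naive counter that uses only the rule described above. The class and method names are hypothetical and are not taken from the trombone code base; the sketch only assumes that a dot at the very end of the text also closes a sentence.

```java
// Hypothetical helper for illustration; not the project's actual counter.
final class NaiveSentenceCounter {

    // Counts a sentence end at every dot that is either the last character
    // of the text or is directly followed by a space.
    static int countSentences(String text) {
        int nbrOfSentences = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            if (i == length - 1 || text.charAt(i + 1) == ' ') {
                nbrOfSentences++;
            }
        }
        return nbrOfSentences;
    }
}
```

Calling `NaiveSentenceCounter.countSentences("The U.S. Office is here.")` returns 2: the dot after "S" is followed by a space and is counted, and so is the final dot.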

gacou54 commented 3 years ago

The following logic is used in the feature/coleman-liau-index branch to avoid counting a sentence end on acronyms for the Coleman-Liau Index. It works for acronyms with multiple dots.

else if (c == '.') {
    if (i == length - 1) {
        // This is the end of the text.
        nbrOfSentences++;
    }
    // This logic excludes acronyms with two dots (e.g. "The U.S. Office is here.").
    // For a dot followed by a space (". "), it looks for another dot in the two preceding characters.
    else if (text.charAt(i + 1) == ' ') {
        // i >= 2 keeps substring(i - 2, i) within bounds.
        if (i >= 2 && !text.substring(i - 2, i).contains(".")) {
            nbrOfSentences++;
        }
    }
}

Abbreviations with a single dot are not handled right now (e.g. "etc." will be counted as a sentence end).
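
The heuristic and its single-dot limitation can be checked with a self-contained sketch. The class name, the method name, and the surrounding loop are assumptions made for illustration; only the dot-handling logic follows the fragment above, with the bounds guard written as `i >= 2`.

```java
// Hypothetical wrapper for illustration; only the dot handling mirrors the fragment above.
final class AcronymAwareSentenceCounter {

    static int countSentences(String text) {
        int nbrOfSentences = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            if (i == length - 1) {
                // A dot at the very end of the text always closes a sentence.
                nbrOfSentences++;
            } else if (text.charAt(i + 1) == ' ') {
                // Skip the dot when one of the two preceding characters is also
                // a dot, as in "U.S. ". The i >= 2 guard (an assumption) keeps
                // substring(i - 2, i) within bounds.
                if (i >= 2 && !text.substring(i - 2, i).contains(".")) {
                    nbrOfSentences++;
                }
            }
        }
        return nbrOfSentences;
    }
}
```

With this sketch, "The U.S. Office is here." counts as 1 sentence, while "Bring pens, paper, etc. to class." still counts as 2, because the two characters before the dot in "etc." are "tc" and contain no dot.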

gacou54 commented 2 years ago

Closing this issue, I consider the solution OK.