malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

Word count utility method #22

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

Just an idea to help people who have problems with short sentences (fewer than 
10 or 15 words). It used to be a problem for me.

Perhaps add a utility method to some class (the Detector class?) that counts 
the number of words in a String.

For example, here is the static method I use:

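    public static int wordCount(String line) {
        int idx = 0;
        int cnt = 0;
        while (idx < line.length()) {
            // Skip anything that is not a letter (spaces, digits, punctuation).
            if (!Character.isLetter(line.charAt(idx))) {
                idx++;
                continue;
            }
            // A letter starts a new word.
            cnt++;
            // Consume the rest of this run of letters.
            while (idx < line.length() && Character.isLetter(line.charAt(idx))) {
                idx++;
            }
        }
        return cnt;
    }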

Or why not throw an exception directly when the String does not contain enough 
words (LangDetectException: "Too few words")?
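
To make the exception idea concrete, here is a minimal sketch of such a guard. The names ShortTextGuard, TooFewWordsException, requireEnoughWords and MIN_WORDS are hypothetical and are not part of langdetect; a stand-in exception is used instead of the library's LangDetectException, and the threshold of 10 is only an assumption based on the "10 or 15 words" figure above.

```java
final class ShortTextGuard {
    // Hypothetical stand-in exception; this is NOT langdetect's LangDetectException.
    static final class TooFewWordsException extends Exception {
        TooFewWordsException(String message) {
            super(message);
        }
    }

    // Assumed threshold for this sketch; tune it for your own application.
    static final int MIN_WORDS = 10;

    // Counts maximal runs of letters, as defined by Character.isLetter.
    static int wordCount(String line) {
        int cnt = 0;
        boolean prevIsLetter = false;
        for (int idx = 0; idx < line.length(); idx++) {
            boolean isLetter = Character.isLetter(line.charAt(idx));
            if (!prevIsLetter && isLetter) {
                cnt++;
            }
            prevIsLetter = isLetter;
        }
        return cnt;
    }

    // Rejects text that is too short before it is handed to a detector.
    static void requireEnoughWords(String text) throws TooFewWordsException {
        int words = wordCount(text);
        if (words < MIN_WORDS) {
            throw new TooFewWordsException(
                "Too few words: " + words + " (minimum " + MIN_WORDS + ")");
        }
    }

    public static void main(String[] args) {
        try {
            requireEnoughWords("Bonjour tout le monde");
            System.out.println("enough words");
        } catch (TooFewWordsException e) {
            // The sample has 4 words, so this branch runs.
            System.out.println(e.getMessage());
        }
    }
}
```

A caller would invoke requireEnoughWords on the input first and only run detection when no exception is thrown.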

But it is just an idea. 

But, indeed, it is a great API, with good response time and the ability to add 
supported languages. Thanks a lot.

Regards,
Emmanuel

Original issue reported on code.google.com by zygolech...@gmail.com on 29 Aug 2011 at 9:25

GoogleCodeExporter commented 9 years ago
Hello Emmanuel,

I think that the answer to 'What is a WORD?' is different for each application, 
so langdetect doesn't provide such utility methods.
But thank you very much.

I might write the same function as follows (but this code is not tested 
:P)

    public static int wordCount(String line) {
        int cnt = 0;
        // True when the previous character was a letter.
        boolean prevIsLetter = false;
        for (int idx = 0; idx < line.length(); idx++) {
            boolean isLetter = Character.isLetter(line.charAt(idx));
            // A non-letter to letter transition marks the start of a word.
            if (!prevIsLetter && isLetter) cnt++;
            prevIsLetter = isLetter;
        }
        return cnt;
    }

Original comment by nakatani.shuyo on 30 Aug 2011 at 3:43
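
To illustrate why the definition of a word depends on the application, the sketch below runs the letter-run counter from this thread on a few inputs. The class name WordDefinitionDemo and the sample strings are made up for illustration. The counts follow from how Character.isLetter classifies characters: apostrophes and digits split words, and text written without spaces is treated as a single word.

```java
final class WordDefinitionDemo {
    // Same letter-run definition as the method discussed in this thread.
    static int wordCount(String line) {
        int cnt = 0;
        boolean prevIsLetter = false;
        for (int idx = 0; idx < line.length(); idx++) {
            boolean isLetter = Character.isLetter(line.charAt(idx));
            if (!prevIsLetter && isLetter) {
                cnt++;
            }
            prevIsLetter = isLetter;
        }
        return cnt;
    }

    public static void main(String[] args) {
        // Punctuation and spaces separate words.
        System.out.println(wordCount("Hello, world!"));   // 2
        // The apostrophe is not a letter, so "don't" counts as "don" and "t".
        System.out.println(wordCount("don't stop"));      // 3
        // Digits are not letters, so they split the token.
        System.out.println(wordCount("abc123def"));       // 2
        // A Japanese sentence (written with Unicode escapes to avoid
        // source-encoding issues): no spaces and every character is a
        // letter, so the whole sentence counts as one word.
        String japanese = "\u3053\u308c\u306f\u65e5\u672c\u8a9e\u3067\u3059";
        System.out.println(wordCount(japanese));          // 1
    }
}
```

An application that needs locale-aware word boundaries would have to use a different definition, which is the point made in the comment above.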

GoogleCodeExporter commented 9 years ago
Thanks for reviewing the method. I have tested it and it works. I have not 
measured the performance gain, though I think there is one. The first 
version has 2 loops with one 'continue' (the worst thing that can happen). The 
second has only one 'classical' loop.

Original comment by zygolech...@gmail.com on 30 Aug 2011 at 8:11
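
For anyone who wants to check the two versions against each other, here is a rough sketch that first verifies they return the same counts and then times each once. The class name WordCountComparison, the method names and the sample text are hypothetical. The timing is deliberately naive (a single run, no JIT warm-up), so treat the numbers as a hint only; a proper harness such as JMH would be needed for a reliable measurement.

```java
final class WordCountComparison {
    // First version from the thread: outer loop plus inner letter-run loop.
    static int wordCountTwoLoops(String line) {
        int idx = 0;
        int cnt = 0;
        while (idx < line.length()) {
            if (!Character.isLetter(line.charAt(idx))) {
                idx++;
                continue;
            }
            cnt++;
            while (idx < line.length() && Character.isLetter(line.charAt(idx))) {
                idx++;
            }
        }
        return cnt;
    }

    // Second version from the thread: single pass tracking the previous character.
    static int wordCountSinglePass(String line) {
        int cnt = 0;
        boolean prevIsLetter = false;
        for (int idx = 0; idx < line.length(); idx++) {
            boolean isLetter = Character.isLetter(line.charAt(idx));
            if (!prevIsLetter && isLetter) {
                cnt++;
            }
            prevIsLetter = isLetter;
        }
        return cnt;
    }

    public static void main(String[] args) {
        // Correctness first: both versions must agree on every sample.
        String[] samples = {"", "   ", "one", "one two", "  a,b;c  ", "abc123def"};
        for (String s : samples) {
            if (wordCountTwoLoops(s) != wordCountSinglePass(s)) {
                throw new IllegalStateException("Mismatch on: \"" + s + "\"");
            }
        }

        // Build a large input so a single run takes a measurable amount of time.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append("lorem ipsum, dolor 42 sit amet! ");
        }
        String big = sb.toString();

        long t0 = System.nanoTime();
        int a = wordCountTwoLoops(big);
        long t1 = System.nanoTime();
        int b = wordCountSinglePass(big);
        long t2 = System.nanoTime();

        System.out.println("two loops:   " + a + " words in " + (t1 - t0) / 1_000 + " us");
        System.out.println("single pass: " + b + " words in " + (t2 - t1) / 1_000 + " us");
    }
}
```

Both versions visit each character once, so any difference is likely to be small compared with the cost of language detection itself.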