Here are a few things that could improve performance:
* make Vietnamese normalization optional
* make "strip URLs" and "strip email" optional: some (most?) "real"
applications already have some kind of text filtering; this library is only
intended for language detection; markup removal is another topic.
* use StringBuilder instead of StringBuffer for local variables, as
synchronization is not needed
* keep a static cache for normalization and uppercase: this will require more
memory but should improve performance.
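
To make the suggestions above concrete, here is a rough sketch in Java. It is not the code from the changeset: the class TextCleaner, its flags, and the two regexes are hypothetical names made up for illustration, and java.text.Normalizer (NFKC) only stands in for the library's own normalization step.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from the language-detection library.
final class TextCleaner {
    // Deliberately simple patterns, for illustration only.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");

    // Static cache shared by all instances: costs memory, saves repeated work.
    private static final Map<Character, String> NORMALIZED = new ConcurrentHashMap<>();

    private final boolean stripUrls;
    private final boolean stripEmails;

    TextCleaner(boolean stripUrls, boolean stripEmails) {
        this.stripUrls = stripUrls;
        this.stripEmails = stripEmails;
    }

    // Assumption: NFKC stands in for whatever per-character normalization is expensive.
    static String normalize(char c) {
        if (Character.isSurrogate(c)) {
            return String.valueOf(c); // leave halves of surrogate pairs untouched
        }
        return NORMALIZED.computeIfAbsent(
                c, k -> Normalizer.normalize(k.toString(), Normalizer.Form.NFKC));
    }

    String clean(String text) {
        String s = text;
        // Callers that already filter their input can switch both steps off.
        if (stripUrls) {
            s = URL.matcher(s).replaceAll(" ");
        }
        if (stripEmails) {
            s = EMAIL.matcher(s).replaceAll(" ");
        }
        // StringBuilder: the buffer is a local variable, so no synchronization is needed.
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(normalize(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

An application that already strips markup would create the cleaner with new TextCleaner(false, false) and skip both regex passes; the cache is filled lazily, so memory grows only with the number of distinct characters actually seen.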
I have created a clone of the project and pushed the changes there (under the
"optimizations" branch):
https://code.google.com/r/ionutcpaduraru-language-detection/
Here is the changeset:
https://code.google.com/r/ionutcpaduraru-language-detection/source/detail?r=13248df53f642409c7b0ab31ddc030b91c9afadb&name=optimizations
Feel free to use (or not to use) any of those changes.
Original issue reported on code.google.com by ionut.c....@gmail.com on 16 Feb 2013 at 11:23