optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

MAIL_REGEX should be limited #96

Open tballison opened 5 years ago

tballison commented 5 years ago

If you try to detect a string with 50000 'a's, the MAIL_REGEX in URLTextFilter takes a really, really long time.

If you add reasonable limits, the performance is much better. Change

    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

to:

    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,250}@[-_0-9A-Za-z]{1,250}[-_.0-9A-Za-z]{1,250}");
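
A rough way to reproduce the difference is sketched below. The likely cause of the slowdown is that, with the unbounded `+`, the matcher consumes the whole run of 'a's from every start position and then backs off one character at a time looking for an `@`, so the work grows roughly quadratically with input length. The `{1,250}` limit caps that work per start position. The class name `MailRegexDemo` and the helpers `repeat`, `stripMails`, and `timeMillis` are hypothetical and not part of language-detector, and the use of `replaceAll(" ")` is an assumption about how the filter applies the pattern.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MailRegexDemo {
    // Unbounded pattern, as quoted in the issue.
    static final Pattern UNBOUNDED =
            Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // Bounded pattern, as proposed in the issue.
    static final Pattern BOUNDED =
            Pattern.compile("[-_.0-9A-Za-z]{1,250}@[-_0-9A-Za-z]{1,250}[-_.0-9A-Za-z]{1,250}");

    // Builds a string of n copies of c (hypothetical helper).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Replaces every match with a single space. This is an assumption
    // about how the filter uses the pattern, not the library's code.
    static String stripMails(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Wall-clock time of one stripMails call, in milliseconds.
    static long timeMillis(Pattern pattern, String text) {
        long start = System.nanoTime();
        stripMails(pattern, text);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String input = repeat('a', 50_000);
        System.out.println("bounded:   " + timeMillis(BOUNDED, input) + " ms");
        System.out.println("unbounded: " + timeMillis(UNBOUNDED, input) + " ms");
    }
}
```

One trade-off to be aware of: with the bounded pattern, an address whose local part is longer than 250 characters is only matched from its last 250 characters before the `@`, so the two patterns are not strictly equivalent on extreme inputs.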