optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Improved mail regex #111

Closed jonmv closed 2 years ago

jonmv commented 2 years ago

There are two changes here:

  1. Include + in the local part, and disallow _ in the domain part. Other characters are also allowed in the local part, but they are less common (https://en.wikipedia.org/wiki/Email_address).
  2. Optimise the pattern for long contiguous runs of characters from the first character set that contain no @ (or otherwise fail to match).
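
As a rough illustration of the first change, the sketch below compares two character-set choices. Both patterns are assumptions written for this example rather than the expressions from the diff, and `MailCharsetSketch` and `strip` are hypothetical names.

```java
import java.util.regex.Pattern;

public class MailCharsetSketch {
    // Assumed "before" shape: no + in the local part, _ allowed in the domain.
    static final Pattern BEFORE =
            Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // Assumed "after" shape: + allowed in the local part, _ removed from the domain.
    static final Pattern AFTER =
            Pattern.compile("[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

    // Replace every match of the given pattern with a single space.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "mail:john.doe+news@example.com!";
        // Without + in the local part, only "news@example.com" is matched.
        System.out.println(strip(BEFORE, text));
        // With + in the local part, the whole address is matched.
        System.out.println(strip(AFTER, text));
    }
}
```

With the assumed "before" pattern the prefix `john.doe+` survives the replacement, which is the kind of leftover the first change is meant to avoid.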

Currently, replaceAll(" ") on a string of ~100K characters from the set [-_.0-9A-Za-z] takes ~1 minute on modern hardware, because the matcher rescans the rest of the run from every start position. Adding a negative lookbehind for a character from that set reduces this to a few milliseconds, and is functionally equivalent. (Consider the current pattern and a match from position i to k. If the character at i-1 were in the character set, there would also be a match from i-1 to k, which would already have been found and replaced.)
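
The equivalence argument can be checked with a small sketch. The pattern below is an assumed shape for the current expression, not a copy of the library's source, and `MailLookbehindSketch`, `PLAIN`, `GUARDED` and `strip` are hypothetical names. The same character set is used for the lookbehind and for the local part, as in the argument above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MailLookbehindSketch {
    // Assumed character set shared by the local part and the lookbehind.
    static final String SET = "[-_.0-9A-Za-z]";

    // Assumed shape of the current pattern.
    static final Pattern PLAIN =
            Pattern.compile(SET + "+@[-_0-9A-Za-z]+" + SET + "+");

    // Same pattern, but a match may not start directly after a character from SET.
    static final Pattern GUARDED =
            Pattern.compile("(?<!" + SET + ")" + SET + "+@[-_0-9A-Za-z]+" + SET + "+");

    // Replace every match of the given pattern with a single space.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Both patterns should produce the same output on ordinary text.
        String sample = "see a.b@c.de, x_y@z.org and foo@bar";
        System.out.println(strip(PLAIN, sample).equals(strip(GUARDED, sample)));

        // A long run from SET without any @: the guarded pattern only attempts
        // a full scan from the first position of the run.
        char[] run = new char[100_000];
        Arrays.fill(run, 'a');
        String longRun = new String(run);

        long start = System.nanoTime();
        String result = strip(GUARDED, longRun);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("unchanged: " + result.equals(longRun) + ", ms: " + millis);
    }
}
```

Running `PLAIN` on the same long input should show the slow case described above; it is left out of `main` so the sketch finishes quickly.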