Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene

dweiss commented 4 years ago

Apache Lucene has a jflex definition file (UAX29URLEmailTokenizerImpl.jflex) that is 21kb and includes some other jflex files that add ~40kb. The generated file (DFA) is 2.2MB so it's still relatively small. The generation process takes a whooping 10 minutes and requires 10 gigs of ram though – most of it is spent in constructing NFA and computing epsilon closures.

I though it'd be interesting to look into why this behaves this way. Perhaps it'd provide some clues for optimization.

You can reproduce construction on Lucene repository (https://github.com/apache/lucene-solr/)

java -jar c:\_tmp\jflex-1.7.0\lib\jflex-full-1.7.0.jar --verbose --encoding "UTF-8" lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex --skel lucene/core/src/data/jflex/skeleton.disable.buffer.expansion.txt

dweiss commented 4 years ago

Seems like most of the problem is caused by this: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro

I don't know if there is any other replacement that wouldn't explode the NFA internally...

dweiss commented 4 years ago

Right... I think we're experiencing this exactly.

  /**
   * Constructs an NFA accepting the complement of the language of a given NFA.
   *
   * <p>Converts the NFA into a DFA, then negates that DFA. Exponential state blowup possible and
   * common.
...
  private IntPair complement(IntPair nfa) {

lsf37 commented 4 years ago

Sorry for the silence, was mostly offline last week. I didn't directly spot any negation operator (!) in ASCIITLD.jflex-macro, but I might have overlooked it. Or is it in a spec that includes this one?

Basically, the NFA->DFA transformation is always potentially exponential, and the negation operator includes one. There are cases that must be exponential, because even the minimal DFA contains exponentially more states than the NFA, and there are cases where it's "just" exponential in the construction. 2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible.

One thing that is notable in ASCIITLD.jflex-macro is that it's a very large | of basically strings. The | constructs an NFA for each part and then connects them all to an initial state by epsilon transitions. This would explain why you are seeing a lot of time spent in epsilon closures, and it also explains the state explosion, because the result of that are basically DFA states that track which set of these alternatives are still possible. The size of these sets is potentially roughly exponential in the number of alternatives and maybe their length, depending on how soon the alternatives diverge from each other.

One thing to try would be to factor out common prefixes, e.g.

| [aA][bB][aA][rR][tT][hH]
| [aA][bB][bB]
| [aA][bB][bB][oO][tT][tT]
| [aA][bB][bB][vV][iI][eE]

=

 [aA][bB] ([aA][rR][tT][hH] | [bB] ([oO][tT][tT] | [vV][iI][eE])?

This should reduce the number of DFA states in initial construction, because the first character in the alternative already distinguishes which one it is. It basically does something that DFA minimisation would take care of later. The spec is big, of course, and manually transforming it will be error prone, so before even attempting it, it would be worth checking out if it helps enough.

The other question is if there is a negation operator on the big expression, if that operator could be somehow avoided.

lsf37 commented 4 years ago

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

dweiss commented 4 years ago

I am on short holidays this week, thanks for looking into it. I will get back to this once I am back. D.

lsf37 commented 4 years ago

No worries. If the new feature helps, we can probably push out the JFlex 1.8.0 release next week, so you can have the build on an official version.

dweiss commented 4 years ago

Hi Gerwin. Nah, I observe the same behavior on master, sadly. Compilation and takes forever and requires huge amounts of ram.

2.2MB in the final DFA is pretty big, so it sounds like at least some small part of it is not reducible, but compared to 10GB it's tiny, so at least some part should be reducible.

Yes, that's my impression too. I bet something could be optimized along the way (maybe at the cost of algorithm clarity) to reduce the number of epsilon transitions before the conversion to a DFA takes place. I wish I had more time to help out!

Tweaking ASCIITLD.jflex-macro is certainly possible although it'll make the definition very complex and hard to understand. As it stands now you can tell what it's doing; I think the compiler should implement this state reduction internally - this shouldn't be too difficult even with a hash-based state deduplication? It's easy for me to say, of course. :)

lsf37 commented 4 years ago

Alight, I'll look into it more deeply, surely there must be something we can do. It'll be a bit, though, next week (hopefully) for is the 1.8.0 release first.

dweiss commented 4 years ago

No problem at all - this isn't a very pressing matter as we regenerate those automata very infrequently. I was just surprised to see such vast disproportion in memory consumption between generation and final automaton.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container where you can detect nodes with the same input/ output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily so it may not be a trivial change...

lsf37 commented 4 years ago

Ok, good to know that you're not stuck because of it.

The hash state reduction idea (in case you didn't understand my brief note) is about deduplicating states based on an associative container where you can detect nodes with the same input/ output transitions (pointing at the same neighbouring nodes). I glanced at the code and saw bitsets used heavily so it may not be a trivial change...

Yes, that might be a bit of a project. I think I'll try extracting common factors on the regexp AST first, and see if that has an impact.

rmuir commented 2 years ago

Just had a look at UAX29URLEmailTokenizerImpl.jflex: it uses ! to work around the no-macros-in-charclasses restriction. This is now implemented in JFlex master (#654) and will be in the next release. That might well fix your problem.

Are you able to try out the master branch?

@lsf37 I finally tried out the new feature (macros in charclasses) and it's definitely a significant improvement! 285.26 seconds versus 918.87 sec.

So this NFA-complement that was happening due to the workaround was at least amplifying the problem significantly for us, progress.

Thanks!

lsf37 commented 1 year ago

Closing this as the maco-in-char-classes feature seems to have alleviated the problem and opened #1026 as a tracking issue for the potential common-prefix optimisation mentioned in this issue.

jflex-de / jflex

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715