[Feature] Unicode character classes are slow [sf#8]

lsf37 commented 9 years ago

Reported by willink on 2002-09-04 15:46 UTC

I just moved up from JLex.

Much impressed by the better speed, error detection.

Had a problwm with exponential time and memopry on 1.3.5, which seems to be much alleviated in the 1.4_pre1 (it now runs).

I suspest there is still something that could be done to speed large Unicode grammars.

It seems wrong that just expanding the number of lenumerated elements in an unchanging number of input character classes should change the number of DFA states, and consequently the DFA to NFA conversion time.

To demonstrate, use a typical XML grammar (attached). It requires 3187 NFA states, whereas after commenting out all 16 bit character lasses it needs only 1000 odd. The latter compiles quite rapidly, the former si slow but tractiavble with the pre-release.

lsf37 commented 9 years ago

Commented by willink on 2002-09-04 15:46 UTC Slow to compile grammar

lsf37 commented 9 years ago

Commented by lsf37 on 2002-09-25 12:23 UTC Logged In: YES user_id=93534

Sorry for the late answer, the bug tracker's "monitor" function seems to have let me down.

The problem with the character classes in the attached
grammar is that they are formulated with the "|" operator, not
as a pure char class expression (which would use only one [..]). JFlex translates this as the general "or" (which increases the initial DFA size) and not as character class (which wouldn't). The final DFA size (after minimization) should be the same, though (which doesn't help you, of course, if JFlex never reaches that state).

I realize that it is impractical to have char classes of this size
in one single expression in the specification. I will see if I can optimize the first stage (RegExp->DFA) so that JFlex recognizes when "|" is used for char class enumeration only.

lsf37 commented 9 years ago

Updated by lsf37 on 2002-09-25 12:25 UTC

labels: --> generator
assigned_to: nobody --> lsf37

lsf37 commented 9 years ago

Updated by lsf37 on 2004-04-12 12:31 UTC

status: open --> closed

jflex-de / jflex

[Feature] Unicode character classes are slow [sf#8] #143