Speed-up for nested block pattern matching

GoogleCodeExporter commented 9 years ago

Before change:

input string length: 475
4000 iterations in 3814 ms (0.9535 ms per iteration)
input string length: 2356
1000 iterations in 4215 ms (4.215 ms per iteration)
input string length: 27737
100 iterations in 5908 ms (59.08 ms per iteration)
input string length: 11075
1 iteration in 25 ms
input string length: 88607
1 iteration in 278 ms
input string length: 354431
1 iteration in 2386 ms

After:

input string length: 475
4000 iterations in 3756 ms (0.939 ms per iteration)
input string length: 2356
1000 iterations in 4196 ms (4.196 ms per iteration)
input string length: 27737
100 iterations in 4753 ms (47.53 ms per iteration)
input string length: 11075
1 iteration in 23 ms
input string length: 88607
1 iteration in 190 ms
input string length: 354431
1 iteration in 1027 ms

with all unit tests passing.

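So a moderate speed-up that grows with input size: the two smallest inputs are essentially unchanged, the 27737-character input takes about 20% less time (59.08 ms to 47.53 ms), and the 354431-character input takes about 57% less (2386 ms to 1027 ms).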

Change to:

        private static Regex _blocksNested = new Regex(string.Format(@"
                (                       # save in $1
                    ^                   # start of line (with /m)
                    <({0})              # start tag = $2
                    \b                  # word break
                    (?>.*\n)*?          # any number of lines, minimally matching
                    </\2>               # the matching end tag
                    [ \t]*              # trailing spaces/tabs
                    (?=\n+|\Z)          # followed by a newline or end of document
                )", _blockTags1), RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);

        private static string _blockTags2 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math";
        private static Regex _blocksNestedLiberal = new Regex(string.Format(@"
                (                       # save in $1
                    ^                   # start of line (with /m)
                    <({0})              # start tag = $2
                    \b                  # word break
                    (?>.*\n)*?          # any number of lines, minimally matching
                    .*</\2>             # the matching end tag
                    [ \t]*              # trailing spaces/tabs
                    (?=\n+|\Z)          # followed by a newline or end of document
                )", _blockTags2), RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);

The important part is:

(?>.*\n)*?

instead of:

(.*\n)*?
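
For anyone who wants to try the comparison outside MarkdownSharp, here is a minimal sketch. It is not the project's code: the class name `AtomicGroupDemo`, the helper methods, and the single hard-coded `div` tag are all assumptions made for illustration, and the pattern is written on one line instead of using `RegexOptions.IgnorePatternWhitespace`. The likely reason for the speed-up is that the atomic group `(?>...)` does not record a capture for every line and does not let the engine backtrack into a line it has already consumed, but that is my reading and has not been profiled.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of MarkdownSharp.
public static class AtomicGroupDemo
{
    private const RegexOptions Options = RegexOptions.Multiline | RegexOptions.Compiled;

    // Simplified stand-in for the old pattern: a capturing group around each line.
    public static readonly Regex Capturing = new Regex(
        @"^<(div)\b(.*\n)*?</\1>[ \t]*(?=\n+|\Z)", Options);

    // Same pattern, with the per-line loop wrapped in an atomic group.
    public static readonly Regex Atomic = new Regex(
        @"^<(div)\b(?>.*\n)*?</\1>[ \t]*(?=\n+|\Z)", Options);

    // Builds one <div> block containing `lines` lines of filler text.
    public static string BuildInput(int lines)
    {
        var sb = new StringBuilder("<div>\n");
        for (int i = 0; i < lines; i++)
            sb.Append("some filler text on line ").Append(i).Append('\n');
        sb.Append("</div>\n");
        return sb.ToString();
    }

    // Runs the regex `iterations` times and returns the elapsed milliseconds.
    public static long TimeMs(Regex regex, string input, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            regex.Match(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string input = BuildInput(2000);
        bool same = Capturing.Match(input).Value == Atomic.Match(input).Value;
        Console.WriteLine("same match: " + same);
        Console.WriteLine("capturing: " + TimeMs(Capturing, input, 100) + " ms");
        Console.WriteLine("atomic:    " + TimeMs(Atomic, input, 100) + " ms");
    }
}
```

Both patterns should return the same match text; only the timings differ, and those will vary with the runtime, the machine, and the input shape.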

Original issue reported on code.google.com by wcshie...@gmail.com on 4 Jan 2010 at 10:22

GoogleCodeExporter commented 9 years ago

Beware, this area is slated to change entirely in 1.07/1.08 -- the last two failing tests have to do with the horribly broken HTML block parser.

Original comment by wump...@gmail.com on 5 Jan 2010 at 1:05

GoogleCodeExporter commented 9 years ago

Thanks for the contribution -- unfortunately it is now obsolete because of the new HTML block parser in r74.

Original comment by wump...@gmail.com on 6 Jan 2010 at 10:56

Changed state: WontFix

jawkom / markdownsharp

Speed-up for nested block pattern matching #17