Detab function performance and \t matching in regexes

GoogleCodeExporter commented 9 years ago

The original Detab function uses a regex (_deTab) to convert tabs to an 
appropriate amount of spaces. Looking at this approach, it seems that a lot 
of strings are generated rather unnecessary, and a lot of overhead is 
involved.

Preliminary performance benchmark was:
input string length: 475
6000 iterations in 4691 ms (0,781833333333333 ms per iteration)
input string length: 2356
1500 iterations in 4489 ms (2,99266666666667 ms per iteration)
input string length: 27737
150 iterations in 5159 ms (34,3933333333333 ms per iteration)
input string length: 11075
300 iterations in 4506 ms (15,02 ms per iteration)
input string length: 88607
35 iterations in 4268 ms (121,942857142857 ms per iteration)
input string length: 354431
10 iterations in 4857 ms (485,7 ms per iteration)

By doing this fairly crucial (many input lines match _deTab at least once) 
part "manually" - see attached diff file for exact code - an about 10% 
performance gain can be achieved:
input string length: 475
6000 iterations in 4187 ms (0,697833333333333 ms per iteration)
input string length: 2356
1500 iterations in 4011 ms (2,674 ms per iteration)
input string length: 27737
150 iterations in 4464 ms (29,76 ms per iteration)
input string length: 11075
300 iterations in 4087 ms (13,6233333333333 ms per iteration)
input string length: 88607
35 iterations in 3922 ms (112,057142857143 ms per iteration)
input string length: 354431
10 iterations in 4451 ms (445,1 ms per iteration)

Also, it is now fairly obvious that all '\t' are removed from the input 
stream at this point, therefore all "[ \t]" (or similar) parts in the 
regular expressions can have their tab-matching stripped out. The savings 
in this (apart from readability) vary pretty big with respect to the input:

input string length: 475
6000 iterations in 4114 ms (0,685666666666667 ms per iteration)
input string length: 2356
1500 iterations in 3826 ms (2,55066666666667 ms per iteration)
input string length: 27737
150 iterations in 4239 ms (28,26 ms per iteration)
input string length: 11075
300 iterations in 3884 ms (12,9466666666667 ms per iteration)
input string length: 88607
35 iterations in 3719 ms (106,257142857143 ms per iteration)
input string length: 354431
10 iterations in 4264 ms (426,4 ms per iteration)

All unit tests pass the same after these modifications.

Original issue reported on code.google.com by Shio...@gmail.com on 10 Jan 2010 at 1:43

Attachments:

detab.diff

GoogleCodeExporter commented 9 years ago

Actually, I just noticed that "CodeBlockEvaluator" calls Detab again. Since no 
tabs are  
inserted anywhere, this precaution is unnecessary and should be removable. This 
does 
not have any measurable impact on the Benchmark results on my machine, however.

Original comment by Shio...@gmail.com on 10 Jan 2010 at 2:30

GoogleCodeExporter commented 9 years ago

I was thinking that we needed to do this, as PHPMarkdown does the same. Makes 
WAY
more sense.

Excellent and very helpful contribution, checked in as r96

(and yes, I removed all the /t from regex and the codeblockevaluator call as 
well --
we are TAB-FREE!)

Original comment by wump...@gmail.com on 11 Jan 2010 at 2:29

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I think we should merge the newline normalization code and the detab code into 
the same 
loop -- don't you? Can you give it a shot?

Original comment by wump...@gmail.com on 11 Jan 2010 at 4:53

GoogleCodeExporter commented 9 years ago

Yep, will do.

Original comment by Shio...@gmail.com on 11 Jan 2010 at 5:06

joshids / markdownsharp

Detab function performance and \t matching in regexes #22