Avoid repeated re.compile() calls - can be an advantage on Raspberry Pi and other low powered systems

GoogleCodeExporter commented 9 years ago

I'm attaching a small patch that will avoid repeated re.compile() calls for a 
couple of static patterns that are used very often.

In iPlayer2 2.5.4, there are two patterns that are re-compiled potentially 
hundreds of times when drilling down into channels and shows. Each compile will 
cost - on average, on a 1GHz R-Pi - about 180us per compile.

This might not sound a lot, but when re.compile() is called over 500 times for 
each of these two patterns, it's wasted time that quickly adds up.

Avoiding these repeated compiles can therefore be beneficial over the longer 
term, and the code change to achieve this is relatively trivial.

In general, all RE patterns should be compiled, and compiled only once at most, 
to avoid taking the hit on repeated compilation of static patterns. 

While the re package does have a small compile cache, finding the pattern to be 
compiled in the cache can sometimes take as long as compiling a new pattern 
from scratch so when possible don't compile unless you have to!

The two fixes in the patch are low hanging fruit, there are more expensive 
re.compiles in http2lib that could be optimised away but that requires more 
significant changes such as storing the compiled patterns outside of the 
http2lib module, or in a global cache object.

Original issue reported on code.google.com by n...@nmacleod.com on 21 Mar 2014 at 2:13

Attachments:

iplayer_254_regex.patch

GoogleCodeExporter commented 9 years ago

I'm with you halfway. Looking at the patch, there are not going to be regular 
expressions being compiled always, even when they will not be used. I think 
it's probably only worth doing ones that always get used, or compile only when 
they will be used. 

Your patch seems to include white space changes too, which would need a clean 
up before applying (I prefer to do cosmetic fixes like trailing spaces in a 
separate commit). I might take some portions of the patch however for now. 
thanks.

Original comment by exob...@gmail.com on 27 Mar 2014 at 2:13

GoogleCodeExporter commented 9 years ago

Sorry about the white space, my text editor automatically removes white space 
at the end of lines.

About the compiled expressions, my analysis[1] shows that the static patterns 
are being (re-)compiled many hundreds of times, even when they are not 
required. Although the elapsed time isn't particularly excessive, it is 
unnecessary work that adds up and can be relatively easily avoided.

The series_match() and parse_entry_id() patterns are being re-compiled many 
hundreds of times whenever the user navigates into a new list, eg. browse into 
BBC One, or browse into Categories -> Children, and you'll see 
re.compile()/re.findall() called many hundreds of times.

The other patterns addressed by the patch (parse_asx(), download_subtitles()) 
are less likely to be (re-)compiled frequently however I changed those for the 
purpose of remaining consistent with all other regular expressions.

As a rule, I would suggest compiling static patterns only once whenever 
possible, and then re-using the previously compiled RE object as required.

From my analysis, although the "re" package maintains a cache of recently 
compiled patterns, looking up the pattern in the re compilation cache can 
sometimes take longer than compiling a new pattern, so simply avoiding repeated 
re.compile() calls can be beneficial.

Please try the analysis yourself (it should work even if you don't have the 
regex package installed) - you might be surprised how often the re.compile() 
method is being called.

There are additional optimisations that emerge from the analysis, such as the 
use of direct calls to findall() with a static pattern, that then require 
compilation (or slow cache lookup) and which would be better replaced by a 
single re.compile() then calling findall() on the compiled object. This kind of 
optimisation I was going to leave for a subsequent patch.

Here's a sample of my analysis, literally a few seconds in iplayer drilling 
down to a series on BBC One. Some of the patterns you may recognise.

The number on the left is the frequency (ie. called 454 times), next is the 
method (d.foo indicates re.foo() has been called directly, and c.foo is a 
compiled pattern on which the foo() method is called), and finally the pattern 
string that was used for the method.

    454 d.findall       <updated[^>]*>(.*?)</updated>
    454 d.findall       <title[^>]*>(.*?)</title>
    454 d.findall       <link rel="self" .*title=".*pisode *([0-9]+?)
    454 d.findall       <link rel="related" href=".*microsite.*title="(.*?)" />
    454 d.findall       <id[^>]*>(.*?)</id>
    454 d.findall       <content[^>]*>(.*?)</content>
    454 d.findall       <category[^>]*term="(.*?)"[^>]*>
    454 c.findall       PIPS:([0-9a-z]{8})
    310 d.findall       <link rel="self" .*title="([0-9]+?)\.
      2 d.findall       <entry>(.*?)</entry>

1. http://forum.xbmc.org/showthread.php?tid=170921&pid=1664571#pid1664571

Original comment by n...@nmacleod.com on 27 Mar 2014 at 3:16

GoogleCodeExporter commented 9 years ago

I wasn't disagreeing - merely pointing out that some are not used often, such 
as the asx one for example.

To further the discussion, I bumped into 

https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile

(check top answer), which suggests compiled regular expressions are cached 
anyway.

Original comment by exob...@gmail.com on 27 Mar 2014 at 6:18

GoogleCodeExporter commented 9 years ago

(Ill post this to the forum thread for discussion, as it doesn't correlate with 
the timings etc, but I thought interesting)

Original comment by exob...@gmail.com on 27 Mar 2014 at 6:22

GoogleCodeExporter commented 9 years ago

Yes there is a cache in the re package, and based on the timings it can 
sometimes be as slow as the original pattern compilation, so if the entire 
compile() can be avoided that would be a win.

In other words, the cache still has a cost and isn't free.

The original purpose of this investigation was to see if the regex[1] package 
showed any performance gain compared with re, and it turns out the pattern 
cache processing in regex is a bit more complex and involved than that that in 
re, resulting in regex calls to compile() for a previously cached pattern quite 
often being slower than compiling a new pattern. And new pattern compilation in 
regex is generally slower than re, as a rule.

I'm not entirely sure what the plans are for regex, but I get the impression 
that it might be being lined up as a drop in replacement for re, so again 
avoiding the compile() call completely when it's not required would mean less 
of a headache if regex ever does replace re, since regex.compile() is slower 
than re.compile().

I know I'm being a bit anal/pedantic (delete as appropriate) but I do think 
these little wins will eventually add up over the long run. A few percent saved 
here, another few percent saved there, all helps in the end. :)

Will continue the discussion in the forum thread - thanks.

1. https://pypi.python.org/pypi/regex

Original comment by n...@nmacleod.com on 27 Mar 2014 at 6:46

GoogleCodeExporter commented 9 years ago

Thanks for the info. I'll have a read about regex too. I don't think it's too 
pedantic. Well written optimised code that is still readable is always good :)

Original comment by exob...@gmail.com on 27 Mar 2014 at 7:00

GoogleCodeExporter commented 9 years ago

Applied in r158. thanks.

Original comment by exob...@gmail.com on 29 Apr 2014 at 12:37

Changed state: Fixed

joolswills / plugin.video.iplayer

Avoid repeated re.compile() calls - can be an advantage on Raspberry Pi and other low powered systems #144