Closed GoogleCodeExporter closed 9 years ago
I'm with you halfway. Looking at the patch, there are not going to be regular
expressions being compiled always, even when they will not be used. I think
it's probably only worth doing ones that always get used, or compile only when
they will be used.
Your patch seems to include white space changes too, which would need a clean
up before applying (I prefer to do cosmetic fixes like trailing spaces in a
separate commit). I might take some portions of the patch however for now.
thanks.
Original comment by exob...@gmail.com
on 27 Mar 2014 at 2:13
Sorry about the white space, my text editor automatically removes white space
at the end of lines.
About the compiled expressions, my analysis[1] shows that the static patterns
are being (re-)compiled many hundreds of times, even when they are not
required. Although the elapsed time isn't particularly excessive, it is
unnecessary work that adds up and can be relatively easily avoided.
The series_match() and parse_entry_id() patterns are being re-compiled many
hundreds of times whenever the user navigates into a new list, eg. browse into
BBC One, or browse into Categories -> Children, and you'll see
re.compile()/re.findall() called many hundreds of times.
The other patterns addressed by the patch (parse_asx(), download_subtitles())
are less likely to be (re-)compiled frequently however I changed those for the
purpose of remaining consistent with all other regular expressions.
As a rule, I would suggest compiling static patterns only once whenever
possible, and then re-using the previously compiled RE object as required.
From my analysis, although the "re" package maintains a cache of recently
compiled patterns, looking up the pattern in the re compilation cache can
sometimes take longer than compiling a new pattern, so simply avoiding repeated
re.compile() calls can be beneficial.
Please try the analysis yourself (it should work even if you don't have the
regex package installed) - you might be surprised how often the re.compile()
method is being called.
There are additional optimisations that emerge from the analysis, such as the
use of direct calls to findall() with a static pattern, that then require
compilation (or slow cache lookup) and which would be better replaced by a
single re.compile() then calling findall() on the compiled object. This kind of
optimisation I was going to leave for a subsequent patch.
Here's a sample of my analysis, literally a few seconds in iplayer drilling
down to a series on BBC One. Some of the patterns you may recognise.
The number on the left is the frequency (ie. called 454 times), next is the
method (d.foo indicates re.foo() has been called directly, and c.foo is a
compiled pattern on which the foo() method is called), and finally the pattern
string that was used for the method.
454 d.findall <updated[^>]*>(.*?)</updated>
454 d.findall <title[^>]*>(.*?)</title>
454 d.findall <link rel="self" .*title=".*pisode *([0-9]+?)
454 d.findall <link rel="related" href=".*microsite.*title="(.*?)" />
454 d.findall <id[^>]*>(.*?)</id>
454 d.findall <content[^>]*>(.*?)</content>
454 d.findall <category[^>]*term="(.*?)"[^>]*>
454 c.findall PIPS:([0-9a-z]{8})
310 d.findall <link rel="self" .*title="([0-9]+?)\.
2 d.findall <entry>(.*?)</entry>
1. http://forum.xbmc.org/showthread.php?tid=170921&pid=1664571#pid1664571
Original comment by n...@nmacleod.com
on 27 Mar 2014 at 3:16
I wasn't disagreeing - merely pointing out that some are not used often, such
as the asx one for example.
To further the discussion, I bumped into
https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile
(check top answer), which suggests compiled regular expressions are cached
anyway.
Original comment by exob...@gmail.com
on 27 Mar 2014 at 6:18
(Ill post this to the forum thread for discussion, as it doesn't correlate with
the timings etc, but I thought interesting)
Original comment by exob...@gmail.com
on 27 Mar 2014 at 6:22
Yes there is a cache in the re package, and based on the timings it can
sometimes be as slow as the original pattern compilation, so if the entire
compile() can be avoided that would be a win.
In other words, the cache still has a cost and isn't free.
The original purpose of this investigation was to see if the regex[1] package
showed any performance gain compared with re, and it turns out the pattern
cache processing in regex is a bit more complex and involved than that that in
re, resulting in regex calls to compile() for a previously cached pattern quite
often being slower than compiling a new pattern. And new pattern compilation in
regex is generally slower than re, as a rule.
I'm not entirely sure what the plans are for regex, but I get the impression
that it might be being lined up as a drop in replacement for re, so again
avoiding the compile() call completely when it's not required would mean less
of a headache if regex ever does replace re, since regex.compile() is slower
than re.compile().
I know I'm being a bit anal/pedantic (delete as appropriate) but I do think
these little wins will eventually add up over the long run. A few percent saved
here, another few percent saved there, all helps in the end. :)
Will continue the discussion in the forum thread - thanks.
1. https://pypi.python.org/pypi/regex
Original comment by n...@nmacleod.com
on 27 Mar 2014 at 6:46
Thanks for the info. I'll have a read about regex too. I don't think it's too
pedantic. Well written optimised code that is still readable is always good :)
Original comment by exob...@gmail.com
on 27 Mar 2014 at 7:00
Applied in r158. thanks.
Original comment by exob...@gmail.com
on 29 Apr 2014 at 12:37
Original issue reported on code.google.com by
n...@nmacleod.com
on 21 Mar 2014 at 2:13Attachments: