Closed GoogleCodeExporter closed 9 years ago
That bug is in 'fast_string_search', and there's a similar one in
'fast_string_search_ign'. I'll need to have a closer look to see whether there
are any others.
By the way, you don't need to set the 'concurrent' option if you're working on
unicode or str because the regex module knows that they're immutable and
therefore won't change unexpectedly, so it can safely release the GIL.
What is that "max_iterations=1000000000" and "match.iterations()" stuff?
Original comment by re...@mrabarnett.plus.com
on 3 Jun 2013 at 5:15
hey thanks a bunch for your fast response! Yup I spotted the one in
fast_string_search_ign after I made my bug report.
I didn't know about the the concurrent stuff so I also appreciate you
clearifying that.
I had a couple of runaway regex's which I couldn't get a good handle on so I
made a controllable threshold for the number of iterations allowed against
match as part of a single match, search etc operation. I found it very useful
in guaranteeing rogue regex's wouldn't end up pegging cpus for very long
periods of time. I've included the patch file (for just the max_iterations
part) below, but I doubt it would be something you would want to include in the
release version, but who knows :).
Original comment by peterswa...@gmail.com
on 4 Jun 2013 at 12:03
Attachments:
Unfortunately those 2 fixes break the unit tests. I'm working on a proper fix
now.
I don't want to put a limit on the number of iterations because that's
dependent on the speed of the CPU. A better solution would be to have a time
limit, but I don't know of a cross-platform way of doing that, and I haven't
had a problem with long-running regexes in practice.
Original comment by re...@mrabarnett.plus.com
on 4 Jun 2013 at 12:37
Fixed (I hope!) in regex 2013-06-05.
Original comment by re...@mrabarnett.plus.com
on 5 Jun 2013 at 2:10
Yup I agree about the time limit stuff. It was really a bit of a hack I had to
do just because one in about 10 million was blowing up (probably a poorly
designed regex on my part). Thanks so much for looking into this and fixing it
so quickly! This is the fastest fix I think I've ever seen! I'm glad I'm using
this module! :)
Original comment by peterswa...@gmail.com
on 5 Jun 2013 at 7:52
I just downloaded the latest version and tested and I think there may have been
a regression (not entirely sure). Or perhaps my regex is just garbage to begin
with. :(
I only included a minimalistic regex in my original bug report because I wanted
to avoid giving you unrelated stuff that would waste your time.
When I tried the latest release it is failing on my full regex run against the
file I attached above. It was working as of my modified version of 2013-03-11
with the change I pasted above.
It just fails to match after 1 second.
Any insights you could provide on why it would no longer match when it used to
would be appreciated. I just wonder if perhaps it was a regression.
import regex
auction_page_re = regex.compile(ur'(?>.*?<h1 property="auction:Title")' +
ur'(?>(?<=<h1 property="auction:Title")>(?P<title>[^<]+)</h1>)' +
ur'(?>.*?<a href="http://category.auctions.yahoo.co.jp/list/)' +
ur'(?>(?<=<a href="http://category.auctions.yahoo.co.jp/list/)[^/]+/(?P<category_id>\d+)/[^"]+">[^<]+</a>\s*</div>\s*<!--\s*/CATEGORYPATH\s*-->)' +
ur'((?>.*?<p property="auction:Price"|<p property="auction:BidOrBuyPrice"|<th >\u6b8b\u308a\u6642\u9593</th>)' +
ur'(?>((?<=<p property="auction:Price")>(?P<current_price>[\d,]*?) \u5186</p>)?))?' +
ur'((?>.*?<p property="auction:BidOrBuyPrice"|<th >\u6b8b\u308a\u6642\u9593</th>)' +
ur'(?>((?<=<p property="auction:BidOrBuyPrice")>(?P<bid_or_buy_price>[\d,]*?) \u5186</p>)?))?' +
ur'(?>.*?<th >\u6b8b\u308a\u6642\u9593</th>)' +
ur'(?>(?<=<th >\u6b8b\u308a\u6642\u9593</th>)[^<]*<td>\uff1a </td>[^<]*<td>(<font[^>]*>)?(?P<time_remaining>[^<]+)<)' +
ur'(?>.*?<b property="auction:Bids")' +
ur'(?>(?<=<b property="auction:Bids")>(?P<bids>[\d,]*?))' +
ur'(?>.*?<td property="auction:Quantity")' +
ur'(?>(?<=<td property="auction:Quantity") content="[^"]*">(?P<quantity>[\d,]+)\s*(?P<quantity_note>[^<]*)?\s*</td>)' +
ur'(?>.*?<td property="auction:StartPrice")' +
ur'(?>(?<=<td property="auction:StartPrice")>(?P<start_price>[\d,]*?) \u5186</td>)' +
ur'(?>.*?<td property="auction:StartTime")' +
ur'(?>(?<=<td property="auction:StartTime")>((?P<start_time_month>[\d]+)\u6708 )?((?P<start_time_day>[\d]+)\u65e5 )?((?P<start_time_hour>[\d]+)\u6642 )?((?P<start_time_minute>[\d]+)\u5206)?</td>)' +
ur'(?>.*?<td property="auction:EndTime")' +
ur'(?>(?<=<td property="auction:EndTime")>((?P<end_time_month>[\d]+)\u6708 )?((?P<end_time_day>[\d]+)\u65e5 )?((?P<end_time_hour>[\d]+)\u6642 )?((?P<end_time_minute>[\d]+)\u5206)?<br>)' +
ur'((?>.*?<th >\u5165\u672d\u8005\u8a55\u4fa1\u5236\u9650</th>|<th >\u65e9\u671f\u7d42\u4e86</th>|<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
ur'(?>((?<=<th >\u5165\u672d\u8005\u8a55\u4fa1\u5236\u9650</th>)\s*<td>\uff1a </td>\s*<td>\u3042\u308a\s*<span>(?P<bidder_restriction>[^<]+)</span>\s*</td>)?))?' +
ur'((?>.*?<th >\u65e9\u671f\u7d42\u4e86</th>|<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
ur'(?>((?<=<th >\u65e9\u671f\u7d42\u4e86</th>)\s*<td>\uff1a </td>\s*<td>(?P<early_finish_enabled>(\u3042\u308a|\u306a\u3057))</td>)?))?' +
ur'((?>.*?<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
ur'(?>((?<=<th >\u81ea\u52d5\u5ef6\u9577</th>)\s*<td>\uff1a </td>\s*<td>(?P<auto_extension_enabled>(\u3042\u308a|\u306a\u3057))</td>)?))?' +
ur'(?>.*?<td property="auction:AuctionID")' +
ur'(?>(?<=<td property="auction:AuctionID")>(?P<auction_id>\w[\d]+)</td>)' +
ur'((?>.*?<td property="auction:ItemStatus"|<td property="auction:ReturnPolicy"|<b property="auction:SellerID")' +
ur'(?>((?<=<td property="auction:ItemStatus") content="[^"]*">\s*(?P<item_status>[^\s<]*?)\s*</td>)?))?' +
ur'((?>.*?<td property="auction:ReturnPolicy"|<b property="auction:SellerID")' +
ur'(?>((?<=<td property="auction:ReturnPolicy") content="[^"]*">\s*(?P<return_policy>[^\s<]*?)\s*</td>)?))?' +
ur'(?>.*?<b property="auction:SellerID")' +
ur'(?>(?<=<b property="auction:SellerID")>(?P<seller_name>[^<]+))' +
ur'(?>.*?<!--\s*IMAGEINFO\s*-->)' +
ur'(?>.*?<div class="pts01")' +
ur'(?>(?<=<div class="pts01")>\s*(?P<image_description>.*)\s*</div>\s*</div>\s*<div class="untFoot">)' +
ur'(?>.*?<!--\s*ITEMINFO\s*-->)' +
ur'(?>.*?<div class="modUsrPrv")' +
ur'(?>(?<=<div class="modUsrPrv")>\s*(?P<description>.*)\s*</div>\s*<div class="untFoot">)' +
ur'((?>.*?<a name="updatehistory"></a>)' +
ur'(?>((?<=<a name="updatehistory"></a>)(?>.*?<table[^>]*>)\s*(?P<update_history>.*?)</table>)?))?' +
ur'.*', regex.DOTALL)
import codecs
import glob
import time
import os
import traceback
match = None
data =
codecs.open('\\fails\\ignore_safe\\2013-06-03-08-38-49_auction-page-data_long-ru
nning-regex-backtrack-timeout-auction-data.html', 'r', 'utf-8').read()
start = time.time()
match = auction_page_re.match(data)
print time.time() - start
match.groupdict()
Original comment by peterswa...@gmail.com
on 5 Jun 2013 at 8:30
I can create a new ticket but since the module isn't crashing I wanted to
confirm it was indeed a regression and not a bug that was making it work to
begin with...
I'm combining some atomic logic with lookbehinds and "?" operators. So there's
a lot of things that could go wrong ;)
Original comment by peterswa...@gmail.com
on 5 Jun 2013 at 8:31
I've tried it in EditPad (my favourite editor), and, after a long wait, it too
failed to find a match.
I'll see what Perl says and report back if it does match.
Anyway, regex is probably the wrong tool for parsing HTML, the recommended tool
is BeautifulSoup.
Some comments on your regex:
There are places where you match something and then unnecessarily use a
look-behind to check that you matched it, for example, ur'(?>.*?<h1
property="auction:Title")(?>(?<=<h1
property="auction:Title")>(?P<title>[^<]+)</h1>)', which could be simplified to
ur'(?>.*?<h1 property="auction:Title")(?>>(?P<title>[^<]+)</h1>)', then
ur'(?>.*?<h1 property="auction:Title">)(?>(?P<title>[^<]+)</h1>)'.
regex.match(r'.*?abc', text, flags=regex.DOTALL) is basically the same as
regex.search(r'abc', text, flags=regex.DOTALL).
Adding the regex.DEBUG flag might give you a better idea of the structure of
the regex and where you could improve it.
Original comment by re...@mrabarnett.plus.com
on 5 Jun 2013 at 4:09
For the record, Perl also says no match.
Original comment by re...@mrabarnett.plus.com
on 5 Jun 2013 at 5:00
Thanks so much for all the help and the tip on BeautifulSoap.
It did indeed turn out it was my regex that was funked up and just worked
before out of chance. Changing afew .* to non greedy .*? fixed the issue!
Seems like this updated version is running faster too although it could just be
in my head :).
ur'(?>.*?<!--\s*IMAGEINFO\s*-->)' +
ur'(?>.*?<div class="pts01")' +
ur'(?>(?<=<div class="pts01")>\s*(?P<image_description>.*?)\s*</div>\s*</div>\s*<div class="untFoot">)' +
ur'(?>.*?<!--\s*ITEMINFO\s*-->)' +
ur'(?>.*?<div class="modUsrPrv")' +
ur'(?>(?<=<div class="modUsrPrv")>\s*(?P<description>.*?)\s*</div>\s*<div class="untFoot">)' +
ur'((?>.*?<a name="updatehistory"></a>)' +
ur'(?>((?<=<a name="updatehistory"></a>)(?>.*?<table[^>]*>)\s*(?P<update_history>.*?)</table>)?))?' +
Original comment by peterswa...@gmail.com
on 5 Jun 2013 at 10:15
Original issue reported on code.google.com by
peterswa...@gmail.com
on 3 Jun 2013 at 8:40Attachments: