running the following regex causes a segfault

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. running the following regex against the attached html causes a segfault

import regex
auction_page_re = regex.compile(ur'(?>.*?<p 
property="auction:Price>(?P<current_price>[\d,]*?) \u5186</p>)?' +
                                ur'(?>.*?<p property="auction:BidOrBuyPrice">(?P<bid_or_buy_price>[\d,]*?) \u5186</p>)?' +
                                ur'.*', regex.DOTALL)
import codecs
import glob
import time
import os
import traceback
match = None
data = 
codecs.open('\\fails\\ignore_safe\\2013-06-03-08-38-49_auction-page-data_long-ru
nning-regex-backtrack-timeout-auction-data.html', 'r', 'utf-8').read()
start = time.time()
match = auction_page_re.match(data, concurrent=True, max_iterations=1000000000)
print time.time() - start
match.groupdict()
match.iterations()

What is the expected output? What do you see instead?
I expected it to work.

What version of the product are you using? On what operating system?
2013-03-11

Please provide any additional information below.
The error appears to be centered in around the following lines.

    while (text_pos <= limit) {
        Py_UCS4 ch;

        ch = char_at(text, text_pos + last_pos);

char_at is being invoked with text_pos = 247775, last_pos = 26 and limit = 
247786. 

This looked like a simple overflow so I changed the logic too the following, 
and the regex isn't crashing python any more but it may be incorrect behaviour.

    while (text_pos + last_pos <= limit) {
        Py_UCS4 ch;

        ch = char_at(text, text_pos + last_pos);

Original issue reported on code.google.com by peterswa...@gmail.com on 3 Jun 2013 at 8:40

Attachments:

2013-06-03-08-38-49_auction-page-data_long-running-regex-backtrack-timeout-auction-data.html

GoogleCodeExporter commented 9 years ago

That bug is in 'fast_string_search', and there's a similar one in 
'fast_string_search_ign'. I'll need to have a closer look to see whether there 
are any others.

By the way, you don't need to set the 'concurrent' option if you're working on 
unicode or str because the regex module knows that they're immutable and 
therefore won't change unexpectedly, so it can safely release the GIL.

What is that "max_iterations=1000000000" and "match.iterations()" stuff?

Original comment by re...@mrabarnett.plus.com on 3 Jun 2013 at 5:15

GoogleCodeExporter commented 9 years ago

hey thanks a bunch for your fast response! Yup I spotted the one in 
fast_string_search_ign after I made my bug report. 

I didn't know about the the concurrent stuff so I also appreciate you 
clearifying that.

I had a couple of runaway regex's which I couldn't get a good handle on so I 
made a controllable threshold for the number of iterations allowed against 
match as part of a single match, search etc operation. I found it very useful 
in guaranteeing rogue regex's wouldn't end up pegging cpus for very long 
periods of time. I've included the patch file (for just the max_iterations 
part) below, but I doubt it would be something you would want to include in the 
release version, but who knows :).

Original comment by peterswa...@gmail.com on 4 Jun 2013 at 12:03

Attachments:

Python2.patch

GoogleCodeExporter commented 9 years ago

Unfortunately those 2 fixes break the unit tests. I'm working on a proper fix 
now.

I don't want to put a limit on the number of iterations because that's 
dependent on the speed of the CPU. A better solution would be to have a time 
limit, but I don't know of a cross-platform way of doing that, and I haven't 
had a problem with long-running regexes in practice.

Original comment by re...@mrabarnett.plus.com on 4 Jun 2013 at 12:37

GoogleCodeExporter commented 9 years ago

Fixed (I hope!) in regex 2013-06-05.

Original comment by re...@mrabarnett.plus.com on 5 Jun 2013 at 2:10

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Yup I agree about the time limit stuff. It was really a bit of a hack I had to 
do just because one in about 10 million was blowing up (probably a poorly 
designed regex on my part). Thanks so much for looking into this and fixing it 
so quickly! This is the fastest fix I think I've ever seen! I'm glad I'm using 
this module! :)

Original comment by peterswa...@gmail.com on 5 Jun 2013 at 7:52

GoogleCodeExporter commented 9 years ago

I just downloaded the latest version and tested and I think there may have been 
a regression (not entirely sure). Or perhaps my regex is just garbage to begin 
with. :(

I only included a minimalistic regex in my original bug report because I wanted 
to avoid giving you unrelated stuff that would waste your time.

When I tried the latest release it is failing on my full regex run against the 
file I attached above. It was working as of my modified version of 2013-03-11 
with the change I pasted above.

It just fails to match after 1 second.
Any insights you could provide on why it would no longer match when it used to 
would be appreciated. I just wonder if perhaps it was a regression.

import regex
auction_page_re = regex.compile(ur'(?>.*?<h1 property="auction:Title")' +
                                ur'(?>(?<=<h1 property="auction:Title")>(?P<title>[^<]+)</h1>)' +
                                ur'(?>.*?<a href="http://category.auctions.yahoo.co.jp/list/)' +
                                ur'(?>(?<=<a href="http://category.auctions.yahoo.co.jp/list/)[^/]+/(?P<category_id>\d+)/[^"]+">[^<]+</a>\s*</div>\s*<!--\s*/CATEGORYPATH\s*-->)' +
                                ur'((?>.*?<p property="auction:Price"|<p property="auction:BidOrBuyPrice"|<th >\u6b8b\u308a\u6642\u9593</th>)' +
                                ur'(?>((?<=<p property="auction:Price")>(?P<current_price>[\d,]*?) \u5186</p>)?))?' +
                                ur'((?>.*?<p property="auction:BidOrBuyPrice"|<th >\u6b8b\u308a\u6642\u9593</th>)' +
                                ur'(?>((?<=<p property="auction:BidOrBuyPrice")>(?P<bid_or_buy_price>[\d,]*?) \u5186</p>)?))?' +
                                ur'(?>.*?<th >\u6b8b\u308a\u6642\u9593</th>)' +
                                ur'(?>(?<=<th >\u6b8b\u308a\u6642\u9593</th>)[^<]*<td>\uff1a </td>[^<]*<td>(<font[^>]*>)?(?P<time_remaining>[^<]+)<)' +
                                ur'(?>.*?<b property="auction:Bids")' +
                                ur'(?>(?<=<b property="auction:Bids")>(?P<bids>[\d,]*?))' +
                                ur'(?>.*?<td property="auction:Quantity")' +
                                ur'(?>(?<=<td property="auction:Quantity") content="[^"]*">(?P<quantity>[\d,]+)\s*(?P<quantity_note>[^<]*)?\s*</td>)' +
                                ur'(?>.*?<td property="auction:StartPrice")' +
                                ur'(?>(?<=<td property="auction:StartPrice")>(?P<start_price>[\d,]*?) \u5186</td>)' +
                                ur'(?>.*?<td property="auction:StartTime")' +
                                ur'(?>(?<=<td property="auction:StartTime")>((?P<start_time_month>[\d]+)\u6708 )?((?P<start_time_day>[\d]+)\u65e5 )?((?P<start_time_hour>[\d]+)\u6642 )?((?P<start_time_minute>[\d]+)\u5206)?</td>)' +
                                ur'(?>.*?<td property="auction:EndTime")' +
                                ur'(?>(?<=<td property="auction:EndTime")>((?P<end_time_month>[\d]+)\u6708 )?((?P<end_time_day>[\d]+)\u65e5 )?((?P<end_time_hour>[\d]+)\u6642 )?((?P<end_time_minute>[\d]+)\u5206)?<br>)' +
                                ur'((?>.*?<th >\u5165\u672d\u8005\u8a55\u4fa1\u5236\u9650</th>|<th >\u65e9\u671f\u7d42\u4e86</th>|<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
                                ur'(?>((?<=<th >\u5165\u672d\u8005\u8a55\u4fa1\u5236\u9650</th>)\s*<td>\uff1a </td>\s*<td>\u3042\u308a\s*<span>(?P<bidder_restriction>[^<]+)</span>\s*</td>)?))?' +
                                ur'((?>.*?<th >\u65e9\u671f\u7d42\u4e86</th>|<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
                                ur'(?>((?<=<th >\u65e9\u671f\u7d42\u4e86</th>)\s*<td>\uff1a </td>\s*<td>(?P<early_finish_enabled>(\u3042\u308a|\u306a\u3057))</td>)?))?' +
                                ur'((?>.*?<th >\u81ea\u52d5\u5ef6\u9577</th>|<td property="auction:AuctionID")' +
                                ur'(?>((?<=<th >\u81ea\u52d5\u5ef6\u9577</th>)\s*<td>\uff1a </td>\s*<td>(?P<auto_extension_enabled>(\u3042\u308a|\u306a\u3057))</td>)?))?' +
                                ur'(?>.*?<td property="auction:AuctionID")' +
                                ur'(?>(?<=<td property="auction:AuctionID")>(?P<auction_id>\w[\d]+)</td>)' +
                                ur'((?>.*?<td property="auction:ItemStatus"|<td property="auction:ReturnPolicy"|<b property="auction:SellerID")' +
                                ur'(?>((?<=<td property="auction:ItemStatus") content="[^"]*">\s*(?P<item_status>[^\s<]*?)\s*</td>)?))?' +
                                ur'((?>.*?<td property="auction:ReturnPolicy"|<b property="auction:SellerID")' +
                                ur'(?>((?<=<td property="auction:ReturnPolicy") content="[^"]*">\s*(?P<return_policy>[^\s<]*?)\s*</td>)?))?' +
                                ur'(?>.*?<b property="auction:SellerID")' +
                                ur'(?>(?<=<b property="auction:SellerID")>(?P<seller_name>[^<]+))' +
                                ur'(?>.*?<!--\s*IMAGEINFO\s*-->)' +
                                ur'(?>.*?<div class="pts01")' +
                                ur'(?>(?<=<div class="pts01")>\s*(?P<image_description>.*)\s*</div>\s*</div>\s*<div class="untFoot">)' +
                                ur'(?>.*?<!--\s*ITEMINFO\s*-->)' +
                                ur'(?>.*?<div class="modUsrPrv")' +
                                ur'(?>(?<=<div class="modUsrPrv")>\s*(?P<description>.*)\s*</div>\s*<div class="untFoot">)' +
                                ur'((?>.*?<a name="updatehistory"></a>)' +
                                ur'(?>((?<=<a name="updatehistory"></a>)(?>.*?<table[^>]*>)\s*(?P<update_history>.*?)</table>)?))?' +
                                ur'.*', regex.DOTALL)

import codecs
import glob
import time
import os
import traceback
match = None
data = 
codecs.open('\\fails\\ignore_safe\\2013-06-03-08-38-49_auction-page-data_long-ru
nning-regex-backtrack-timeout-auction-data.html', 'r', 'utf-8').read()
start = time.time()
match = auction_page_re.match(data)
print time.time() - start
match.groupdict()

Original comment by peterswa...@gmail.com on 5 Jun 2013 at 8:30

GoogleCodeExporter commented 9 years ago

I can create a new ticket but since the module isn't crashing I wanted to 
confirm it was indeed a regression and not a bug that was making it work to 
begin with...
I'm combining some atomic logic with lookbehinds and "?" operators. So there's 
a lot of things that could go wrong ;)

Original comment by peterswa...@gmail.com on 5 Jun 2013 at 8:31

GoogleCodeExporter commented 9 years ago

I've tried it in EditPad (my favourite editor), and, after a long wait, it too 
failed to find a match.

I'll see what Perl says and report back if it does match.

Anyway, regex is probably the wrong tool for parsing HTML, the recommended tool 
is BeautifulSoup.

Some comments on your regex:

There are places where you match something and then unnecessarily use a 
look-behind to check that you matched it, for example, ur'(?>.*?<h1 
property="auction:Title")(?>(?<=<h1 
property="auction:Title")>(?P<title>[^<]+)</h1>)', which could be simplified to 
ur'(?>.*?<h1 property="auction:Title")(?>>(?P<title>[^<]+)</h1>)', then 
ur'(?>.*?<h1 property="auction:Title">)(?>(?P<title>[^<]+)</h1>)'.

regex.match(r'.*?abc', text, flags=regex.DOTALL) is basically the same as 
regex.search(r'abc', text, flags=regex.DOTALL).

Adding the regex.DEBUG flag might give you a better idea of the structure of 
the regex and where you could improve it.

Original comment by re...@mrabarnett.plus.com on 5 Jun 2013 at 4:09

GoogleCodeExporter commented 9 years ago

For the record, Perl also says no match.

Original comment by re...@mrabarnett.plus.com on 5 Jun 2013 at 5:00

GoogleCodeExporter commented 9 years ago

Thanks so much for all the help and the tip on BeautifulSoap.
It did indeed turn out it was my regex that was funked up and just worked 
before out of chance. Changing afew .* to non greedy .*? fixed the issue!
Seems like this updated version is running faster too although it could just be 
in my head :).

                                ur'(?>.*?<!--\s*IMAGEINFO\s*-->)' +
                                ur'(?>.*?<div class="pts01")' +
                                ur'(?>(?<=<div class="pts01")>\s*(?P<image_description>.*?)\s*</div>\s*</div>\s*<div class="untFoot">)' +
                                ur'(?>.*?<!--\s*ITEMINFO\s*-->)' +
                                ur'(?>.*?<div class="modUsrPrv")' +
                                ur'(?>(?<=<div class="modUsrPrv")>\s*(?P<description>.*?)\s*</div>\s*<div class="untFoot">)' +
                                ur'((?>.*?<a name="updatehistory"></a>)' +
                                ur'(?>((?<=<a name="updatehistory"></a>)(?>.*?<table[^>]*>)\s*(?P<update_history>.*?)</table>)?))?' +

Original comment by peterswa...@gmail.com on 5 Jun 2013 at 10:15

jamadden / mrab-regex-hg

running the following regex causes a segfault #92