jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

Segfault (exit code 138) - Unicode flag #111

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Unfortunately, I can't pin down a procedure that reproduces this 100% of the
time. The crash appears at random and seems to depend on almost any change in
the project.
Here is what is definitely connected:
1. Most important: compile the regex with the U flag. The bug doesn't appear
without it. Alternatively, the source string can be unicode, which turns the
flag on automatically.
2. It is somehow connected with the string literals that currently exist. My
regex sources are compiled from files, and I found one file that is involved:
once I remove or rename it, the bug doesn't appear. Strangely, it doesn't
depend on the file's content. Even an empty file, if referenced by name,
triggers the bug.
3. The bug doesn't always appear, and the same input doesn't guarantee a
reproduction. As I said, it depends on everything: even adding a simple print
or a string-encoding line can influence it, and so can the name of a file!

For now I have worked around the issue by removing the U flag and replacing
all unicode chars in the regexes with \u.... escapes (fortunately, I don't
need case insensitivity for those chars).
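
The workaround described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual code: the helper name escape_non_ascii and the sample pattern are hypothetical, and the sketch uses Python 3 with the standard-library re module so that it runs on its own. It assumes the non-ASCII characters in the pattern are not themselves preceded by a backslash.

```python
import re


def escape_non_ascii(pattern):
    """Return `pattern` with each non-ASCII character spelled as an escape.

    Hypothetical helper: assumes no non-ASCII character in `pattern`
    is already preceded by a backslash.
    """
    parts = []
    for ch in pattern:
        code = ord(ch)
        if code < 0x80:
            parts.append(ch)                  # plain ASCII, keep as-is
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)    # BMP character: \uXXXX
        else:
            parts.append("\\U%08x" % code)    # astral character: \UXXXXXXXX
    return "".join(parts)


# Made-up pattern containing two non-ASCII characters.
source = "(?<!caf\u00e9 )bar|na\u00efve"
ascii_source = escape_non_ascii(source)
print(ascii_source)

# Both spellings should find the same matches in the same text.
text = "na\u00efve bar"
escaped_hits = [m.group() for m in re.finditer(ascii_source, text)]
literal_hits = [m.group() for m in re.finditer(source, text)]
print(escaped_hits == literal_hits)
```

The escaped pattern source is pure ASCII, so it no longer needs to be a unicode object, which is what switches the U flag on automatically in the setup described here.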

Python is 2.7.5+, 64bit
OS is Ubuntu Desktop 64bit, little endian.

I'm happy to run any additional checks; just email me.

Thanks in advance!

Original issue reported on code.google.com by arseniy....@gmail.com on 5 May 2014 at 2:12

GoogleCodeExporter commented 9 years ago
Additional info:
Another thing that partially influences this is negative lookbehind combined
with alternation:

(?<!foo )bar|baz

This affects which of my many regexes crashes. Usually the one that crashes
has these two features: lookbehind and alternation. Once I remove them from
that regex, the bug appears on another regex.
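
For anyone trying to build a test case from the pattern above, it may help to recall how it groups: alternation binds most loosely, so the lookbehind guards only bar, not baz. A small sketch using the standard-library re module (the sample text is made up, and it is assumed that the regex module parses this pattern the same way):

```python
import re

# Made-up sample text, not taken from the report.
text = "foo bar foo baz bar"

# As written in the report: the lookbehind applies to the first branch only.
as_written = [m.group() for m in re.finditer(r"(?<!foo )bar|baz", text)]

# With an explicit group, the lookbehind guards both branches.
grouped = [m.group() for m in re.finditer(r"(?<!foo )(?:bar|baz)", text)]

print(as_written)  # ['baz', 'bar']
print(grouped)     # ['bar']
```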

Original comment by arseniy....@gmail.com on 5 May 2014 at 2:18

GoogleCodeExporter commented 9 years ago
I don't have enough information.

It might be that in certain circumstances it tries to read outside the target 
string, the result of which is unpredictable, but none of the testing I've done 
shows any such problem.

I need a regex and target string to test. You could email me directly if 
necessary.

Original comment by re...@mrabarnett.plus.com on 5 May 2014 at 5:08

GoogleCodeExporter commented 9 years ago
Additional info: the bug doesn't appear immediately. I have millions of
different target strings. The crash appears after calling reg.finditer()
100-1000 times on some combination of target strings. I tried to log the calls
and isolate the combination, and that's how I noticed the connection with
existing string objects: if I add or remove logging, encoding, or printing,
this somehow changes things and the bug doesn't reproduce.

Anyway, I will try to create a crashing module plus several source/target
files. The problem is an NDA that forbids me from posting the exact
source/target strings, so I have to isolate a minimal case and then alter it.
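
A stand-alone stress harness along the lines described above might look like the sketch below. It is not the reporter's code: the function name stress, the placeholder targets, and the round count are all hypothetical. It assumes Python 3, where faulthandler is in the standard library (on Python 2.7 it is only available as a separate backport package), and it falls back to the standard-library re module if regex is not installed.

```python
import faulthandler

try:
    import regex as re_engine   # third-party module this report is about
except ImportError:
    import re as re_engine      # fallback so the sketch still runs

# Ask Python to dump a traceback to stderr if the process dies on a fatal
# signal. Skipped quietly if stderr has no usable file descriptor.
try:
    faulthandler.enable()
except (AttributeError, OSError, RuntimeError, ValueError):
    pass


def stress(pattern_source, targets, rounds=1000):
    """Compile with the UNICODE flag and run finditer() over and over.

    Returns the total number of matches seen, so a run that survives
    still produces a value that can be compared between runs.
    """
    compiled = re_engine.compile(pattern_source, re_engine.UNICODE)
    total = 0
    for _ in range(rounds):
        for target in targets:
            for _match in compiled.finditer(target):
                total += 1
    return total


# Placeholder inputs; real source/target strings would replace these.
placeholder_targets = ["foo bar", "baz bar", "qux"]
print(stress(r"(?<!foo )bar|baz", placeholder_targets, rounds=100))
```

Because the crash reportedly depends on which other string objects exist, a harness like this may well fail to trigger it; the traceback dump is mainly useful once some combination of inputs does crash.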

Original comment by arseniy....@gmail.com on 6 May 2014 at 5:04