Open GoogleCodeExporter opened 8 years ago
Original comment by collinw
on 14 Apr 2009 at 11:38
As discussed in the Chromium blog post, even '|' in Python regexps, which I
would
guess is extremely common, requires alternatives to be tried from left to
right.
Perhaps there's a clever way to implement that within an NFA, but it's not
obvious.
If you leave out alternation, then the only interesting operators you have left
are
character classes and repetition. Special casing those may be worth it, though,
especially since regexps are composed out of units like that.
Original comment by reid.kle...@gmail.com
on 15 Apr 2009 at 10:06
Yeah, I saw your comment on that blog post. There's a lot more details (like
the bit
about | and &) in
http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Regular_Expressions,
I just
didn't want to duplicate it all here.
Original comment by collinw
on 15 Apr 2009 at 10:13
Original comment by collinw
on 27 May 2009 at 9:44
Original comment by collinw
on 27 May 2009 at 11:06
Original comment by collinw
on 27 May 2009 at 11:07
Moving to Fredrik, since he's volunteered to work on this.
Original comment by collinw
on 29 May 2009 at 4:12
Hi, for my own amusement I'm playing around with implementing regular
expressions on
top of LLVM. Right now I'm taking an NFA/DFA approach but if there's interest I
might
work on a backtracking implementation too. It seems to me that if you're going
to use
LLVM as JIT for Python code it might make sense as your JIT for regular
expression
code too. I'll keep an eye on your work and post if I complete anything that
might be
useful.
Original comment by i...@mckellar.org
on 20 Jun 2009 at 4:21
> it might make sense as your JIT for regular expression code
Absolutely, and that's exactly what we plan to do for the first iteration - the
current
regex engine (SRE) is basically a simple "word code" engine, and generating
LLVM IR
instead of word code from SRE's intermediate representation should be pretty
straightforward.
Original comment by fredrik.lundh
on 20 Jun 2009 at 4:30
Alright, so I've been playing with this a bit and have started building a regex
implementation on top of LLVM. It builds on SRE's parser so that it will accept
exactly the same grammar as the existing implementation. It uses LLVM to JIT and
reuses a few of the unladen-swallow LLVM tools like the optimization support and
PyGlobalLlvmData to get the ExecutionEngine. It's half C++ and half Python. The
C++
side is for compilation and the Python side provides interface compatibility
with SRE.
It's pretty incomplete. The Python interface just supports matching right now.
The
only flag it respects is DOTALL. It supports greedy repetition, literals,
character
ranges, wildcards (.) and subexpression extraction. The rest shouldn't be too
hard.
I'm hoping to get complete SRE compatibility, then look at improving the
quality of
the bytecode being generated, then hopefully look at using Thompson DFA
techniques
for regular expressions that will allow it. I previously built an LLVM Thompson
DFA
fnmatch implementation for fun:
http://ianloic.com/2009/07/15/an-fnmatch-implementation-using-finite-state-machi
nes-and-llvm/
The code is pretty messy right now. Most things have been renamed a couple of
times
and need to be renamed a couple more times. I also haven't looked at
performance at all.
Is anyone else working on this? Should I take this bug? I'm working mostly
offline
since I'm staying on the top of a mountain in the French Alps right now so I'm
not on
email very often.
Original comment by i...@mckellar.org
on 27 Jul 2009 at 8:51
Attachments:
I'm also working on this, but have been slowed down by work and vacations, and
you seem
to be a bit closer to something useful than I am. Feel free to take the lead
here; I
can help out with tweaking, turning, and testing.
Original comment by fredrik.lundh
on 27 Jul 2009 at 8:58
(and feel free to mail me off-line if you want to discuss something in more
detail;
you'll reach me at fredrik at pythonware.com as usual).
Original comment by fredrik.lundh
on 27 Jul 2009 at 8:59
re.sub("turning", "tuning", message)
Original comment by fredrik.lundh
on 27 Jul 2009 at 11:09
ian.l.mckellar: cool! I'll try to play with your patch later this week.
I'm somewhat skeptical that a Thompson-based approach will be applicable to the
majority of Python regexes, but I'm willing to be proven wrong. I know the V8
team
looked at Thompson NFAs for their new Irregexp engine, but found that they
would
only apply to a tiny slice of the regexes they see on the Internet, and I think
that
Python regex semantics are closer to Javascript regexes than to the semantics
Thompson was targeting.
I know you said you haven't looked at performance yet, but when you do, our
benchmark suite has a few different regex benchmarks in it already (see the
Benchmarks wiki page). I just added a regex_compile benchmark so we can see how
the LLVM-targeting compiler compares with the existing compiler. `./perf.py -r
-b
regex` will run all the regex benchmarks.
Original comment by collinw
on 28 Jul 2009 at 9:18
Hi, as I mentioned I've been traveling so I've been offline and I'm only
hacking every once in a while.
Mostly when I've got some downtime. I just spent a few days in Tuscany and
ended up staying up late to get
my regex compiler basically working[*].
I've put my fork/branch up on github at:
http://github.com/ianloic/unladen-swallow/tree/llvm-re
I'm using git-svn to track mainline unladen swallow. You should be able to
build it from the llvm-re
branch of my git repository and generate a patch by diffing that branch against
master.
Right now if you "import re" you get SRE but if you "import llvmre as re" you
get my stuff. I'll need to
do some performance testing next. I'll be surprised if it's great. The codegen
is function-heavy and I
haven't worked out how to get the optimizer to inline as much as I want.
I think there are a few memory leaks in there too. Nothing fundamental but I
haven't gone through the code
and tidied it up enough yet. I'm sure it's possible to crash the module by
passing the wrong arguments
into the native module, I need to check arguments more strictly, but luckily
the interface is pretty
small.
I don't feel confident that I will have this code ready for release by the end
of Q3 but I've got a few
chunks of time over November, December and January when I know I'll have time
to finish this off.
[*] it passes most of the standard Python re tests but doesn't correctly
backtrack group references,
doesn't do backward assertions, and a couple of other things. generally
real-world regular expressions
will be correct.
Original comment by i...@mckellar.org
on 13 Sep 2009 at 3:08
And of course, I pushed before doing a clean rebuild and everything is broken.
Some
LLVM changes (around LLVMContext, mostly) broke everything and now that I've
fixed it
there are some memory leaks that are so bad they cause the performance tests to
crash.
I've pushed a fix that makes llvmre build (and it's now what you get when you
"import
re"), but I need to pull out valgrind and identify what's leaking so badly.
I'll
update when there's code out there that I'm not so ashamed of.
Original comment by i...@mckellar.org
on 15 Sep 2009 at 8:56
Original comment by collinw
on 26 Sep 2009 at 12:05
Original comment by collinw
on 6 Jan 2010 at 11:41
Original issue reported on code.google.com by
collinw
on 14 Apr 2009 at 11:30