Open smasher164 opened 1 year ago
OCaml is quite popular in language development, and has a dearth of lexer generators with good unicode support.
Do you think re2c will be able to provide some advantage over these? The main strengths of re2c is generating fast code, providing a flexible user interface and supporting submatch extraction based on TDFA. Do you see any of these features lacking in existing OCaml generators? Porting re2c to a functional languages looks like an interesting challenge, but I'm trying to understand if it will be useful in practice.
If one were interested in contributing support, where should one look?
First thing would be to understand what you are going to generate and construct the desired output by hand. Take a look at the standard re2c examples --- they represent various use cases and different modes of using re2c. Examples will be among the first programs ported to the new language, so it's good to have an idea of how to express each of them in OCaml.
Once you have understanding how this can be done, src/codegen
is indeed the place to look. However, note that instead of hardcoding new languages in re2c, a better way is to support syntax files that would allow us to add language support in a less intrusive way. Sometimes people contribute incomplete support for new languages (e.g. lacking some of the examples / documentation), such as in the case of D, which had to be reverted .
Do you see any of these features lacking in existing OCaml generators?
ocamllex lacks unicode support is not reentrant does not have a drop-in UI
sedlex lacks submatch extraction does not have a drop-in UI
I've found the flexible UI to re2c to be the killer feature. Being able to define:YY...
its primitives inline with the rest of my code makes it very convenient to use. This is even without considering the performance argument.
so it's good to have an idea of how to express each of them in OCaml
My gut feeling is to take advantage of OCaml's [@tailcall]
annotation, so that you basically have access to computed GOTO, like in the C backend. Additionally, the new effect stuff in the runtime is supposedly pretty efficient, so you'd have access to arbitrary control flow at that point.
a better way is to support https://github.com/skvadrik/re2c/issues/450 that would allow us to add language support in a less intrusive way
I can mention this in that issue, but my worry is that the syntax file approach can no longer special-case behavior per-target. And while I understand you may not want to commit to a stable AST, there could be a third option. re2c can have its own internal IR for language-agnostic optimizations, and lower to a separate, stable IR that users can codegen from.
FWIW, I think supporting OCaml would be extremely cool. sedlex is pretty clunky, and re2c is much nicer. maybe if #450 were implemented it would be fairly straightforward to do this.
Btw, I disagree that OCaml lacks goto
and the like. You can do such things with tail recursion, and because it has proper tail recursion, implementing fast finite automata in OCaml is actually easier than in (say) Rust. You just do a call that never returns rather than having an explicit goto
.
@smasher164 Any interest in working on this yourself (or with assistance)?
FYI I'm slowly working on #450, which (if my experiment works out) should allow adding new backends with one config file.
Aside from that, the first thing to start for any new backend is to manually construct examples of the code that re2c should generate (look at the examples to get the idea of what should be covered).
I could probably manually create OCaml code examples if that would be of help for you in making #450 a reality. It's a very different language than the ones re2c currently supports so it might be a good test.
This is approximately what re2c needs to generate for the simplest example 01_basic.re (the match
part, not the whole program):
open Printf
let rec lex s yystate cursor =
match yystate with
| 0 ->
let yych = s.[cursor] in
let c = cursor + 1 in
(match yych with
| '1'..'9' -> lex s 2 c
| _ -> lex s 1 c)
| 1 -> false
| 2 ->
let yych = s.[cursor] in
(match yych with
| '0'..'9' ->
let c = cursor + 1 in
lex s 2 c
| _ -> lex s 3 cursor)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
if not (lex "1234\x00" 0 0) then raise (Failure "error")
let _ = main ()
It is based on the loop/switch mode with recursive function call instead of the loop. We could probably use goto/label mode as well with every state as a separate function and goto
replaced with a function call, but I think it would require more changes to codegen. For the above, we'll likely need another generic API primitive for the recursive invocation, as the user should be free to decide the signature of the lex
function. Things like let c = cursor + 1 in
and let yych = s.[cursor] in
map directly to YYSKIP
and YYPEEK
respectively, which are also defined by the user.
It should be possible to use this new codegen mode with other languages, as they all have recursive functions.
The most natural way to represent a state transition in OCaml is with a tail recursive function call; it will be much faster than most of the other options.
To echo @pmetzger, the way I would approach codegen here would be to just use tail recursive calls, and treat them like you treat goto
in the generated C or Go code. For example, the direct translation would yield something like (you can omit type annotations if you want)
let lex(s: string): bool =
let cursor : int ref = ref 0 in
let yych : char ref = ref s.[!cursor] in
let rec yy1(): bool =
cursor := !cursor + 1;
false
and yy2(): bool =
cursor := !cursor + 1;
yych := s.[!cursor];
(match !yych with
| '0'..'9' -> (yy2[@tailcall])()
| _ -> (yy3[@tailcall])())
and yy3(): bool = true in
match !yych with
| '1'..'9' -> (yy2[@tailcall])()
| _ -> (yy1[@tailcall])()
We have an outer lex
function that sets up some mutable state with ref
. We declare yy1
, yy2
, and yy3
as local closures, so they have access to this state. They are all tail-recursive, i.e. the last thing they do is either return a value or invoke another function. The invocations to (yy1[@tailcall])()
, (yy2[@tailcall])()
, and (yy3[@tailcall])()
are tail calls and the [@tailcall]
attribute will tell the ocaml compiler to optimize them into jumps, and warn if the functions don't meet the criteria (like if they're not tail recursive).
So basically,
[@tailcall]
.Note that the lex
function in my comment above is also tail-recursive (you can replace lex
with (lex [@tailcall])
, only there is one big tail-recursive function instead of many small ones. It is possible to go with many small functions (one for each state) if the new codegen mode is based on goto/label model, but it will probably require more work than adapting loop/switch model. Is there any clear advantage in one approach over the other?
If your approach will require less work during codegen, then I think it makes sense to go down that route. Tbh I just assumed that the goto/label model was more convenient for you.
The advantage of the "tail recursive function per state" approach is that it's going to have the highest possible performance. The code generator will turn it all into gotos (and then jump instructions in machine language.) This would also be the case for other functional languages like Haskell. Using the tailcall annotation in OCaml is not required, btw, but it will assure that the compiler will get angry if the call is not, in fact, tail recursive, thus avoiding mistakes.
Right, I also thought about performance. Trying to compare the generated code:
Program 1.ml uses recursive closures:
open Printf
type state = {
str: string;
mutable cur: int;
}
let lex st =
let rec yy0() =
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> (yy2 [@tailcall]) ()
| _ -> (yy1 [@tailcall]) ())
and yy1() = false
and yy2() =
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) ()
| _ -> (yy3 [@tailcall]) ())
and yy3() = true
in yy0 ()
let main () =
let st = { str = "1234\x00"; cur = 0; } in
if not (lex st) then raise (Failure "error")
let _ = main ()
Program 2.ml uses one recursive function with a match on yystate
:
open Printf
type state = {
str: string;
mutable cur: int;
mutable state: int;
}
let rec lex st =
let yystate = st.state in
match yystate with
| 0 ->
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 1; (lex [@tailcall]) st)
| 1 -> false
| 2 ->
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 3; (lex [@tailcall]) st)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
let st = { str = "1234\x00"; cur = 0; state = 0; } in
if not (lex st) then raise (Failure "error")
let _ = main ()
Let's see what instructions are generated by ocamlopt:
$ ocamlopt 1.ml && objdump -d 1.o && objdump -dr 1.o
...
0000000000000000 <caml1__lex_283>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 49 83 ef 68 sub $0x68,%r15
8: 4d 3b 3e cmp (%r14),%r15
b: 0f 82 99 00 00 00 jb aa <caml1__lex_283+0xaa>
11: 49 8d 5f 08 lea 0x8(%r15),%rbx
15: 48 c7 43 f8 f7 30 00 movq $0x30f7,-0x8(%rbx)
1c: 00
1d: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 24 <caml1__lex_283+0x24>
20: R_X86_64_REX_GOTPCRELX caml1__yy0_286-0x4
24: 48 89 3b mov %rdi,(%rbx)
27: 48 bf 17 00 00 00 00 movabs $0x100000000000017,%rdi
2e: 00 00 01
31: 48 89 7b 08 mov %rdi,0x8(%rbx)
35: 48 c7 43 10 f9 0c 00 movq $0xcf9,0x10(%rbx)
3c: 00
3d: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 44 <caml1__lex_283+0x44>
40: R_X86_64_REX_GOTPCRELX caml1__yy1_287-0x4
44: 48 89 7b 18 mov %rdi,0x18(%rbx)
48: 48 bf 11 00 00 00 00 movabs $0x100000000000011,%rdi
4f: 00 00 01
52: 48 89 7b 20 mov %rdi,0x20(%rbx)
56: 48 c7 43 28 f9 18 00 movq $0x18f9,0x28(%rbx)
5d: 00
5e: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 65 <caml1__lex_283+0x65>
61: R_X86_64_REX_GOTPCRELX caml1__yy2_288-0x4
65: 48 89 7b 30 mov %rdi,0x30(%rbx)
69: 48 bf 0b 00 00 00 00 movabs $0x10000000000000b,%rdi
70: 00 00 01
73: 48 89 7b 38 mov %rdi,0x38(%rbx)
77: 48 c7 43 40 f9 24 00 movq $0x24f9,0x40(%rbx)
7e: 00
7f: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 86 <caml1__lex_283+0x86>
82: R_X86_64_REX_GOTPCRELX caml1__yy3_289-0x4
86: 48 89 7b 48 mov %rdi,0x48(%rbx)
8a: 48 bf 05 00 00 00 00 movabs $0x100000000000005,%rdi
91: 00 00 01
94: 48 89 7b 50 mov %rdi,0x50(%rbx)
98: 48 89 43 58 mov %rax,0x58(%rbx)
9c: b8 01 00 00 00 mov $0x1,%eax
a1: 48 83 c4 08 add $0x8,%rsp
a5: e9 00 00 00 00 jmp aa <caml1__lex_283+0xaa>
a6: R_X86_64_PLT32 caml1__yy0_286-0x4
aa: e8 00 00 00 00 call af <caml1__lex_283+0xaf>
ab: R_X86_64_PLT32 caml_call_gc-0x4
af: e9 5d ff ff ff jmp 11 <caml1__lex_283+0x11>
b4: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
bb: 00 00 00 00
bf: 90 nop
00000000000000c0 <caml1__yy0_286>:
c0: 48 83 ec 08 sub $0x8,%rsp
c4: 4d 3b 3e cmp (%r14),%r15
c7: 76 71 jbe 13a <caml1__yy0_286+0x7a>
c9: 48 8b 7b 58 mov 0x58(%rbx),%rdi
cd: 48 8b 47 08 mov 0x8(%rdi),%rax
d1: 48 89 c6 mov %rax,%rsi
d4: 48 d1 fe sar %rsi
d7: 48 8b 17 mov (%rdi),%rdx
da: 48 8b 4a f8 mov -0x8(%rdx),%rcx
de: 48 c1 e9 0a shr $0xa,%rcx
e2: 48 8d 0c cd ff ff ff lea -0x1(,%rcx,8),%rcx
e9: ff
ea: 4c 0f b6 04 0a movzbq (%rdx,%rcx,1),%r8
ef: 4c 29 c1 sub %r8,%rcx
f2: 48 39 f1 cmp %rsi,%rcx
f5: 76 4a jbe 141 <caml1__yy0_286+0x81>
f7: 48 0f b6 34 32 movzbq (%rdx,%rsi,1),%rsi
fc: 48 8d 74 36 01 lea 0x1(%rsi,%rsi,1),%rsi
101: 48 83 c0 02 add $0x2,%rax
105: 48 89 47 08 mov %rax,0x8(%rdi)
109: 48 83 c6 9e add $0xffffffffffffff9e,%rsi
10d: 48 83 fe 11 cmp $0x11,%rsi
111: 76 15 jbe 128 <caml1__yy0_286+0x68>
113: 48 83 c3 18 add $0x18,%rbx
117: b8 01 00 00 00 mov $0x1,%eax
11c: 48 83 c4 08 add $0x8,%rsp
120: e9 00 00 00 00 jmp 125 <caml1__yy0_286+0x65>
121: R_X86_64_PLT32 caml1__yy1_287-0x4
125: 0f 1f 00 nopl (%rax)
128: 48 83 c3 30 add $0x30,%rbx
12c: b8 01 00 00 00 mov $0x1,%eax
131: 48 83 c4 08 add $0x8,%rsp
135: e9 00 00 00 00 jmp 13a <caml1__yy0_286+0x7a>
136: R_X86_64_PLT32 caml1__yy2_288-0x4
13a: e8 00 00 00 00 call 13f <caml1__yy0_286+0x7f>
13b: R_X86_64_PLT32 caml_call_gc-0x4
13f: eb 88 jmp c9 <caml1__yy0_286+0x9>
141: e8 00 00 00 00 call 146 <caml1__yy0_286+0x86>
142: R_X86_64_PLT32 caml_ml_array_bound_error-0x4
146: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
14d: 00 00 00
0000000000000150 <caml1__yy1_287>:
150: b8 01 00 00 00 mov $0x1,%eax
155: c3 ret
156: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
15d: 00 00 00
0000000000000160 <caml1__yy2_288>:
160: 48 83 ec 08 sub $0x8,%rsp
164: 4d 3b 3e cmp (%r14),%r15
167: 76 66 jbe 1cf <caml1__yy2_288+0x6f>
169: 48 8b 7b 28 mov 0x28(%rbx),%rdi
16d: 48 8b 47 08 mov 0x8(%rdi),%rax
171: 48 89 c6 mov %rax,%rsi
174: 48 d1 fe sar %rsi
177: 48 8b 17 mov (%rdi),%rdx
17a: 48 8b 4a f8 mov -0x8(%rdx),%rcx
17e: 48 c1 e9 0a shr $0xa,%rcx
182: 48 8d 0c cd ff ff ff lea -0x1(,%rcx,8),%rcx
189: ff
18a: 4c 0f b6 04 0a movzbq (%rdx,%rcx,1),%r8
18f: 4c 29 c1 sub %r8,%rcx
192: 48 39 f1 cmp %rsi,%rcx
195: 76 3f jbe 1d6 <caml1__yy2_288+0x76>
197: 48 0f b6 34 32 movzbq (%rdx,%rsi,1),%rsi
19c: 48 8d 74 36 01 lea 0x1(%rsi,%rsi,1),%rsi
1a1: 48 83 c6 a0 add $0xffffffffffffffa0,%rsi
1a5: 48 83 fe 13 cmp $0x13,%rsi
1a9: 76 15 jbe 1c0 <caml1__yy2_288+0x60>
1ab: 48 83 c3 18 add $0x18,%rbx
1af: b8 01 00 00 00 mov $0x1,%eax
1b4: 48 83 c4 08 add $0x8,%rsp
1b8: e9 00 00 00 00 jmp 1bd <caml1__yy2_288+0x5d>
1b9: R_X86_64_PLT32 caml1__yy3_289-0x4
1bd: 0f 1f 00 nopl (%rax)
1c0: 48 83 c0 02 add $0x2,%rax
1c4: 48 89 47 08 mov %rax,0x8(%rdi)
1c8: b8 01 00 00 00 mov $0x1,%eax
1cd: eb 95 jmp 164 <caml1__yy2_288+0x4>
1cf: e8 00 00 00 00 call 1d4 <caml1__yy2_288+0x74>
1d0: R_X86_64_PLT32 caml_call_gc-0x4
1d4: eb 93 jmp 169 <caml1__yy2_288+0x9>
1d6: e8 00 00 00 00 call 1db <caml1__yy2_288+0x7b>
1d7: R_X86_64_PLT32 caml_ml_array_bound_error-0x4
1db: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
00000000000001e0 <caml1__yy3_289>:
1e0: b8 03 00 00 00 mov $0x3,%eax
1e5: c3 ret
1e6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1ed: 00 00 00
Here's the second one:
$ ocamlopt 2.ml && objdump -d 2.o && objdump -dr 2.o
...
0000000000000000 <caml2__lex_284>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 48 89 c3 mov %rax,%rbx
7: 4d 3b 3e cmp (%r14),%r15
a: 0f 86 54 01 00 00 jbe 164 <caml2__lex_284+0x164>
10: 48 8b 43 10 mov 0x10(%rbx),%rax
14: 48 83 f8 07 cmp $0x7,%rax
18: 76 42 jbe 5c <caml2__lex_284+0x5c>
1a: 49 83 ef 18 sub $0x18,%r15
1e: 4d 3b 3e cmp (%r14),%r15
21: 0f 82 33 01 00 00 jb 15a <caml2__lex_284+0x15a>
27: 49 8d 47 08 lea 0x8(%r15),%rax
2b: 48 c7 40 f8 00 08 00 movq $0x800,-0x8(%rax)
32: 00
33: 48 8b 1d 00 00 00 00 mov 0x0(%rip),%rbx # 3a <caml2__lex_284+0x3a>
36: R_X86_64_REX_GOTPCRELX camlStdlib-0x4
3a: 48 8b 5b 30 mov 0x30(%rbx),%rbx
3e: 48 89 18 mov %rbx,(%rax)
41: 48 8b 1d 00 00 00 00 mov 0x0(%rip),%rbx # 48 <caml2__lex_284+0x48>
44: R_X86_64_REX_GOTPCRELX caml2__1-0x4
48: 48 89 58 08 mov %rbx,0x8(%rax)
4c: 49 8b 66 10 mov 0x10(%r14),%rsp
50: 41 8f 46 10 pop 0x10(%r14)
54: 41 5b pop %r11
56: 41 ff e3 jmp *%r11
59: 0f 1f 00 nopl (%rax)
5c: 48 d1 f8 sar %rax
5f: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # 66 <caml2__lex_284+0x66>
62: R_X86_64_PC32 .rodata-0x4
66: 48 63 04 82 movslq (%rdx,%rax,4),%rax
6a: 48 01 c2 add %rax,%rdx
6d: ff e2 jmp *%rdx
6f: 90 nop
70: 48 8b 43 08 mov 0x8(%rbx),%rax
74: 48 89 c7 mov %rax,%rdi
77: 48 d1 ff sar %rdi
7a: 48 8b 33 mov (%rbx),%rsi
7d: 48 8b 56 f8 mov -0x8(%rsi),%rdx
81: 48 c1 ea 0a shr $0xa,%rdx
85: 48 8d 14 d5 ff ff ff lea -0x1(,%rdx,8),%rdx
8c: ff
8d: 48 0f b6 0c 16 movzbq (%rsi,%rdx,1),%rcx
92: 48 29 ca sub %rcx,%rdx
95: 48 39 fa cmp %rdi,%rdx
98: 0f 86 d0 00 00 00 jbe 16e <caml2__lex_284+0x16e>
9e: 48 0f b6 3c 3e movzbq (%rsi,%rdi,1),%rdi
a3: 48 8d 7c 3f 01 lea 0x1(%rdi,%rdi,1),%rdi
a8: 48 83 c0 02 add $0x2,%rax
ac: 48 89 43 08 mov %rax,0x8(%rbx)
b0: 48 83 c7 9e add $0xffffffffffffff9e,%rdi
b4: 48 83 ff 11 cmp $0x11,%rdi
b8: 76 12 jbe cc <caml2__lex_284+0xcc>
ba: 48 c7 43 10 03 00 00 movq $0x3,0x10(%rbx)
c1: 00
c2: 48 89 d8 mov %rbx,%rax
c5: e9 3a ff ff ff jmp 4 <caml2__lex_284+0x4>
ca: 66 90 xchg %ax,%ax
cc: 48 c7 43 10 05 00 00 movq $0x5,0x10(%rbx)
d3: 00
d4: 48 89 d8 mov %rbx,%rax
d7: e9 28 ff ff ff jmp 4 <caml2__lex_284+0x4>
dc: b8 01 00 00 00 mov $0x1,%eax
e1: 48 83 c4 08 add $0x8,%rsp
e5: c3 ret
e6: 66 90 xchg %ax,%ax
e8: 48 8b 43 08 mov 0x8(%rbx),%rax
ec: 48 89 c7 mov %rax,%rdi
ef: 48 d1 ff sar %rdi
f2: 48 8b 33 mov (%rbx),%rsi
f5: 48 8b 56 f8 mov -0x8(%rsi),%rdx
f9: 48 c1 ea 0a shr $0xa,%rdx
fd: 48 8d 14 d5 ff ff ff lea -0x1(,%rdx,8),%rdx
104: ff
105: 48 0f b6 0c 16 movzbq (%rsi,%rdx,1),%rcx
10a: 48 29 ca sub %rcx,%rdx
10d: 48 39 fa cmp %rdi,%rdx
110: 76 5c jbe 16e <caml2__lex_284+0x16e>
112: 48 0f b6 3c 3e movzbq (%rsi,%rdi,1),%rdi
117: 48 8d 7c 3f 01 lea 0x1(%rdi,%rdi,1),%rdi
11c: 48 83 c7 a0 add $0xffffffffffffffa0,%rdi
120: 48 83 ff 13 cmp $0x13,%rdi
124: 76 12 jbe 138 <caml2__lex_284+0x138>
126: 48 c7 43 10 07 00 00 movq $0x7,0x10(%rbx)
12d: 00
12e: 48 89 d8 mov %rbx,%rax
131: e9 ce fe ff ff jmp 4 <caml2__lex_284+0x4>
136: 66 90 xchg %ax,%ax
138: 48 83 c0 02 add $0x2,%rax
13c: 48 89 43 08 mov %rax,0x8(%rbx)
140: 48 c7 43 10 05 00 00 movq $0x5,0x10(%rbx)
147: 00
148: 48 89 d8 mov %rbx,%rax
14b: e9 b4 fe ff ff jmp 4 <caml2__lex_284+0x4>
150: b8 03 00 00 00 mov $0x3,%eax
155: 48 83 c4 08 add $0x8,%rsp
159: c3 ret
15a: e8 00 00 00 00 call 15f <caml2__lex_284+0x15f>
15b: R_X86_64_PLT32 caml_call_gc-0x4
15f: e9 c3 fe ff ff jmp 27 <caml2__lex_284+0x27>
164: e8 00 00 00 00 call 169 <caml2__lex_284+0x169>
165: R_X86_64_PLT32 caml_call_gc-0x4
169: e9 a2 fe ff ff jmp 10 <caml2__lex_284+0x10>
16e: e8 00 00 00 00 call 173 <caml2__lex_284+0x173>
16f: R_X86_64_PLT32 caml_ml_array_bound_error-0x4
173: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
17a: 00 00 00 00
17e: 66 90 xchg %ax,%ax
re2c can generate both examples. Which one is faster?
I haven't measured it yet.
Will be curious about your measurements. Also, which compiler are you using? flambda will be more aggressive here.
Some measurements with ocamlopt:
a.ml:
type state = {
str: string;
mutable cur: int;
}
let lex st =
let rec yy0() =
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> (yy2 [@tailcall]) ()
| _ -> (yy1 [@tailcall]) ())
and yy1() = false
and yy2() =
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) ()
| _ -> (yy3 [@tailcall]) ())
and yy3() = true
in yy0 ()
let main () =
let s = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\x00" in
for i = 1 to 100000000 do
let st = { str = s; cur = 0; } in
if not (lex st) then raise (Failure "error")
done
let _ = main ()
b.ml:
type state = {
str: string;
mutable cur: int;
mutable state: int;
}
let rec lex st =
let yystate = st.state in
match yystate with
| 0 ->
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 1; (lex [@tailcall]) st)
| 1 -> false
| 2 ->
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 3; (lex [@tailcall]) st)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
let s = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\x00" in
for i = 1 to 100000000 do
let st = { str = s; cur = 0; state = 0 } in
if not (lex st) then raise (Failure "error")
done
let _ = main ()
Compiled as:
$ ocamlopt a.ml -o a
$ ocamlopt b.ml -o b
Run time:
$ time ./a
real 0m16.163s
user 0m16.162s
sys 0m0.001s
$ time ./b
real 0m22.179s
user 0m22.178s
sys 0m0.000s
So, many recursive closures are considerably faster in this example.
perf stat on the same binaries shows more instructions and in particular more branches for b (which should be explained by the one extra conditional jump on yystate
):
$ perf stat ./a
Performance counter stats for './a':
16,115.76 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
609 page-faults:u # 37.789 /sec
64,101,457,390 cycles:u # 3.978 GHz
237,516,872,339 instructions:u # 3.71 insn per cycle
41,103,312,263 branches:u # 2.551 G/sec
100,133,270 branch-misses:u # 0.24% of all branches
16.116135057 seconds time elapsed
16.115046000 seconds user
0.001000000 seconds sys
$ perf stat ./b
Performance counter stats for './b':
22,093.66 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
605 page-faults:u # 27.383 /sec
87,899,791,871 cycles:u # 3.979 GHz
326,339,866,047 instructions:u # 3.71 insn per cycle
61,404,372,905 branches:u # 2.779 G/sec
100,058,371 branch-misses:u # 0.16% of all branches
22.096008485 seconds time elapsed
22.090730000 seconds user
0.002999000 seconds sys
ocamlopt built with flambda support and -O3 gives approximately the same results, only both binaries are slightly faster (15.400s vs 21.624s).
I would have expected both to be a bit faster; somewhat surprised that it isn't more than that, but perhaps this isn't something flambda is particularly good at optimizing.
Anyway, I suppose this means that as suspected, the "goto" version is faster than the "big switch" version, and I suspect that the difference would be even bigger for very large state machines.
("goto" in this case being a reference to "Lambda: The Ultimate Goto". Tail recursion is by far my favorite goto.)
OCaml support has been added in experimental branch syntax-files
: https://github.com/skvadrik/re2c/commit/c1ccefacaf35427dffe8c43f82a534adbee49225. This is not the final release yet, some things may still change. See examples.
E.g. for the basic example:
(* re2ocaml $INPUT -o $OUTPUT -i *)
type state = {
str: string;
mutable cur: int;
}
/*!re2c
re2c:define:YYFN = ["lex;bool", "st;state"];
re2c:define:YYCTYPE = int;
re2c:define:YYPEEK = "Char.code st.str.[st.cur]";
re2c:define:YYSKIP = "st.cur <- st.cur + 1;";
re2c:yyfill:enable = 0;
number = [1-9][0-9]*;
number { true }
* { false }
*/
let main () =
let st = {str = "1234\x00"; cur = 0}
in if not (lex st) then raise (Failure "error")
let _ = main ()
re2ocaml generates the following code:
(* Generated by re2c *)
(* re2ocaml $INPUT -o $OUTPUT -i *)
type state = {
str: string;
mutable cur: int;
}
let rec yy0 (st : state) : bool =
let yych = Char.code st.str.[st.cur] in
st.cur <- st.cur + 1;
match yych with
| 0x31|0x32|0x33|0x34|0x35|0x36|0x37|0x38|0x39 -> (yy2 [@tailcall]) st
| _ -> (yy1 [@tailcall]) st
and yy1 (st : state) : bool =
false
and yy2 (st : state) : bool =
let yych = Char.code st.str.[st.cur] in
match yych with
| 0x30|0x31|0x32|0x33|0x34|0x35|0x36|0x37|0x38|0x39 ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) st
| _ -> (yy3 [@tailcall]) st
and yy3 (st : state) : bool =
true
and lex (st : state) : bool =
(yy0 [@tailcall]) st
let main () =
let st = {str = "1234\x00"; cur = 0}
in if not (lex st) then raise (Failure "error")
let _ = main ()
That's amazing @skvadrik, I'm excited to play with this and get back to you!
@skvadrik Should I publicize this among the OCaml community? I think some folks would find it interesting.
@skvadrik Should I publicize this among the OCaml community? I think some folks would find it interesting.
Sounds great, but let's wait until the API is finalized. I think I'll implement a few more language backends via syntax configs, and I also plan to tweak a few things in OCaml backend. I don't expect major changes though.
Python might be interesting, FWIW.
OCaml is quite popular in language development, and has a dearth of lexer generators with good unicode support. It's a bit different from the imperative languages that re2c currently supports, i.e. it doesn't support labelled
break
,continue
,goto
, or the ability to earlyreturn
. However, it does have good support for tail-call optimization, exceptions, algebraic effects, and of courseif-else
(for the nested-ifs code generation).Is there interest in supporting such a backend? If one were interested in contributing support, where should one look?
src/codegen
?Thanks