ocaml-community / sedlex

An OCaml lexer generator for Unicode
MIT License
235 stars 43 forks source link

Optimize sedlexing size and performance #99

Closed bobzhang closed 3 years ago

bobzhang commented 3 years ago

ee20e3c used Sedlexing.next_int instead of Sedlexing.next There are two benefits:

The other two commits are self explained, c5d0eb2 reduced its size by around 10%

This should be a non breaking change

pmetzger commented 3 years ago

Is there a reason not to return a Uchar instead of an int and use an unusual value like U+FFFF? It would feel a bit less dirty than using an int; the new code is almost C-like. The performance should in practice be identical. (Interesting question, btw, is why returning an option should be so dramatically much more expensive.)

bobzhang commented 3 years ago

@pmetzger the beauty of the trick is that -1 is also used in the continuations so that the generated code size is reduced to its half. -1 is an invalid unicode in both 32 and 64 platforms. Note Sedlexing.next_int should not be used by the clients in general (like Sedlexing.mark etc).

why returning an option should be so dramatically much more expensive

The benefit is in two fold: code size reduction and performance enhancement. The code size reduction weights more here, the performance improvement is around 1~2% in the whole pipeline.

pmetzger commented 3 years ago

Why is the code size so different?

I suppose I feel allergic to this sort of thing because of all the horrible "-1 means error" based bugs that I dealt with in my 35 years of C programming. Having an option type is so much nicer. But if it will really make things much better, we clearly should do it. That said, I would hope there would be some way we could fix things to be better without such kludges?

bobzhang commented 3 years ago

I suppose I feel allergic to this sort of thing because of all the horrible "-1 means error" based bugs that I dealt with in my 35 years of C programming.

Note Sedlexing.next_int is not supposed to be used by users, we can change its name to something starts with internal etc. Would it make sense?

I wish we could write elegant code and the compiler will do the magic work to make it efficient, but we are not there

toots commented 3 years ago

I suppose I feel allergic to this sort of thing because of all the horrible "-1 means error" based bugs that I dealt with in my 35 years of C programming.

Note Sedlexing.next_int is not supposed to be used by users, we can change its name to something starts with internal etc. Would it make sense?

I wish we could write elegant code and the compiler will do the magic work to make it efficient, but we are not there

What are the other alternatives to using -1? I can think of:

Could any of those work?

pmetzger commented 3 years ago

Using Option is using variants and pattern matching.

Using exceptions might be an interesting possibility here for some users; it doesn't seem particularly worse than using a distinguished value, and it's more OCamlish.

pmetzger commented 3 years ago

Note Sedlexing.next_int is not supposed to be used by users, we can change its name to something starts with internal etc. Would it make sense?

That would at least help. Can you explain exactly the context in which this is being used btw? It might make what's going on clearer.

bobzhang commented 3 years ago

Can you explain exactly the context in which this is being used btw? It might make what's going on clearer.

next_int (or we can call unsafe_next if you like). It is a function only used by the ppx, not client facing. The thing is that ocaml does not provide package level private modifier, so we have to make it public.

Note using -1 is not ideal, but it looks acceptable to me, compare also returns -1.

Using variants and pattern matching?

That's the current solution

Using a immediate non-local return via an exception

That would defeat the purpose of optimization.

I think my patch is fairly safe, -1 is never a valid unicode either on 64 or 32 platforms, and it is supposed to be internal functions. This is actually something I like ocaml, when I need performance, I can drop into C style, and provide a safe interface for clients.

toots commented 3 years ago

Can you explain exactly the context in which this is being used btw? It might make what's going on clearer.

next_int (or we can call unsafe_next if you like). It is a function only used by the ppx, not client facing. The thing is that ocaml does not provide package level private modifier, so we have to make it public.

Gotcha. Yeah, a name like private_next_int would help.

Note using -1 is not ideal, but it looks acceptable to me, compare also returns -1.

Using variants and pattern matching?

That's the current solution

Word.

Using a immediate non-local return via an exception

That would defeat the purpose of optimization.

I think my patch is fairly safe, -1 is never a valid unicode either on 64 or 32 platforms, and it is supposed to be internal functions. This is actually something I like ocaml, when I need performance, I can drop into C style, and provide a safe interface for clients.

I think what @pmetzger is alluding to, and I can relate, is that, if someone down the road has to debug a problem with this code some time from now and rans across a -1 value, it won't be obvious what is going on.

However, and sorry for not taking the time to look at it before, it looks like next_int is just doing earlier what is already done later in the code:

    (pexp_function ~loc [
      case 
        ~lhs:(ppat_construct ~loc (lident_loc ~loc "Some") (Some (pvar ~loc "uc")))
        ~guard:None
        ~rhs:[%expr let c = Uchar.to_int uc in [%e body]];
      case 
        ~lhs:(ppat_construct ~loc (lident_loc ~loc "None") None)
        ~guard:None
        ~rhs:[%expr let c = (-1) in [%e body]]])

In this case, the changes are pretty much equivalent except that we skip a pattern matching case that makes the code execute faster. Correct?

bobzhang commented 3 years ago

Worth noting that %%private reliably breaks editor tooling like types on hover in submodules faster.

Yes. It also makes the code smaller since body is not duplicated any more

pmetzger commented 3 years ago

Okay, so thinking about it, I think I'm okay with this if it is kept internal to the ppx and if the reason we are doing it is documented with comments that explain the substantial performance / code size improvement. I would suggest that we avoid a bit of the code duplication, perhaps by using the private version inlined into the public routine.

toots commented 3 years ago

@pmetzger I have pushed https://github.com/ocaml-community/sedlex/pull/101, which is essentially a upgrade of dune and reorginization of the compilation units. The problem is that next_int needs to access some internals of Sedlexing, which are not available in the public API.

Rather than exposing next_int in the public API, this PR creates an internal module that can be shared between the ppx rewriter and the internal implementation of Sedlexing.

I believe that with this, #101 should be addressing all the concerns here.

pmetzger commented 3 years ago

I like #101 .

toots commented 3 years ago

Fixed with the merge of #101