timtadh / lexmachine

Lex machinery for Go.

lexmachine doesn't understand common regular expressions #2

Closed matjam closed 7 years ago

matjam commented 7 years ago

I don't understand the computer science behind how this thing works. But I'm finding it useful in lexing a DSL I've written.

It looks like you've written your own regular expression parser? It doesn't seem to handle things like

[A-Za-z]

There are other things I'm trying to match, like this

([\._/:a-zA-Z]+):"(.+)"

but it's no bueno either.

Is there an explanation of your regex parser? Any chance it can be swapped out for something else, like the standard golang regular expression parser?

matjam commented 7 years ago

I guess I should have read more carefully: it only supports a "restricted set" of regular expressions.

Any chance you can document what that is?

timtadh commented 7 years ago

@matjam It is true that it doesn't have full support for all of the features in POSIX. The character class support is particularly limited (which is too bad). I didn't have a lot of time to get the parser to support the full set. For character classes you can basically have a single range like [a-z] or a negated list like [^asdfawe]. So:

[A-Za-z] --> ([A-Z]|[a-z])

and

([\._/:a-zA-Z]+):"(.+)" --> ((\\)|(\.)|(_)|(/)|(:)|[a-z]|[A-Z])+:"(.+)"

Feel free to open a stub PR for documentation. I can try and fill it in with other stuff.

(NB: I did not test the above regexes, so make sure you do before you ship them!)
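
For anyone who wants to do this rewrite mechanically, here is a rough sketch in Go. `expandClass` is a hypothetical helper, not part of lexmachine. It treats every member of the class literally, handles only positive classes, and assumes ASCII input. The equivalence check runs against Go's standard `regexp` package, not against lexmachine, so lexmachine's own escaping rules are an assumption that still needs testing there.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// expandClass is a hypothetical helper (not part of lexmachine). It takes the
// body of a positive character class, e.g. "A-Za-z", and rewrites it as an
// alternation of single ranges and escaped literals, e.g. "([A-Z]|[a-z])".
// Every byte is treated literally; negated classes are not handled.
func expandClass(body string) string {
	var alts []string
	for i := 0; i < len(body); {
		// A range looks like x-y, with a byte on both sides of the dash.
		if i+2 < len(body) && body[i+1] == '-' {
			alts = append(alts, fmt.Sprintf("[%c-%c]", body[i], body[i+2]))
			i += 3
			continue
		}
		// Anything else is a single literal; QuoteMeta escapes it if needed.
		alts = append(alts, "("+regexp.QuoteMeta(body[i:i+1])+")")
		i++
	}
	return "(" + strings.Join(alts, "|") + ")"
}

func main() {
	samples := []string{"abcXYZ", "a/b:c.d_e", "a b", "123"}
	for _, body := range []string{"A-Za-z", "._/:a-zA-Z"} {
		expanded := expandClass(body)
		fmt.Printf("[%s] --> %s\n", body, expanded)

		// Sanity check with Go's regexp: the original class and the
		// expanded alternation should accept the same sample strings.
		orig := regexp.MustCompile("^[" + body + "]+$")
		alt := regexp.MustCompile("^" + expanded + "+$")
		for _, s := range samples {
			if orig.MatchString(s) != alt.MatchString(s) {
				fmt.Printf("  mismatch on %q\n", s)
			}
		}
	}
}
```

The sketch leaves the backslash out of the second class. Whether `\` inside brackets is a literal member (as in POSIX bracket expressions) or an escape depends on the regex dialect, so add `(\\)` by hand if you need to match it.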

timtadh commented 7 years ago

I am going to work on a patch to at least improve the character class support.

matjam commented 7 years ago

if you can

\w
\W
\s
\S
\d
\D

etc would be super nice.

BTW, the API is very nice. I looked at all kinds of different ways to do lexing in Go, and lexmachine's interface was by far the easiest to get working. Nice job ;-)

timtadh commented 7 years ago

Not sure about the built-in backslash classes (\w, \d, \s and so on) at first. Can you open up a second ticket and point to what you think each should mean? I think there is some disagreement between implementations.

matjam commented 7 years ago

Honestly, if you go with whatever Go thinks something means, that's good enough for me.

timtadh commented 7 years ago

Unfortunately, lexmachine is a byte-oriented lexer and Go's regex engine is a unicode-aware engine. I do not have any plans to support unicode directly, so doing "what Go does" is not on the table. By the way, you can lex unicode with lexmachine; you just need to handle it as encoded bytes rather than as code points. However, since lexmachine is oblivious to encodings, there is no reasonable way to implement the various pre-defined character classes except on the ASCII range.

I am not going to implement these classes as part of this ticket but I am going to improve the character class support in general. Please open a second ticket for the built-in classes to discuss them there.
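
To make the "encoded rather than code points" remark concrete, here is a small Go sketch using only the standard library. `encodedBytes` is a hypothetical helper name. It shows that a single non-ASCII code point becomes several bytes in UTF-8, which is the sequence a byte-oriented lexer would have to match. How to write those bytes in a lexmachine pattern is not shown here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedBytes returns the UTF-8 encoding of a single code point.
// A byte-oriented lexer sees these bytes, not the code point itself.
func encodedBytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// One ASCII letter, one two-byte code point, one three-byte code point.
	for _, r := range "aλ€" {
		fmt.Printf("%q U+%04X -> % X\n", r, r, encodedBytes(r))
	}
}
```

Only the ASCII letter stays a single byte, so a class like [a-z] behaves the same either way, while anything outside ASCII has to be spelled out as a byte sequence.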

timtadh commented 7 years ago

@matjam please try out the above and let me know what you think. I was looking into the go definitions for the character classes and it looks like it ignores unicode for them so that might be doable after all: https://github.com/google/re2/wiki/Syntax#perl.
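
As a quick way to check that reading of the RE2 syntax page, the sketch below pairs each Perl-style shorthand with the ASCII-only bracket expression listed there and compares the two with Go's standard `regexp` package. The names `perlClasses`, `matchesWhole` and `sameOnASCII` are made up for illustration. This says nothing about how lexmachine would implement the classes.

```go
package main

import (
	"fmt"
	"regexp"
)

// perlClasses pairs each shorthand with the ASCII-only bracket expression
// given for it on the RE2 syntax page.
var perlClasses = []struct{ short, ascii string }{
	{`\d`, `[0-9]`},
	{`\D`, `[^0-9]`},
	{`\s`, `[\t\n\f\r ]`},
	{`\S`, `[^\t\n\f\r ]`},
	{`\w`, `[0-9A-Za-z_]`},
	{`\W`, `[^0-9A-Za-z_]`},
}

// matchesWhole reports whether pattern matches all of s.
func matchesWhole(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)$`).MatchString(s)
}

// sameOnASCII reports whether two patterns agree on every single ASCII byte.
func sameOnASCII(a, b string) bool {
	for c := 0; c < 128; c++ {
		s := string(rune(c))
		if matchesWhole(a, s) != matchesWhole(b, s) {
			return false
		}
	}
	return true
}

func main() {
	for _, c := range perlClasses {
		fmt.Printf("%s vs %s agree on ASCII: %v\n", c.short, c.ascii, sameOnASCII(c.short, c.ascii))
	}
	// The shorthands do not reach beyond ASCII in Go's engine.
	fmt.Println(`\d matches Arabic-Indic digit:`, matchesWhole(`\d`, "٣"))
	fmt.Println(`\w matches é:`, matchesWhole(`\w`, "é"))
}
```

If the shorthands really are ASCII-only, each one reduces to a plain bracket expression, which a byte-oriented lexer can handle without knowing anything about encodings.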