yorickpeterse opened 6 years ago
Regex interning could be done by reusing interned strings and mapping these to their regex objects. This way, reusing the same regex literal would result in only one heap allocation. Since all of this happens when parsing bytecode, it won't have any runtime overhead (once all modules are loaded).
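To sketch the idea (Python purely for illustration; the `RegexCache` name and its API are made up), interning boils down to a map from pattern strings to their compiled objects:

```python
# Hypothetical sketch of regex interning: each distinct pattern string is
# compiled once, and repeated literals share the same compiled object.
import re

class RegexCache:
    def __init__(self):
        self._compiled = {}  # interned pattern string -> compiled regex

    def intern(self, pattern):
        # Compile only on first sight of the pattern; later lookups reuse
        # the existing object, so no extra heap allocation takes place.
        regex = self._compiled.get(pattern)
        if regex is None:
            regex = re.compile(pattern)
            self._compiled[pattern] = regex
        return regex

cache = RegexCache()
a = cache.intern(r"\d+")
b = cache.intern(r"\d+")
assert a is b  # the same literal maps to a single compiled object
```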
Rust's regex crate would be the most likely underlying engine for this. I particularly like its support for limiting the size of compiled regular expressions, which would be an interesting feature to expose to the Inko runtime.
The VM instructions we'd need (that I can think of) would be:
Per https://github.com/rust-lang/regex/issues/362 it seems the regex crate relies on thread-local storage, but doesn't clean it up properly. This is problematic for two reasons:
One option might be to drop support for regular expressions in the standard library, instead deferring this to a C library used through the FFI API.
Alternatively, we could perhaps support Rosie:
Rosie appears to offload most of its work to Lua, with the C library being a wrapper around the Lua embedding API (if I'm reading everything properly). This may result in undesirable overhead when using Rosie from Inko.
Rosie seems like an unlikely candidate, but the underlying idea (reusable and testable PEG grammars) is very interesting.
There are currently two ideas I'm toying with:
Syntax based grammars would be more verbose compared to regular expressions, but also easier to read. Let's say we want to parse an ISO 8601 date. Using regular expressions, we may end up with something like the following:
```
/\d+-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2}/
```
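As a point of reference, this is how that pattern behaves when run through Python's `re` module:

```python
import re

# The ISO 8601-ish pattern from above, applied with Python's re module.
pattern = re.compile(r"\d+-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2}")

assert pattern.match("2018-06-03 14:05:09")
assert pattern.match("2018-6-3 14:05:09")   # single-digit month and day
assert not pattern.match("2018-06-03")      # missing the time portion
```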
The syntax based grammar would instead look something like the following:
```
grammar {
  main { year '-' month '-' day ' '+ hour ':' minute ':' second }
  digit { [0-9] }
  year { digit+ }
  month { digit digit? }
  day { digit digit? }
  hour { digit digit }
  minute { digit digit }
  second { digit digit }
}
```
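To sketch what such a grammar could compile down to, here is a hand-written recursive-descent equivalent in Python (all names are made up; this only illustrates the rule-per-matching-step idea):

```python
# A rough sketch of the straight-line parser a compiler could generate from
# the grammar above: a cursor over the input, advanced by small matchers.
def parse_date(s):
    pos = 0

    def take_digits(min_count, max_count):
        # Greedily consume digits, PEG style; None means "unbounded".
        nonlocal pos
        count = 0
        while count != max_count and pos < len(s) and s[pos].isdigit():
            pos += 1
            count += 1
        return count >= min_count

    def take(text):
        nonlocal pos
        if s.startswith(text, pos):
            pos += len(text)
            return True
        return False

    def take_spaces():  # ' '+ : one or more spaces
        count = 0
        while take(' '):
            count += 1
        return count >= 1

    return (take_digits(1, None)    # year   { digit+ }
            and take('-')
            and take_digits(1, 2)   # month  { digit digit? }
            and take('-')
            and take_digits(1, 2)   # day    { digit digit? }
            and take_spaces()       # ' '+
            and take_digits(2, 2)   # hour   { digit digit }
            and take(':')
            and take_digits(2, 2)   # minute { digit digit }
            and take(':')
            and take_digits(2, 2))  # second { digit digit }

assert parse_date("2018-6-3 14:05:09")
```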
Here `main` would be the entry point of the grammar. This is much more verbose, but easier to read. If one were able to import existing grammars, they might be able to reduce it down to the following:
```
grammar {
  import time

  # year_month_day and hour_minute_second would be defined in the imported package.
  main { year_month_day ':' hour_minute_second }
}
```
A big downside of syntax grammars is the work necessary to implement them. This would require the compiler to parse and verify the syntax, and generate the necessary VM instructions or Inko code to parse the input efficiently. Using an existing Rust crate might allow us to offload all of this to said crate at runtime, but thus far I have been unable to find a well maintained crate for parsing PEG at run time; most only support parsing them at compile time.
Parsing combinators are just methods that take some kind of input, and return a parser for that input. To parse the same ISO 8601 date mentioned above, you may write the following:
```
import std::parse

let digit = parse.any(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
let space = parse.character(' ')
let spaces = parse.many(space)
let year = parse.many(digit)
let month = digit.then(digit.optional)
let day = digit.then(digit.optional)
let double_digit = digit.then(digit)

let parser = year
  .then('-')
  .then(month)
  .then('-')
  .then(day)
  .then(spaces)
  .then(double_digit)
  .then(':')
  .then(double_digit)
  .then(':')
  .then(double_digit)
```
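A rough sketch of how such combinators could be implemented (Python, with invented names mirroring the hypothetical API above; here a parser is a function from `(input, position)` to a new position, or `None` on failure):

```python
# Minimal parser-combinator sketch: parsers are functions, combinators
# build new parsers out of existing ones.
def character(ch):
    def parse(s, pos):
        return pos + 1 if pos < len(s) and s[pos] == ch else None
    return parse

def any_of(chars):  # ordered choice over single characters
    parsers = [character(c) for c in chars]
    def parse(s, pos):
        for p in parsers:
            result = p(s, pos)
            if result is not None:
                return result
        return None
    return parse

def many(parser):  # one or more repetitions, greedy
    def parse(s, pos):
        result = parser(s, pos)
        if result is None:
            return None
        while (again := parser(s, result)) is not None:
            result = again
        return result
    return parse

def then(first, second):  # sequence: first, then second
    def parse(s, pos):
        result = first(s, pos)
        return second(s, result) if result is not None else None
    return parse

def optional(parser):  # zero or one occurrence, never fails
    def parse(s, pos):
        result = parser(s, pos)
        return pos if result is None else result
    return parse

digit = any_of('0123456789')
year = many(digit)
month = then(digit, optional(digit))
day = then(digit, optional(digit))
sep = character('-')

date = then(year, then(sep, then(month, then(sep, day))))
assert date("2018-6-3", 0) == 8  # consumed the whole input
```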
While this approach is much more verbose, it has several benefits:
There are also a few downsides to parser combinators:
For example, with `A | A B` the `A B` branch might not be executed if the engine is LL(1).

Also worth adding:
Regular expressions are great for very simple patterns, such as `\d+` and `Hello \w+`. Unfortunately, they do not scale well to complex patterns as they quickly become very unreadable.
PEG grammars and parsing combinators (regardless of the parsing algorithm used) sacrifice compactness for readability and maintainability. This means that simple PEGs/combinators might be more verbose compared to the equivalent regular expressions, but they (in my opinion) tend to be equally readable when used for very complex parsers.
In other words: the "startup cost" of PEGs/combinators is a bit higher, but it stays more or less the same. Regular expressions on the other hand start out fairly straightforward, but become more painful to work with very quickly.
There are three questions to think about here:
In all cases the answer comes down to "That depends", but I used GitLab CE to try and get some more data. In CE, there are a total of 556 regular expressions in application code (excluding configuration and tests). I used the following script for this:
Using this on CE, the output is:
Going through the individual regular expressions, the complexity varies greatly. Some are simple (e.g. `/./`), while others are complex, like this one:
```
%r{
  (?<code>
    # Code blocks:
    # ```
    # Anything, including `/cmd arg` which are ignored by this filter
    # ```
    ^```
    .+?
    \n```$
  )
|
  (?<html>
    # HTML block:
    # <tag>
    # Anything, including `/cmd arg` which are ignored by this filter
    # </tag>
    ^<[^>]+?>\n
    .+?
    \n<\/[^>]+?>$
  )
|
  (?<quote>
    # Quote block:
    # >>>
    # Anything, including `/cmd arg` which are ignored by this filter
    # >>>
    ^>>>
    .+?
    \n>>>$
  )
|
  (?:
    # Command not in a blockquote, blockcode, or HTML tag:
    # /close
    ^\/
    (?<cmd>#{Regexp.new(Regexp.union(names).source, Regexp::IGNORECASE)})
    (?:
      [ ]
      (?<arg>[^\n]*)
    )?
    (?:\n|$)
  )
}mix
```
For many of the more complex ones it seems like a proper parser would have been much more suitable. For simple regular expressions I suspect a parser might be overkill, even when using parsing combinators. Lua's patterns system might be more suitable, but I think it somewhat suffers from the same issues as regular expressions: it works for small/simple patterns, but likely doesn't scale to more complex ones.
To illustrate, here are some simple patterns from the CE codebase:
Most of these are still reasonably easy to understand. The following patterns however already get more complicated:
The complexity here comes from the use of different anchors (`\A` and `\z`) and other non-self-describing character classes. Of course, somebody familiar with regular expressions will recognise these, but `/key-(\d+)/` is still harder to understand than something like `string('key-').digits`.
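To sketch what a `string('key-').digits` style API could look like (the `Pattern` class and its methods are invented purely for illustration), a builder could simply assemble a regex behind the scenes:

```python
# Toy fluent builder: self-describing method calls that compile down to
# an ordinary regular expression.
import re

class Pattern:
    def __init__(self, source=''):
        self._source = source

    def string(self, text):
        # Match the given text literally, escaping regex metacharacters.
        return Pattern(self._source + re.escape(text))

    @property
    def digits(self):
        # Match (and capture) one or more digits.
        return Pattern(self._source + r'(\d+)')

    def match(self, text):
        return re.match(self._source, text)

key = Pattern().string('key-').digits
match = key.match('key-123')
assert match and match.group(1) == '123'
```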
The VM should support a simple regular expression type and various operations for compiling regular expressions, finding matches, etc. The compiler / runtime in turn should use this. There are two ways this can be exposed to the language:
One way is that regex literals (e.g. `/foo/`) are used, and preferably the regex (at least for literals) is compiled when parsing the bytecode.