inko-lang / inko

A language for building concurrent software with confidence
http://inko-lang.org/
Mozilla Public License 2.0

Regular expressions #161

Open yorickpeterse opened 6 years ago

yorickpeterse commented 6 years ago

The VM should support a simple regular expression type and various operations for compiling regular expressions, finding matches, etc. The compiler / runtime in turn should use this. There are two ways this can be exposed to the language:

  1. Python style where a "regex" module is used to compile a regular expression from a string
  2. Ruby style where regex literals (e.g. /foo/) are used, and preferably the regex (at least for literals) is compiled when parsing the bytecode
yorickpeterse commented 6 years ago

Regex interning could be done by reusing interned strings and mapping these to their regex objects. This way, reusing the same regex literal would result in only one heap allocation. Since all of this happens when parsing bytecode, it won't have any runtime overhead (once all modules are loaded).
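
A rough sketch of what that could look like on the VM side (the `RegexPool` type and `get_or_compile` method are made up for illustration; the real VM structures will differ):

```rust
use regex::{Error, Regex};
use std::collections::HashMap;

// Hypothetical intern pool: the key is the ID of an already interned pattern
// string, so identical regex literals map to a single compiled Regex.
struct RegexPool {
    compiled: HashMap<usize, Regex>,
}

impl RegexPool {
    fn get_or_compile(&mut self, string_id: usize, pattern: &str) -> Result<&Regex, Error> {
        if !self.compiled.contains_key(&string_id) {
            // Only the first occurrence of a literal pays for compilation and
            // the heap allocation; later occurrences reuse the same object.
            self.compiled.insert(string_id, Regex::new(pattern)?);
        }

        Ok(&self.compiled[&string_id])
    }
}
```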

yorickpeterse commented 6 years ago

Rust's regex crate would be the most likely underlying engine used for this. In particular, I like its support for limiting the size of compiled regular expressions, which would be an interesting feature to expose to the Inko runtime.
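
For reference, that size limiting is exposed through the crate's builder API; a minimal sketch (the function name and the 1 MiB cap are just placeholders):

```rust
use regex::{Error, Regex, RegexBuilder};

// Compile a pattern while capping how large the compiled program may get, so
// pathological patterns fail with an error instead of eating memory.
fn compile_with_limit(pattern: &str) -> Result<Regex, Error> {
    RegexBuilder::new(pattern)
        .size_limit(1 << 20) // 1 MiB
        .build()
}
```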

The VM instructions we'd need (that I can think of) would be:

yorickpeterse commented 5 years ago

Per https://github.com/rust-lang/regex/issues/362 it seems the regex crate relies on thread-local storage, but doesn't clean it up properly. This is problematic for two reasons:

  1. Regular expressions might be used on different threads from the ones they were created on. It's not clear what impact this would have performance-wise (see the sketch after this list).
  2. We would leak memory by using this crate.
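
For the first point: a compiled Regex is Send and Sync, so sharing one across threads is possible, but whatever per-thread state the crate keeps internally would be outside our control. A minimal sketch of the sharing pattern in question (the pattern and thread count are arbitrary):

```rust
use regex::Regex;
use std::sync::Arc;
use std::thread;

fn main() {
    // Compile once, then match from several threads. Any scratch space the
    // crate allocates per thread is invisible to (and unmanaged by) us.
    let regex = Arc::new(Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap());

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let regex = Arc::clone(&regex);

            thread::spawn(move || regex.is_match(&format!("thread {}: 2018-01-0{}", i, i)))
        })
        .collect();

    for handle in handles {
        assert!(handle.join().unwrap());
    }
}
```
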
yorickpeterse commented 5 years ago

One option might be to ditch supporting regular expressions in the standard library, instead deferring this to a C library used through the FFI API.

Alternatively, we could perhaps support Rosie:

yorickpeterse commented 5 years ago

Rosie appears to be offloading most of its work to Lua, with the C library being a wrapper around the Lua embedding API (if I'm reading everything properly). This may result in undesirable overhead in the context of using Rosie from Inko.

yorickpeterse commented 5 years ago

Rosie seems like an unlikely candidate, but the underlying idea (reusable and testable PEG grammars) is very interesting.

There are currently two ideas I'm toying with:

  1. Syntax based grammars, similar to Perl 6's rules.
  2. Parsing combinators

Syntax grammars

Syntax-based grammars would be more verbose than regular expressions, but also easier to read. Let's say we want to parse an ISO 8601 date. Using regular expressions, we may end up with something like the following:

/\d+-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2}/

The syntax based grammar would instead look something like the following:

grammar {
  main { year '-' month '-' day ' '+ hour ':' minute ':' second }

  digit { [0-9] }
  year { digit+ }
  month { digit digit? }
  day { digit digit? }
  hour { digit digit }
  minute { digit digit }
  second { digit digit }
}

Here main would be the entry point of the grammar. This is much more verbose, but easier to read. If one were able to import existing grammars, this could be reduced to the following:

grammar {
  import time

  # year_month_day and hour_minute_second would be defined in the imported package.
  main { year_month_day ':' hour_minute_second }
}

A big downside of syntax grammars is the work necessary to implement them. This would require the compiler to parse and verify the syntax, and generate the necessary VM instructions or Inko code to parse the input efficiently. Using an existing Rust crate might allow us to offload all of this to said crate at runtime, but thus far I have been unable to find a well-maintained crate for parsing PEGs at run time; most only support parsing them at compile time.
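
In the absence of such a crate, interpreting a PEG at run time isn't a huge amount of machinery by itself. A rough sketch of the idea (the `Pattern` type and `matches` function are made up for illustration; a real version would also need captures and error reporting):

```rust
// Patterns are plain data, so they could be built while parsing Inko source
// or bytecode, then interpreted at run time.
enum Pattern {
    Literal(char),
    Range(char, char),
    Seq(Vec<Pattern>),
    Choice(Vec<Pattern>),
    Repeat(Box<Pattern>, usize), // at least N occurrences
    Optional(Box<Pattern>),
}

// Try to match `pattern` at byte offset `pos`, returning the new offset on success.
fn matches(pattern: &Pattern, input: &str, pos: usize) -> Option<usize> {
    match pattern {
        Pattern::Literal(c) => input[pos..].starts_with(*c).then(|| pos + c.len_utf8()),
        Pattern::Range(lo, hi) => input[pos..]
            .chars()
            .next()
            .filter(|c| c >= lo && c <= hi)
            .map(|c| pos + c.len_utf8()),
        Pattern::Seq(parts) => parts
            .iter()
            .try_fold(pos, |at, part| matches(part, input, at)),
        // PEG choice is ordered: the first alternative that matches wins.
        Pattern::Choice(options) => options.iter().find_map(|opt| matches(opt, input, pos)),
        Pattern::Repeat(inner, min) => {
            let mut at = pos;
            let mut count = 0;

            while let Some(next) = matches(inner, input, at) {
                if next == at {
                    // Guard against sub-patterns that match the empty string.
                    break;
                }

                at = next;
                count += 1;
            }

            (count >= *min).then_some(at)
        }
        Pattern::Optional(inner) => Some(matches(inner, input, pos).unwrap_or(pos)),
    }
}
```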

Parsing combinators

Parsing combinators are just methods that take some kind of input, and return a parser for that input. To parse the same ISO 8601 date mentioned above, you may write the following:

import std::parse

let digit = parse.any(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
let space = parse.character(' ')
let spaces = parse.many(space)
let year = parse.many(digit)
let month = digit.optional(digit)
let day = digit.optional(digit)
let double_digit = digit.then(digit)

let parser = year
  .then('-')
  .then(month)
  .then('-')
  .then(day)
  .then(spaces)
  .then(double_digit)
  .then(':')
  .then(double_digit)
  .then(':')
  .then(double_digit)

While this approach is much more verbose, it has several benefits:

  1. We do not need to extend the compiler or VM, as everything is done at run time. This means we could implement it today (see the sketch after this list).
  2. Since everything is done at run time, it's easier to combine this with custom code that needs to be run at certain points.
  3. Since it's just regular code, you should in theory be able to better control the errors produced.
  4. No dependencies will be necessary.
  5. The produced parsers can be optimised like any other Inko code, instead of requiring separate optimisation strategies.
  6. Since it's just regular code, you can share your parsers like any other Inko code.
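
To illustrate how little machinery the first point needs, here is roughly what the core of such a library could look like. This is only a sketch (in Rust rather than Inko, and without error reporting); the `character`, `then`, and `many` building blocks mirror the example above:

```rust
// A parser is just a function from (input, position) to an optional new position.
type Parser = Box<dyn Fn(&str, usize) -> Option<usize>>;

// Match one specific character.
fn character(expected: char) -> Parser {
    Box::new(move |input: &str, pos: usize| {
        input[pos..]
            .starts_with(expected)
            .then(|| pos + expected.len_utf8())
    })
}

// Match `first` followed by `second`.
fn then(first: Parser, second: Parser) -> Parser {
    Box::new(move |input: &str, pos: usize| {
        first(input, pos).and_then(|next| second(input, next))
    })
}

// Match `inner` one or more times.
fn many(inner: Parser) -> Parser {
    Box::new(move |input: &str, pos: usize| {
        let mut at = inner(input, pos)?;

        while let Some(next) = inner(input, at) {
            at = next;
        }

        Some(at)
    })
}
```

Everything else (`any`, `optional`, mapping matches to values, and so on) can be built the same way, which is why no compiler or VM changes would be needed.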

There are also a few downsides to parser combinators:

  1. They can get verbose pretty quickly. The above date example is already quite verbose, though this can be reduced if Inko were to provide a library of common parsers.
  2. Because everything happens at run time, the compiler can't really optimise and verify the code in any special way. For example, when using a grammar, a compiler might be able to detect that in the grammar A | A B the A B branch can never be reached if the engine is LL(1).
  3. The type signatures necessary to make all of this happen may end up being rather verbose.
  4. They're not very commonly used outside of functional programming languages, let alone as a replacement for regular expressions.
yorickpeterse commented 5 years ago

Also worth adding:

Regular expressions are great for very simple patterns, such as \d+ and Hello \w+. Unfortunately, they do not scale well to complex patterns as they quickly become very unreadable.

PEG grammars and parsing combinators (regardless of the parsing algorithm used) sacrifice compactness for readability and maintainability. This means that simple PEGs/combinators might be more verbose compared to the equivalent regular expressions, but they (in my opinion) tend to be equally readable when used for very complex parsers.

In other words: the "startup cost" of PEGs/combinators is a bit higher, but it stays more or less the same. Regular expressions on the other hand start out fairly straightforward, but become more painful to work with very quickly.

yorickpeterse commented 5 years ago

added to epic &1

yorickpeterse commented 5 years ago

There are three questions to think about here:

  1. How often are regular expressions used?
  2. When they are used, are regular expressions actually the right tool?
  3. How complex do these regular expressions tend to be?

In all cases the answer comes down to "That depends", but I used GitLab CE to try to get some more data. In CE there are a total of 556 regular expressions in application code (excluding configuration and tests). I used the following script for this:

Ruby script:

```ruby
# frozen_string_literal: true

# gem install parallel
require 'ripper'
require 'parallel'

patterns = Parallel
  .map(Dir['./{app,lib}/**/*.rb'], in_processes: 4) do |path|
    found = []
    buffer = ''
    in_regex = false

    Ripper.lex(File.read(path)).each do |token|
      if token[1] == :on_regexp_beg
        in_regex = true
      end

      if in_regex
        buffer += token[2]
      end

      if token[1] == :on_regexp_end
        in_regex = false
        found << buffer.gsub("\n", '\n')
      end
    end

    found
  end
  .inject([]) { |acc, curr| acc.concat(curr) }

buckets = Hash.new(0)

patterns.each do |pattern|
  bucket =
    if pattern.length >= 100
      pattern.length.round(-2)
    else
      pattern.length.round(-1)
    end

  buckets[bucket] += 1
end

puts 'Bucket Amount'

buckets.sort.each do |(bucket, value)|
  puts "#{bucket.to_s.ljust(6, ' ')} #{value.to_s.ljust(6, ' ')} #{'▮' * value}"
end
```


Using this on CE, the output is:

Output:

```
Bucket Amount
0      16     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
10     99     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
20     85     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
30     52     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
40     50     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
50     39     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
60     18     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
70     14     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮
80     20     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
90     7      ▮▮▮▮▮▮▮
100    44     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
200    31     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
300    20     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
400    15     ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
500    10     ▮▮▮▮▮▮▮▮▮▮
600    9      ▮▮▮▮▮▮▮▮▮
700    2      ▮▮
800    8      ▮▮▮▮▮▮▮▮
900    2      ▮▮
1000   7      ▮▮▮▮▮▮▮
1100   4      ▮▮▮▮
1200   1      ▮
1400   3      ▮▮▮
```


Going through the individual regular expressions, the complexity varies greatly. Some are simple (e.g. /./), while others are complex, like this one:

Regular expression:

```ruby
%r{
    (?<code>
      # Code blocks:
      # ```
      # Anything, including `/cmd arg` which are ignored by this filter
      # ```

      ^```
      .+?
      \n```$
    )
  |
    (?<html>
      # HTML block:
      # <tag>
      # Anything, including `/cmd arg` which are ignored by this filter
      # </tag>

      ^<[^>]+?>\n
      .+?
      \n<\/[^>]+?>$
    )
  |
    (?<quote>
      # Quote block:
      # >>>
      # Anything, including `/cmd arg` which are ignored by this filter
      # >>>

      ^>>>
      .+?
      \n>>>$
    )
  |
    (?:
      # Command not in a blockquote, blockcode, or HTML tag:
      # /close

      ^\/
      (?<cmd>#{Regexp.new(Regexp.union(names).source, Regexp::IGNORECASE)})
      (?:
        [ ]
        (?<arg>[^\n]*)
      )?
      (?:\n|$)
    )
}mix
```


For many of the more complex ones, it seems a proper parser would have been much more suitable. For simple regular expressions I suspect a parser might be overkill, even when using parsing combinators. Lua's pattern system might be more suitable, but I think it somewhat suffers from the same issues as regular expressions: it works for small/simple patterns, but likely doesn't scale to more complex ones.

To illustrate, here are some simple patterns from the CE codebase:

Patterns:

```ruby
//
/./
/^2/
/\n/
/\s/
/\s/
/-+/
/.+/
/.+/
/.+/
/\t/
/\s/
/../
/=~/
/==/
/!=/
/\s+/
/\d+/
/^\?/
/\s+/
```


Most of these are still reasonably easy to understand. The following patterns, however, already get more complicated:

Complicated patterns:

```ruby
%r{\/.*\/}
/\A(\d+)-/
/[\w\.-]+/
/\s+/
/\s+/
/^readme/i
/.*\.map$/
/#(\d+)\z/
%r{\Adoc/}
/section_/
/\A[^,]+\z/
/\A[^@]+\z/
%r{^/\w.*$}
/[\w%.+-]+/
/\s/
/[<>+]/
/key-(\d+)/
/[^\s\w.-]/
%r{^tcp://}
/\A\h{64}\z/
/\A\h{40}\z/
/[^-a-z0-9]/
/(\[[xX]\])/
/\A\w{8,}\z/
%r{\A.*?://}
/\A\h{40}\z/
/\AGitlab::/
```


The complexity here comes from the use of different anchors (\A and \z) and other non-self-describing character classes. Of course, somebody familiar with regular expressions will recognise these, but /key-(\d+)/ is still harder to understand than something like string('key-').digits.

yorickpeterse commented 5 years ago

removed from epic &1