Regex literals - Githubissues

faultyserver commented 7 years ago

Regex literals are an important part of modern scripting languages. Having a concise syntax for instantiating patterns and performing matches is important.

For actually performing matches, I think using the existing pattern-matching syntax would be nice:

/(?<identifier>[a-zA-Z][a-zA-Z0-9]+)/ =: "   matchedvalue "
puts identifier #=> matchedvalue

This syntax would only work with named subgroups, but seems incredibly clear and concise in terms of extracting values. Compare this to the Ruby equivalent:

matches = /(?<identifier>[a-zA-Z][a-zA-Z0-9]+)/.match("   matchedvalue ")
puts matches["identifier"] #=> matchedvalue
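For what it's worth, Ruby already has a construct close to this proposal: when a regex literal with named groups sits on the left-hand side of `=~`, Ruby assigns the named captures to local variables. A minimal sketch, where the method name `extract_identifier` is only there for illustration:

```ruby
# Ruby assigns named captures to local variables when a regex *literal*
# (without interpolation) appears on the left-hand side of =~.
def extract_identifier(text)
  if /(?<identifier>[a-zA-Z][a-zA-Z0-9]+)/ =~ text
    identifier # local variable created from the named group
  end
end

puts extract_identifier("   matchedvalue ") #=> matchedvalue
```

This only works with the literal on the left, and it still sets `$~` as a side effect.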

Crystal has direct support for PCRE regexes, so the actual matching aspect should be simple to implement.

This needs more thought for unnamed captures, interpolation, and other edge cases, but these are my initial thoughts.

faultyserver commented 7 years ago

The big thing I'd like to avoid here is setting a global variable. Ruby sets $~, which is useful for golfing, but that's about it.
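As a point of comparison, Ruby code can sidestep `$~` by working only with the `MatchData` that `match` returns. A small sketch, with `named_values` as a hypothetical helper name:

```ruby
# Return the named captures as a Hash, or nil when the pattern does not match.
# The caller works only with the returned value and never reads $~.
def named_values(pattern, text)
  match = pattern.match(text)
  match && match.named_captures
end

values = named_values(/(?<identifier>[a-zA-Z][a-zA-Z0-9]+)/, "   matchedvalue ")
puts values["identifier"] #=> matchedvalue
```

`Regexp#match?` goes one step further and does not set `$~` at all, but it only reports whether the pattern matched.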

However, the current semantics of the pattern-match operator are to return the right-hand-side value to allow pattern-matching assignments to be chained inline with simple assignments:

all = {thing1: :a, thing2: :b} =: function_call_returning_map()

I don't remember my particular reasoning for that decision, but I'd probably be open to returning the left-hand side instead as an anonymous value on the stack, though that may have some implications for the syntax of pattern-matched function parameters.

faultyserver commented 7 years ago

I read something a week or so ago giving a general dissent about using // as the notation for Regex literals, the main reason being that it's ambiguous with mathematical division.

While that ambiguity isn't really accurate from a parsing standpoint (the grammar essentially enforces that where a division would occur, a regex would be invalid, and vice versa), I can agree with the visual ambiguity.
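One common way to make that distinction mechanical is to look at the previous token: after something that can end an expression, `/` can only be division, and anywhere else it can only start a regex. A rough sketch with hypothetical token names (these are not Myst's actual token types):

```ruby
# Token types after which a value has just ended, so a following "/"
# can only be a division operator. These names are hypothetical.
EXPRESSION_ENDERS = %i[identifier integer float string rparen rbrace rbracket].freeze

# Decide how to read "/" given the type of the previous token
# (nil at the start of input).
def slash_kind(previous_token_type)
  EXPRESSION_ENDERS.include?(previous_token_type) ? :divide : :regex_start
end

puts slash_kind(:identifier) #=> divide
puts slash_kind(:equal)      #=> regex_start
```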

I'm not sure what I would prefer as an alternative, though. Elixir uses a ~r sigil, which isn't too bad.

faultyserver commented 7 years ago

While I was originally against them, Sigils are a decent solution for non-literal literals (regexes, datetimes as shown in #5, etc.).

I still don't want to implement them in quite the same way (user-defined sigils seem a bit messy), but I could see having a few defined by the language.

faultyserver commented 6 years ago

Just a slight variation on sigils: r/[A-Z][a-z]*/.

I like the lack of a tilde, and it's also not syntactically ambiguous with division.

Jens0512 commented 6 years ago

I'm all for the r/[A-Z][a-z]*/ approach.

In this example I'll call it sigil because I've no better name atm. It'd be better if we came up with some cool name of our own imo.

I think that it should be possible for people to define their own sigil x/foo/bar/.../g, where

x is the name of the sigil.
foo is an argument.
bar is another argument.
g is an option (the difference between args and options is that options do not have the trailing /).

The smallest possible sigil definable should be x//, where x takes no args or options.

So just like regexes really, the main differences being the x and the ability to take multiple arguments (x/M/y/s/t/). This approach is common in editors like vim, sed, etc.

For example we could (pretending we have a String#sub method [which we really should have]) define the sigil s (for substitution) somehow like this:

sigil s(in : String, [String, String] =: args, opts : List)
  when opts.includes? "g"
    return in.gsub(args[0], args[1])
  else
    return in.sub(args[0], args[1])
  end
end

(Sorry about all my Myst syntax problems.) With this, "Foo" =~ s/o/u/g # => "Fuu", and "Foo" =~ s/o/u/ # => "Fuo". The idea here is that the =~ operator feeds the left side value into the right hand sigil as in (or the other way around, with the =: operator like in the first example in this issue, or both), and calls the sigil with in, args and opts.

Note that all of this is just something that popped into my head while reading @faultyserver's earlier suggestions, and I just posted them during class. Take everything with a grain of salt.
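To make the intended behaviour concrete, here is a rough Ruby sketch of how such a sigil could be split into name, args and opts and then applied. The helper names `parse_sigil` and `apply_substitution` are made up for illustration, and the sketch assumes that `/` cannot be escaped inside an argument and that x// means "no arguments":

```ruby
# Split a sigil such as "s/o/u/g" into its name, arguments, and options.
# Assumptions: "/" cannot be escaped inside arguments, and "x//" means
# "no arguments" rather than "one empty argument".
def parse_sigil(source)
  name, *rest = source.split("/", -1)
  opts = rest.pop.to_s.chars          # everything after the last "/"
  args = rest == [""] ? [] : rest     # treat x// as taking no arguments
  { name: name, args: args, opts: opts }
end

# Hypothetical "s" sigil: substitute once, or globally with the "g" option.
def apply_substitution(input, source)
  sigil = parse_sigil(source)
  pattern, replacement = sigil[:args]
  if sigil[:opts].include?("g")
    input.gsub(pattern, replacement)
  else
    input.sub(pattern, replacement)
  end
end

puts apply_substitution("Foo", "s/o/u/g") #=> Fuu
puts apply_substitution("Foo", "s/o/u/")  #=> Fuo
```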

WDYT?

faultyserver commented 6 years ago

While I generally like your idea for implementing sigils, I don't think it would fit well in Myst. The sigil style itself has a pretty universally understood pattern, and is a very clean way of representing textual operations, even outside of text editors. But, applying that into expressions for a programming language seems a bit awkward.

I've never really been a fan of the =~ operator. To me it doesn't visibly look like what it actually does (though, the same could be said for =:, which I do like...). I also find it particularly hard to instantly know what the return of the expression should be when using this operator.

I'm also not entirely sold on the combination of that operator with the sigil, which is essentially another operator in the expression. In particular, with the example you've given, it's not really any shorter or cleaner than just using the sub or gsub method, but still adds the visual overload of two "operators" next to each other:

"Foo" =~ s/o/u/g
"Foo".gsub('o', 'u')

Obviously this is just one example and it could be more effective in other cases, but I much prefer the obvious call and implied operation of the method syntax over the operator+sigil style.

If we can come up with a different way to apply the sigil to the receiver, I would gladly be open to it.

ron-wolf commented 5 years ago

=: may not look like what it does, but it does look like something with a similar meaning, i.e., :=. So I can see the sense in it.

And for the record, I really like the pattern-matching syntax you suggested. It might be difficult to implement, but if that can be done without complicating the backend logic too much, then it will have been worth the effort. Do you have an idea of which file(s) would have to be updated to process this type of syntax?

faultyserver commented 5 years ago

From a parsing standpoint, everything would live in src/myst/syntax/parser.cr, adding a new clause to the parse_literal method. A new AST node, RegexLiteral, would need to be added as a subclass of Literal as well in ast.cr.

A REGEX_START token type for the r/ syntax would need to be added to token.cr, and then it needs to be implemented in lexer.cr (this is a bit difficult to do really generically; I'd be okay with just matching r then / for now, like how other two-character tokens are being matched).
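As a rough illustration of the "match r then /" approach, here is a Ruby sketch using `StringScanner`. It is not the Crystal code from lexer.cr; the method name and the `:regex` token type are hypothetical, and allowing `\/` inside the pattern is an assumption:

```ruby
require "strscan"

# Scan a regex literal of the form r/.../ at the scanner's current position.
# Returns [:regex, source] or nil when the input does not start with "r/".
# Assumption: a "/" inside the pattern may be escaped as "\/".
def scan_regex_literal(scanner)
  return nil unless scanner.scan(%r{r/})   # the two-character REGEX_START
  body = scanner.scan(%r{(?:\\.|[^/\\])*}) # pattern source up to the closing "/"
  raise ArgumentError, "unterminated regex literal" unless scanner.scan(%r{/})
  [:regex, body]
end

token = scan_regex_literal(StringScanner.new("r/[A-Z][a-z]*/"))
puts token.inspect #=> [:regex, "[A-Z][a-z]*"]
```

A real lexer would also have to decide what happens when an identifier named r is directly followed by a division.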

Interpreter-wise, src/myst/interpreter/matcher.cr needs a new clause to handle matching regex literal patterns. I think having regexes be their own primitive, immutable value type is good, so value.cr needs a RegExp type with the shim methods, and that type needs to be added to the MTValue union. nodes/literals.cr should then have a clause matching RegexLiteral nodes and creating a RegExp value from it (could use __value_from_literal here).
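The matcher clause could behave roughly like the Ruby sketch below. This is only an approximation of the idea, not the Crystal code from matcher.cr: `MatchError`, `match_regex_pattern`, and the plain Hash standing in for the interpreter's scope are all hypothetical.

```ruby
# Raised when the value does not match the pattern (hypothetical name).
class MatchError < StandardError; end

# Match `value` against a regex pattern and bind each named capture into
# `scope` (a plain Hash standing in for the interpreter's symbol table).
# Returns the right-hand-side value, mirroring the current =: semantics.
def match_regex_pattern(pattern, value, scope)
  match = pattern.match(value)
  raise MatchError, "#{value.inspect} does not match #{pattern.inspect}" unless match
  match.names.each { |name| scope[name] = match[name] }
  value
end

scope = {}
match_regex_pattern(/(?<identifier>[a-zA-Z][a-zA-Z0-9]+)/, "   matchedvalue ", scope)
puts scope["identifier"] #=> matchedvalue
```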

And of course specs for all of these changes in their respective spec files, which should all exist and be named appropriately in the spec folder.

That’s all that I can think of that needs updating at a glance. It’s quite a lot, but most of the changes are pretty minor.

myst-lang / myst

Regex literals #4