neugram / ng

scripting language integrated with Go
https://neugram.io
BSD 2-Clause "Simplified" License

consider using mvdan/sh as a shell parser #9

Open crawshaw opened 7 years ago

crawshaw commented 7 years ago

Neugram has its own shell syntax parser. It's complicated, and not complete.

There is a nice shell implementation in https://github.com/mvdan/sh (by @mvdan) we could use. It would be great to reduce the lines of code in Neugram.

Potential issues:

- Passing control from one tokenizer to another and back could lead to some tricky glue code.
- Neugram is not interested in control flow syntax.
- Neugram shell blocks end on `$$`, a valid shell word. This may need a hack in the shell parser.
- A lot of the complexity of shell is the evaluator, which we won't get to use.

Worth investigating at some point.

mvdan commented 7 years ago

> Passing control from one tokenizer to another and back could lead to some tricky glue code.

The parser currently takes an io.Reader and does some buffering, so handing control back and forth could get tricky. It also needs to buffer at least the next full rune so that it can peek ahead to make proper lexing and parsing decisions.

The silliest way to solve this would be to know which bytes belong to the shell parser and which don't before the shell parser is even invoked. For example, you could say that the first occurrence of `$$` ends the shell code, even if it is part of a string. That would also remedy the third point about `$$`. Not saying it's a good solution, of course. Note that one could still use the actual `$$` variable as `${$}`, avoiding the delimiter string.
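A minimal sketch of that pre-scan idea, assuming the Neugram source is already in a byte slice (the helper name here is made up for illustration):

```go
import "bytes"

// splitShellBlock cuts the input at the first "$$", handing everything
// before it to the shell parser, even if that "$$" sits inside a string.
func splitShellBlock(src []byte) (shell, rest []byte) {
	if i := bytes.Index(src, []byte("$$")); i >= 0 {
		return src[:i], src[i+len("$$"):]
	}
	return src, nil
}
```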

> Neugram shell blocks end on `$$`, a valid shell word. This may need a hack in the shell parser.

You could definitely maintain a small patch to tell the parser to force an EOF as soon as it sees a word consisting of just `$$`.

In the spirit of keeping forks and patches to a minimum, let me try to help with alternatives. Connecting to "not interested in control flow syntax", it sounds like you're really just interested in parsing shell words and not much else.

If that is the case, I could add a mode for the parser to just parse words and nothing else. However, I see that you're also interested in pipes and redirections, so I think this would not be at the right level for you.
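A sketch of how such a word-only mode could look from the caller's side; current versions of the package have a streaming Parser.Words method roughly along these lines (the import path shown is today's v3 module, not what existed at the time):

```go
package main

import (
	"fmt"
	"strings"

	"mvdan.cc/sh/v3/syntax"
)

func main() {
	p := syntax.NewParser()
	// Words parses one word at a time and errors on anything that
	// isn't a word, such as a pipe or a semicolon.
	err := p.Words(strings.NewReader("foo bar baz"), func(w *syntax.Word) bool {
		fmt.Println(w.Lit())
		return true // return false to stop parsing early
	})
	if err != nil {
		panic(err)
	}
}
```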

In terms of shell grammar (see 2.10.2 in http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html), you're in an awkward spot. You're basically at pipe_sequence (skipping && and ||), but you don't want compound commands (if, for, {, etc).

From your point of view, the easiest would be a callback func called for every parsed word that would let you make the parser stop. I guess that would be generic enough to be useful in other scenarios. We could go one step further and make it be `func(syntax.Node) error`, allowing you to error on control flow nodes too, but I get the feeling that an `if fn != nil { fn(node) }` for every node would slow down the parser quite a bit.
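Absent such a callback option, one can get a similar effect by parsing first and then rejecting unwanted node types with syntax.Walk. A minimal sketch; the set of banned nodes is illustrative, and note that `|` as well as `&&`/`||` all show up as *syntax.BinaryCmd, so a real filter would also inspect the operator:

```go
package main

import (
	"fmt"
	"strings"

	"mvdan.cc/sh/v3/syntax"
)

// rejectControlFlow reports the first control flow construct in the AST,
// since Neugram only wants simple commands, pipes, and redirections.
func rejectControlFlow(file *syntax.File) error {
	var err error
	syntax.Walk(file, func(node syntax.Node) bool {
		switch node.(type) {
		case *syntax.IfClause, *syntax.WhileClause, *syntax.ForClause,
			*syntax.CaseClause, *syntax.Block, *syntax.FuncDecl:
			if err == nil {
				err = fmt.Errorf("control flow not supported at %s", node.Pos())
			}
			return false
		}
		return err == nil
	})
	return err
}

func main() {
	file, _ := syntax.NewParser().Parse(strings.NewReader("if true; then echo hi; fi"), "")
	fmt.Println(rejectControlFlow(file))
}
```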

> A lot of the complexity of shell is the evaluator, which we won't get to use.

I'm not sure how advanced your evaluator is, but see https://github.com/mvdan/sh/tree/master/interp.

The idea is to make it modular and heavily configurable; see https://github.com/mvdan/sh/issues/147. For example, at the moment it's possible to mock all program executions or (direct) file open calls. If the same were available for variable handling, it would allow you to "insert" the Go variables into the shell interpreter too. Would anything else be needed?
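For illustration, mocking all program executions looks roughly like this against today's interp API (the option names have changed over time, so treat this as a sketch rather than the exact interface that existed when this was written):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strings"

	"mvdan.cc/sh/v3/interp"
	"mvdan.cc/sh/v3/syntax"
)

func main() {
	file, err := syntax.NewParser().Parse(strings.NewReader("ls | wc -l"), "")
	if err != nil {
		panic(err)
	}
	runner, err := interp.New(
		interp.StdIO(os.Stdin, os.Stdout, os.Stderr),
		// Mock program executions: print the argv instead of running it.
		interp.ExecHandler(func(ctx context.Context, args []string) error {
			fmt.Println("would exec:", args)
			return nil
		}),
	)
	if err != nil {
		panic(err)
	}
	if err := runner.Run(context.Background(), file); err != nil {
		panic(err)
	}
}
```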

I would see the interpreter as step 2, but you could also argue that rewriting your interpreter with my syntax package is work that you could avoid by doing everything at once.

mvdan commented 6 years ago

@crawshaw I have just added a feature that should fix the $$ delimiter issue:

```go
p := syntax.NewParser(syntax.StopAt("$$"))
file, err := p.Parse(strings.NewReader("foo bar $$"), "")
// you'll end up with "foo bar" in the AST
```

See the godoc for that new option. It acts on the lexer, so it works with all the examples in your shell guide doc.

This also makes the "passing control from one tokenizer to another" issue simpler. Assuming all the source is in a buffer, you could hand the parser a sub-slice of it wrapped in a reader (keeping track of the offset it starts at), and parsing would automatically stop at either the end of the input or `$$`.

The tricky bit then is to figure out when to start your original tokenizer again. One possible solution would be to grab file.End().Offset() and look for the first `$$` after that. I can't think of a case where this would break.
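A sketch of that hand-off, assuming the whole Neugram source sits in a byte slice src and the shell block starts at offset start (both names are illustrative):

```go
p := syntax.NewParser(syntax.StopAt("$$"))
file, err := p.Parse(bytes.NewReader(src[start:]), "")
if err != nil {
	log.Fatal(err)
}
// file.End() is where the shell AST stops; the delimiter is the first
// "$$" at or after that offset.
end := start + int(file.End().Offset())
if i := bytes.Index(src[end:], []byte("$$")); i >= 0 {
	resume := end + i + len("$$")
	// hand control back to the Neugram tokenizer at src[resume:]
	fmt.Println("resume lexing at byte offset", resume)
}
```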

So this, plus walking the AST to block the node types you don't want, should get you what you want. Then I think you could use interp pretty much as-is. The only custom behaviour that I see is that Neugram variables pass on to the shell, which you could do with `Runner.Env`. I might change `Env []string` to be a `func(string) string` in the future.
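For reference, the environment did end up moving in that direction in today's v3 API: variables are supplied as an expand.Environ, and expand.FuncEnviron adapts exactly a func(string) string. Passing Neugram variables through might look like this sketch, where lookupNeugramVar is a hypothetical hook into the Neugram scope:

```go
import (
	"mvdan.cc/sh/v3/expand"
	"mvdan.cc/sh/v3/interp"
)

// newRunner builds an interpreter whose variables are backed by Neugram.
func newRunner(lookupNeugramVar func(name string) string) (*interp.Runner, error) {
	return interp.New(
		// An empty return value is treated as an unset variable.
		interp.Env(expand.FuncEnviron(lookupNeugramVar)),
	)
}
```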

mvdan commented 6 years ago

Also, another byproduct of using this package would be that the shell language would become plain Bash (or plain POSIX shell, whichever you prefer). So, for example, `${param[offset:length]}` would become `${param:offset:length}`, and `${param/regexp/replacement}` would use a shell pattern instead of a regular expression.

And you might get more shell features implemented for free, as there's a ton to reimplement. To get an idea of what is currently supported, you can check the table-driven tests: https://github.com/mvdan/sh/blob/master/interp/interp_test.go

mvdan commented 6 years ago

I'm working on a prototype pull request at the moment to replace the shell parser. It almost works - it only has a bug in an edge case where the two lexers step over each other. Hopefully I'll fix that last bug and post the PR soon.