rhaiscript / lsp

Language server for Rhai.
Apache License 2.0

Process interpolated code in strings #46

Closed tamasfe closed 2 years ago

tamasfe commented 2 years ago

Currently all strings are processed as-is, and interpolations are ignored.

I don't know yet where to process these, as they will complicate the parse tree a whole lot.

schungx commented 2 years ago

> Currently all strings are processed as-is, and interpolations are ignored.

It would be dangerous, as there can be nested interpolations. If you just count braces, you may end up going wrong.

`This is first level ${let s = `this is second level ${x}`; s} so there.`

> I don't know yet where to process these, as they will complicate the parse tree a whole lot.

What I do is, when I see a `, I read a string until I get another ` (closing) or ${. If it is ${, I return the string read so far as a partial segment and start parsing a statements block (as if I had just seen a {). When the block ends with }, I start another partial string segment. Repeat.

You can simply treat the $ (when it is followed by {) as a ` that terminates the previous string literal, and then just start parsing a normal statements block.

Then the parse tree is simply an array of alternating string segments and statements blocks. Internally, I keep them all as Expr type, with each string segment simply mapping to a string literal.
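A minimal Rust sketch of the read-until-` -or-${ loop described above, under some loud assumptions: `Segment`, `split_interpolated` and `skip_block` are hypothetical names, not taken from Rhai or the LSP; blocks are kept as raw source text instead of being parsed into statements; and escapes inside back-tick strings, comments and char literals are not handled. It only aims to show why nested interpolations need recursion rather than plain brace counting.

```rust
/// One piece of an interpolated string (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Segment {
    /// Literal text between interpolations.
    Literal(String),
    /// Raw source of a `${ ... }` block; a real parser would parse statements here.
    Block(String),
}

/// Reads a back-tick string body starting at byte index `start` (just after the
/// opening back-tick). Returns the segments and the index just past the closing
/// back-tick, or `None` if the input is unterminated.
fn split_interpolated(src: &str, start: usize) -> Option<(Vec<Segment>, usize)> {
    let bytes = src.as_bytes();
    let mut segments = Vec::new();
    let mut literal_start = start;
    let mut i = start;
    while i < bytes.len() {
        match bytes[i] {
            b'`' => {
                // Closing back-tick: emit the final segment and stop.
                segments.push(Segment::Literal(src[literal_start..i].to_string()));
                return Some((segments, i + 1));
            }
            b'$' if bytes.get(i + 1) == Some(&b'{') => {
                // `${` ends the current partial segment and starts a block.
                segments.push(Segment::Literal(src[literal_start..i].to_string()));
                let block_start = i + 2;
                let block_end = skip_block(src, block_start)?;
                segments.push(Segment::Block(src[block_start..block_end].to_string()));
                i = block_end + 1;
                literal_start = i;
            }
            _ => i += 1,
        }
    }
    None
}

/// Scans a block starting just after its `{` and returns the index of the
/// matching `}`. Nested back-tick strings are skipped recursively, so braces
/// inside them are never counted.
fn skip_block(src: &str, start: usize) -> Option<usize> {
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = start;
    while i < bytes.len() {
        match bytes[i] {
            b'{' => { depth += 1; i += 1; }
            b'}' if depth == 0 => return Some(i),
            b'}' => { depth -= 1; i += 1; }
            b'`' => { i = split_interpolated(src, i + 1)?.1; }
            b'"' => {
                // Skip a double-quoted string, honouring backslash escapes.
                i += 1;
                while i < bytes.len() && bytes[i] != b'"' {
                    if bytes[i] == b'\\' { i += 1; }
                    i += 1;
                }
                i += 1;
            }
            _ => i += 1,
        }
    }
    None
}
```

Calling `split_interpolated(src, 1)` on the two-level example above yields three segments: the leading literal, one block that still contains the whole inner back-tick string, and the trailing literal.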

schungx commented 2 years ago

After thinking about this some more, I have an idea to fit the parsing of interpolated strings into the current LSP structure. As I understand after Googling a bit, tokenizing an interpolated string is non-trivial.

The idea is to parse the interpolated string as a string of tokens, essentially like what is done with the current Rhai parser.

Upon seeing `, the tokenizer should parse until it sees either a closing ` or a ${.

In your grammar, you need a special rule for interpolated strings:

Lit =
  'lit_int'
| 'lit_float'
| String
| 'lit_bool'
| 'lit_char'

String =
  'lit_str'
| 'lit_interp_str' '${' Expr '}'  'lit_str'
| 'lit_interp_str' '${' Expr '}'  'lit_interp_str'

So, for example, the following:

`The answer is ${`an ${if answer.is_even() { "even number" } else { "odd number" }}` + answer}.  QED.`

Probably gets parsed into:

lit_interp_str = "The answer is "

Expr: +
    String:
        lit_interp_str = "an "
        Expr: if
        lit_str = ""
    Expr: answer

lit_str = ".  QED."

tamasfe commented 2 years ago

Yeah looks like I'll incorporate it into the parse tree anyway.

I was hoping I could just parse it as a string literal between the ` then lazily process it further without disturbing the existing grammar, but yeah as you mentioned there can be a lot of edge cases as everything can be arbitrarily nested.

tamasfe commented 2 years ago

I'll tag this as hard instead as it'll need special care in the HIR as well, will get back to this once everything else works.

schungx commented 2 years ago

That's correct. The trick, it seems, is to convert the nested embedded expressions into structured syntax that can be parsed simply with a grammar.

So it seems like an interpolated string is nothing but a fancy function call, sort of. You simply have literal strings instead of commas separating the expressions, and the first piece, from ` up to the first ${, acts as the function name.

schungx commented 2 years ago

So, just to think aloud along that idea... take the following interpolated string:

`The answer is ${`an ${if answer.is_even() { "even number" } else { "odd number" }}` + answer}.  QED.`

We parse it as if we have:

str_interp_the_answer_is(
    { str_interp_an(  if answer.is_even() { "even number" } else { "odd number" } , "" )  + answer } ,
 // ^ this must be parsed as a statements block as there may be multiple statements inside
 //                                                                                 ^ the last segment is empty
    ".  QED."
)

In other words, the grammar rules for an interpolated string should be exactly the same as a function call with arguments, and function calls can naturally occur within arguments.
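To make the call analogy concrete, here is a small Rust sketch that folds the alternating pieces of an interpolated string into a call-shaped node. Everything in it is an assumption for illustration: `Expr`, `as_call` and `example` are hypothetical, `Raw` stands in for any properly parsed expression, and a real parse tree would likely look different.

```rust
/// Hypothetical expression type for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A plain string literal segment.
    Str(String),
    /// Stand-in for any other expression, kept as raw source text.
    Raw(String),
    /// `lhs + rhs`
    Add(Box<Expr>, Box<Expr>),
    /// Call-shaped node: `head` plays the role of the function name.
    Call { head: String, args: Vec<Expr> },
}

/// Folds alternating pieces (literal, block, literal, ...) into call form.
/// The first literal becomes the head; every later piece becomes an argument.
fn as_call(pieces: Vec<Expr>) -> Option<Expr> {
    let mut iter = pieces.into_iter();
    match iter.next()? {
        Expr::Str(head) => Some(Expr::Call { head, args: iter.collect() }),
        _ => None, // an interpolated string must start with a literal segment
    }
}

/// Builds the call form of the example string from this comment by hand.
fn example() -> Expr {
    let inner = as_call(vec![
        Expr::Str("an ".into()),
        Expr::Raw(r#"if answer.is_even() { "even number" } else { "odd number" }"#.into()),
        Expr::Str(String::new()), // the last segment of the inner string is empty
    ])
    .expect("first piece is a literal");

    // The outer block is `<inner string> + answer`.
    let block = Expr::Add(Box::new(inner), Box::new(Expr::Raw("answer".into())));

    as_call(vec![
        Expr::Str("The answer is ".into()),
        block,
        Expr::Str(".  QED.".into()),
    ])
    .expect("first piece is a literal")
}
```

The nesting falls out for free: the inner string is just another `Call` sitting inside an argument of the outer one, the same way a function call can appear inside another call's arguments.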

You tokenize the stream in this form:

"The answer is " $ {
    "an " $ {
        if answer.is_even() { "even number" } else { "odd number" }
    }
    ""
    +
    answer
}
".  QED."

That would require you to manually push a ` into the input stream once you finish parsing an interpolated block (i.e. after ending the block with the last }). This seems to be the only requirement.
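A rough Rust sketch of such a tokenizer, assuming a stack of brace depths to tell an ordinary } from the one that ends an interpolated block. Instead of literally pushing a ` back into the input, it flips an `in_string` flag, which has the same effect of resuming string mode. `LitInterpStr` and `LitStr` mirror the token names in the grammar above; `InterpStart`, `InterpEnd`, `Word` and `tokenize` are made up for this example, `Word` is a coarse stand-in for all other tokens, and escapes and comments are ignored.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    LitInterpStr(String), // string segment that is followed by `${`
    LitStr(String),       // string segment ended by its closing quote
    InterpStart,          // `${`
    InterpEnd,            // the `}` that closes an interpolated block
    LBrace,
    RBrace,
    Word(String),         // any other run of characters (stand-in for real tokens)
}

fn tokenize(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    // One entry per open interpolated block: how many `{` are open inside it.
    let mut depths: Vec<usize> = Vec::new();
    let mut in_string = false;
    let mut i = 0;
    while i < chars.len() {
        if in_string {
            // String mode: read until a closing back-tick or `${`.
            let mut text = String::new();
            loop {
                match chars.get(i).copied() {
                    Some('`') | None => {
                        tokens.push(Token::LitStr(text));
                        i += 1;
                        break;
                    }
                    Some('$') if chars.get(i + 1) == Some(&'{') => {
                        tokens.push(Token::LitInterpStr(text));
                        tokens.push(Token::InterpStart);
                        depths.push(0);
                        i += 2;
                        break;
                    }
                    Some(c) => { text.push(c); i += 1; }
                }
            }
            in_string = false;
            continue;
        }
        match chars[i] {
            c if c.is_whitespace() => i += 1,
            '`' => { in_string = true; i += 1; }
            '"' => {
                // Plain double-quoted string (escapes not handled in this sketch).
                i += 1;
                let start = i;
                while i < chars.len() && chars[i] != '"' { i += 1; }
                tokens.push(Token::LitStr(chars[start..i].iter().collect()));
                i += 1;
            }
            '{' => {
                if let Some(d) = depths.last_mut() { *d += 1; }
                tokens.push(Token::LBrace);
                i += 1;
            }
            '}' => {
                match depths.last().copied() {
                    Some(0) => {
                        // End of an interpolated block: resume string mode, as if
                        // a back-tick had been pushed into the input here.
                        depths.pop();
                        tokens.push(Token::InterpEnd);
                        in_string = true;
                    }
                    Some(d) => {
                        *depths.last_mut().unwrap() = d - 1;
                        tokens.push(Token::RBrace);
                    }
                    None => tokens.push(Token::RBrace),
                }
                i += 1;
            }
            _ => {
                let start = i;
                while i < chars.len() && !chars[i].is_whitespace() && !"`\"{}".contains(chars[i]) {
                    i += 1;
                }
                tokens.push(Token::Word(chars[start..i].iter().collect()));
            }
        }
    }
    tokens
}
```

Run on the example string, this produces a flat stream in the same order as the listing above, including the empty `LitStr` segment right after the inner block ends.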