Open leodemoura opened 3 years ago
@Kha you were more active in the Zulip thread about this issue. Do you have further remarks, details, etc? BTW, what would be the notation for raw strings? Would we still have escaped characters for raw strings?
I don't think there was any consensus on raw strings, but personally my favorite syntax is Rust's, where you can just add more `#`s if you still have a collision with the string contents: https://doc.rust-lang.org/reference/tokens.html#characters-and-strings. Not trivial to implement, though.
I think raw strings should also turn off escapes. I would hope that there is not sufficient need for a syntax that allows escapes but no interpolation.
What makes Rust-style raw strings more difficult to implement? It's not a regular grammar, but any grammar that can handle parentheses should be able to express it, no? `s(#^n).*?(#^n)`
My concern was over the `FirstTokens` set, but it should probably be a new token parser (activated by `r[#"]`), in which case it can do whatever it wants.
> What makes Rust-style raw strings more difficult to implement? It's not a regular grammar, but any grammar that can handle parentheses should be able to express it, no? `s(#^n).*?(#^n)`
Note that your `?` is a context-sensitive disambiguator: `.*` can include `#^k` for `k` both smaller and greater than `n`, and it's simply impossible for a CFG to tell them apart. Even without the `?`, there's arguably still an implicit disambiguator in that most regex engines will only give you the "longest match".
At best you would have an ambiguity-preserving parse forest (à la SPPF from some Earley/GLR/GLL/etc. general CFG parser), and then disambiguate raw strings on that. (Amusingly, the disambiguation can be a no-op if `#^n` doesn't repeat in the entire input, or if all the other choices cause parse errors after their presumed end of the raw string. Not so funny if the correct choice has its own parse errors, though, since you have to disambiguate to know what to report.)
But if you already have more flexible tokenization (as seems to be the case here), you don't have to go through all that complication, and you can implement the counting directly. (there could be benefits to parse forests if you didn't already have a non-trivial recursive descent parser, but generally it's a messy tradeoff IME)
For more background on the context-sensitivity of Rust's own raw strings:
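The "implement the counting directly" approach above can be sketched as a tiny character-level lexer. This is only an illustration of the idea; names like `lexRaw` are made up here and are not part of Lean's actual tokenizer API:

```lean
-- Count leading '#'s, returning the count and the remaining input.
def takeHashes : List Char → Nat × List Char
  | '#' :: cs => let (n, rest) := takeHashes cs; (n + 1, rest)
  | cs => (0, cs)

-- Scan the body of a raw string opened with `n` hashes: the literal ends at
-- the first '"' that is followed by at least `n` '#'s.
partial def scanBody (n : Nat) (acc : List Char) :
    List Char → Option (String × List Char)
  | [] => none  -- unterminated raw string
  | '"' :: cs =>
    if cs.take n = List.replicate n '#' then
      some (String.mk acc.reverse, cs.drop n)
    else
      scanBody n ('"' :: acc) cs
  | c :: cs => scanBody n (c :: acc) cs

-- Lex `r#^n"..."#^n`, returning the contents and the rest of the input.
def lexRaw : List Char → Option (String × List Char)
  | 'r' :: cs =>
    let (n, rest) := takeHashes cs
    match rest with
    | '"' :: body => scanBody n [] body
    | _ => none
  | _ => none

#eval lexRaw "r#\"a \"quoted\" word\"#".toList
```

Because the lexer simply counts the opening `#`s and then looks for a matching run, no grammar-level ambiguity ever arises; the context-sensitivity is confined to this one token parser, which matches the `FirstTokens` suggestion above.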
Sebastian and I were discussing this today, and I had a few thoughts and concerns that he thought I should document here:
- Some uses of string literals expect the literal, non-interpolated string (for example, the `to_additive` attribute takes a string with a docstring to apply to the additivized declaration). We would at least need a function like `Lean.Syntax.isStrLit?` that gets a non-interpolated string literal, but which generates an error if it's an interpolated string.
- Having to escape `{`'s can in some cases be annoying (for example, either this docstring case or, one I've had experience with before, writing strings that generate LaTeX code). Perhaps there could be a prefix for non-interpolated strings; I'm not sure the complexity is justified, but it's worth considering. What would the prefix be? Perhaps raw strings are sufficient for all such use cases -- that said, Python's raw interpolated strings are sometimes useful.
- If someone writes `logInfo "e = {e}"`, they would be surprised that it has the `logInfo s!"e = {e}"` meaning instead of the `logInfo m!"e = {e}"` meaning (the latter is enabled by the `String -> MessageData` coercion). It seems string literals would need an `InterpString` typeclass to be able to choose meanings based on the expected type. Presumably the `String` interpretation would be a `default_instance`?

(Re the raw string discussion, there is an implementation of them at #2929.)
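A minimal sketch of what such a typeclass could look like, assuming the class and instance names are hypothetical (this is not Lean's actual elaboration machinery):

```lean
import Lean

-- Hypothetical `InterpString` class: an interpolated string literal would
-- elaborate via `InterpString.ofString` at the expected type.
class InterpString (α : Type) where
  ofString : String → α

-- Plain `String` stays the default meaning, as suggested above.
@[default_instance]
instance : InterpString String := ⟨id⟩

-- A `MessageData` instance would recover the `m!"..."` meaning for
-- `logInfo "e = {e}"` when the expected type is `MessageData`.
instance : InterpString Lean.MessageData :=
  ⟨fun s => Lean.MessageData.ofFormat (Std.Format.text s)⟩
```

With a setup along these lines, the elaborator could pick the interpretation from the expected type, falling back to `String` via the `default_instance` when the type is unconstrained.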
The notation `"..."` would now behave like `s!"..."`, and we would have a notation for raw strings.
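For concreteness, `s!` is Lean's existing interpolation prefix; under the proposal a plain literal would gain the same behavior (the second line is hypothetical future syntax, not something Lean accepts today):

```lean
def e := 42

-- Today, `s!` is required for interpolation:
#eval s!"e = {e}"   -- "e = 42"

-- Under the proposal, a plain literal would mean the same thing:
-- #eval "e = {e}"  -- would also produce "e = 42"
```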