Open leodemoura opened 3 years ago
@Kha you were more active in the Zulip thread about this issue. Do you have further remarks, details, etc? BTW, what would be the notation for raw strings? Would we still have escaped characters for raw strings?
I don't think there was any consensus on raw strings, but personally my favorite syntax is Rust's, where you can just add more `#`s if you still have a collision with the string contents: https://doc.rust-lang.org/reference/tokens.html#characters-and-strings. Not trivial to implement, though.
I think raw strings should also turn off escapes. I would hope that there is not sufficient need for a syntax that allows escapes but no interpolation.
What makes Rust-style raw strings more difficult to implement? It's not a regular grammar, but any grammar that can handle parentheses should be able to express it, no? `s(#^n).*?(#^n)`
My concern was over the `FirstTokens` set, but it should probably be a new token parser (activated by `r[#"]`), in which case it can do whatever it wants.
> What makes Rust-style raw strings more difficult to implement? It's not a regular grammar, but any grammar that can handle parentheses should be able to express it, no? `s(#^n).*?(#^n)`
Note that your `?` is a context-sensitive disambiguator: `.*` can include `#^k` for `k` both smaller and greater than `n`, and it's simply impossible for a CFG to tell them apart. Even without the `?`, there's arguably still an implicit disambiguator in that most regex engines will only give you the "longest match".
At best you would have an ambiguity-preserving parse forest (à la SPPF from some Earley/GLR/GLL/etc. general CFG parser), and then disambiguate raw strings on that. (Amusingly, the disambiguation can be a no-op if `#^n` doesn't repeat in the entire input, or if all the other choices cause parse errors after their presumed end of the raw string. Not so funny if the correct choice has its own parse errors, though, since you have to disambiguate to know what to report.)
But if you already have more flexible tokenization (as seems to be the case here), you don't have to go through all that complication, and you can implement the counting directly. (there could be benefits to parse forests if you didn't already have a non-trivial recursive descent parser, but generally it's a messy tradeoff IME)
For more background on the context-sensitivity of Rust's own raw strings:
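The "implement the counting directly" approach above can be sketched as a tiny character-level lexer. This is only an illustration of the idea; names like `lexRaw` are made up here and are not part of Lean's actual tokenizer API:

```lean
-- Count leading '#'s, returning the count and the remaining input.
def takeHashes : List Char → Nat × List Char
  | '#' :: cs => let (n, rest) := takeHashes cs; (n + 1, rest)
  | cs => (0, cs)

-- Scan the body of a raw string opened with `n` hashes: the literal ends at
-- the first '"' that is followed by at least `n` '#'s.
partial def scanBody (n : Nat) (acc : List Char) :
    List Char → Option (String × List Char)
  | [] => none  -- unterminated raw string
  | '"' :: cs =>
    if cs.take n = List.replicate n '#' then
      some (String.mk acc.reverse, cs.drop n)
    else
      scanBody n ('"' :: acc) cs
  | c :: cs => scanBody n (c :: acc) cs

-- Lex `r#^n"..."#^n`, returning the contents and the rest of the input.
def lexRaw : List Char → Option (String × List Char)
  | 'r' :: cs =>
    let (n, rest) := takeHashes cs
    match rest with
    | '"' :: body => scanBody n [] body
    | _ => none
  | _ => none

#eval lexRaw "r#\"a \"quoted\" word\"#".toList
```

Because the lexer simply counts the opening `#`s and then looks for a matching run, no grammar-level ambiguity ever arises; the context-sensitivity is confined to this one token parser, which matches the `FirstTokens` suggestion above.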
Sebastian and I were discussing this today, and I had a few thoughts and concerns that he thought I should document here:
- Some uses of string literals expect the literal, non-interpolated string (for example, the `to_additive` attribute takes a string with a docstring to apply to the additivized declaration). We would at least need a function like `Lean.Syntax.isStrLit?` that gets a non-interpolated string literal, but which generates an error if it's an interpolated string.
- Having to escape `{`'s can in some cases be annoying (for example, either this docstring case or, one I've had experience with before, writing strings that generate LaTeX code). Perhaps there could be a prefix for non-interpolated strings; I'm not sure the complexity is justified, but it's worth considering. What would the prefix be? Perhaps raw strings are sufficient for all such use cases -- that said, Python's raw interpolated strings are sometimes useful.
- If someone writes `logInfo "e = {e}"`, they would be surprised that it has the `logInfo s!"e = {e}"` meaning instead of the `logInfo m!"e = {e}"` meaning (the latter is enabled by the `String -> MessageData` coercion). It seems string literals would need an `InterpString` typeclass to be able to choose meanings based on the expected type. Presumably the `String` interpretation would be a `default_instance`?

(Re the raw string discussion, there is an implementation of them at #2929.)
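A minimal sketch of what such a typeclass could look like, assuming the class and instance names are hypothetical (this is not Lean's actual elaboration machinery):

```lean
import Lean

-- Hypothetical `InterpString` class: an interpolated string literal would
-- elaborate via `InterpString.ofString` at the expected type.
class InterpString (α : Type) where
  ofString : String → α

-- Plain `String` stays the default meaning, as suggested above.
@[default_instance]
instance : InterpString String := ⟨id⟩

-- A `MessageData` instance would recover the `m!"..."` meaning for
-- `logInfo "e = {e}"` when the expected type is `MessageData`.
instance : InterpString Lean.MessageData :=
  ⟨fun s => Lean.MessageData.ofFormat (Std.Format.text s)⟩
```

With a setup along these lines, the elaborator could pick the interpretation from the expected type, falling back to `String` via the `default_instance` when the type is unconstrained.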
The notation `"..."` would now behave like `s!"..."`, and we would have a notation for raw strings.
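For concreteness, `s!` is Lean's existing interpolation prefix; under the proposal a plain literal would gain the same behavior (the second line is hypothetical future syntax, not something Lean accepts today):

```lean
def e := 42

-- Today, `s!` is required for interpolation:
#eval s!"e = {e}"   -- "e = 42"

-- Under the proposal, a plain literal would mean the same thing:
-- #eval "e = {e}"  -- would also produce "e = 42"
```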