RFC: Syntax for raw string literals

lilyball commented 11 years ago

A raw string literal is a string literal that does not interpret any embedded sequence, meaning no backslash-escapes. A lot of languages (certainly most that I've used) support some syntax for raw string literals. They're useful for embedding any string that wants to have a bunch of backslashes in it (typically because the function the string is passed to wants to interpret them itself), such as regular expressions. Unfortunately, Rust does not have a raw string literal syntax.

There's been a discussion on the mailing list for the past few days about this. I will try to put a quick summary here.

There are two questions at stake. The first is, should Rust have a raw string literal syntax? The second is, if so, what particular syntax should be used? I think the answer to the first is definitely Yes. It's useful enough, and has enough overwhelming precedent in other languages, that we should add it. The question of concrete syntax is the harder one.

The syntaxes that have been proposed so far, along with their Pros and Cons:

C++11 syntax, e.g. R"delim(raw text)delim".

Pros:
- Reasonably straightforward
- Can embed any character sequence
Cons:
- Syntax is slightly complicated (editorial note: I think any syntax that's flexible enough to contain any character is going to be considered slightly complicated).
Python syntax, e.g. r"foo"

Pros:
- Simple syntax
Cons:
- Can't embed any character sequence.
- Python's implementation has really wacky handling of backslash escapes in conjunction with the quote character. Even reproducing that behavior does not allow for embedding any sequence, as r"foo\"" evaluates to the string foo\" (with the literal backslash).
D syntax, e.g. r"raw text", `raw text`, or q"(raw text)"/q"delim\nraw text\ndelim"

Pros:
- Can embed any character sequence (with the third variant)
Cons:
- The first two forms aren't flexible enough, and the third form is a bit confusing. The delimiter behaves differently depending on whether it's a "nesting" delimiter (one of ([<{), another token, or an identifier.
C#/SQL/something else, using a simple raw string syntax such as r"text" where doubling up the quote inserts a single quote, as in r"foo""bar"

Pros:
- Simple syntax
Cons:
- Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML).
Perl quote-like operators, e.g. q{text}. Unfortunately, most viable delimiters will result in an ambiguous parse.
Ruby quote-like operators, e.g. %q{text}. Unfortunately, this also is ambiguous (with the % token).
Lua syntax, e.g. [=[text]=]

Pros:
- Simple syntax
- Can embed any character sequence
Cons:
- Syntax looks decidedly non-string-like
- Custom delimiters are limited to sequences of =
- Alex Crichton opined that seeing println!([[Hello, {}!]], "world") in an introduction to Rust would be awfully confusing (see previous point about being non-string-like).
Go syntax, e.g. `raw text`. This is one of the variants of D strings as well.

Pros:
- Simple syntax
Cons:
- Cannot embed any character sequence (notably, cannot embed backtick)
- It's difficult or impossible to embed backticks in a markdown code sequence, which will make it awkward to use raw strings in markdown editors. May also be confusing with the usage of `foo` in doc comments.
A new syntax using ASCII Control characters STX and ETX

Pros:
- I don't think there are any
Cons:
- Can't type the characters on any keyboard
- Text editors probably won't render the characters correctly either
- Can't technically embed any character sequence, because ETX cannot be embedded, but in fairness it can embed any printable sequence.
A syntax proposed over IRC is delim"raw text"delim.

Pros:
- Can embed any character
Cons:
- Unusual syntax with no precedent in other languages. Functionally identical to C++11 syntax.
- Hard to type in Markdown editors

Some form of Heredoc syntax was also suggested, but heredocs are really primarily concerned with embedding multiline input, not raw input. They also have issues around dealing with indentation and the first/last newline.

During this discussion, only two Rust team members (that I'm aware of) chimed in. Alex Crichton raised issues with the Lua syntax, and threw out the suggestion of Go's syntax, though only as something to consider rather than a recommendation. Felix Klock expressed a preference for C++11 syntax, and more generally stated that he wants a syntax with user-delimited sequences. There was also at least one community member in favor of C++11 syntax.

My own preference at this point is for C++11 syntax as well. At the very least, something similar to C++11 syntax, that shares all of its properties, but there seems to be no value in inventing a new syntax when there's precedent in C++11.

thestinger commented 11 years ago

I think functionality equivalent to the C++11 syntax is best, but ideally not as noisy. We also need to consider how text editor syntax files will handle it, but I don't think it will be too much of a problem.

lilyball commented 11 years ago

@thestinger Do you have any suggestions? Functionally equivalent to C++11 requires something of the form <start token><user-supplied delimiter><delimiter end token><raw text><delimiter start token><user supplied delimiter><end token>. In C++11 the <start token> is R", the <delimiter end token> is (, the <delimiter start token> is ), and <end token> is ". I don't think you can remove any of these components without breaking functionality, and I don't think you can adjust the values to produce something less noisy.

The only adjustment I can think of would be to remove <delimiter start/end token> and instead require that the <user supplied delimiter> includes an appropriate punctuation to end it (and to start the close sequence), but if anything that makes it more confusing, not less.
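
To make the five-part decomposition above concrete, here is a minimal sketch of a splitter for the C++11 form. The function name `parse_cpp11_raw` is hypothetical, and the sketch ignores the restrictions that real C++ places on the delimiter; it only shows how the close sequence is rebuilt from the user-supplied delimiter.

```rust
/// Split a C++11-style raw literal into (delimiter, raw text).
/// Hypothetical helper for illustration only; it does not validate the
/// delimiter the way a real C++ lexer would.
fn parse_cpp11_raw(src: &str) -> Option<(&str, &str)> {
    // <start token>
    let rest = src.strip_prefix("R\"")?;
    // <user-supplied delimiter>, ended by the <delimiter end token> '('
    let open = rest.find('(')?;
    let delim = &rest[..open];
    let body = &rest[open + 1..];
    // <delimiter start token> ')' + delimiter + <end token> '"'
    let closer = format!("){}\"", delim);
    let end = body.find(closer.as_str())?;
    Some((delim, &body[..end]))
}

fn main() {
    let parsed = parse_cpp11_raw("R\"delim(raw text)delim\"");
    assert_eq!(parsed, Some(("delim", "raw text")));

    // The raw text may contain )" as long as the delimiter does not sit between them.
    let tricky = parse_cpp11_raw("R\"x(a)\" still inside)x\"");
    assert_eq!(tricky, Some(("x", "a)\" still inside")));

    println!("{:?} {:?}", parsed, tricky);
}
```

Dropping any one of the five components makes it impossible either to find where the delimiter ends or to recognize the close sequence, which is the point made above.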

ghost commented 11 years ago

How about R<space><user-supplied delimiter>"<raw text>"<user-supplied delimiter> and R"<raw text>" for the case where a simple " delimiter is sufficient.

R"c:\some\path\"
R eos"raw text"eos

lilyball commented 11 years ago

@stevenashley The lexer will see the R<space> and tokenize that as an identifier R. Trying to look ahead past spaces is, at best, highly confusing.

ghost commented 11 years ago

Ah, of course. I can't think of a substitute for <space> that would simultaneously look nice and parse well. Consider my proposal retracted.

Kimundi commented 11 years ago

How about this: r"" syntax, with the option to pad the string on both ends with #:

foo  ==  r"foo"
fo"o ==  r#"fo"o"#
##   ==  r###"##"###

As far as I know we don't allow # in an expression context, it's only valid as part of the attribute syntax, so this should work. Heck, it would even be ok in attributes themselves, I think:

#[foo = r##"test"##];

Alternatively, we could also throw away the r token itself and say that any number of # followed by " starts a raw string literal:

let regex = ##"[/s]+"##;

Or we make both forms valid: r"" for short raw strings, ##""## as alternative to cover every possible string.
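
For reference, the three rows of the table above can be checked against ordinary escaped literals. This is a small sketch that assumes the r"..." form with # padding exactly as written in the table, which is what current Rust accepts; the constant names are made up for the example.

```rust
// The three rows of the table, written in the padded r"..." form.
const PLAIN: &str = r"foo";
const WITH_QUOTE: &str = r#"fo"o"#;
const WITH_HASHES: &str = r###"##"###;

fn main() {
    // Each raw literal denotes the same string as its escaped counterpart.
    assert_eq!(PLAIN, "foo");
    assert_eq!(WITH_QUOTE, "fo\"o");
    assert_eq!(WITH_HASHES, "##");

    // No escape processing happens inside a raw literal:
    // this is a backslash followed by the letter n, two characters in total.
    assert_eq!(r"\n".len(), 2);

    println!("{} {} {}", PLAIN, WITH_QUOTE, WITH_HASHES);
}
```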

lilyball commented 11 years ago

@Kimundi That looks like a reinvention of Lua's [==[foo]==] syntax. It's certainly workable, but it shares the same problems that Lua syntax does (as pointed out by @alexcrichton).

alexcrichton commented 11 years ago

It's probably also worth re-mentioning the various use-cases which have a desire for some sort of syntax that is not what we have today:

Regular expressions. These contain lots of backslashes and normally escapes aren't even really that necessary. If we used the normal string syntax, everything would have to be double-escaped which is a pain. The main stickler about this desired syntax is that this would want to be very usable (in the sense that it shouldn't be a pain to write/read the strings of regular expressions, at least no more than it already is).
Literal Windows paths. Perhaps these should be done in a different manner to be portable, but regardless having to escape the \ character is a real pain. As with regular expressions, in theory the string syntax isn't difficult to read.
Giant blobs of raw text, such as formatting an HTML document (like what rustdoc does right now). This is different from regular expressions in that they don't need to be so easily readable (because the body of the text is normally very large), so the custom delimiters surrounding the text would, I believe, be fine in this case.
format! string directives. Right now it's a pain to print a \ character because you have to type println!("\\\\"). As with regular expressions though, this should be easy to read and easy to use (because it may be fairly common). Perhaps this should use a different escape though, which would make this irrelevant.

Those are the use cases that I could think of, others may have more
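
To put a number on the double-escaping cost in the first two use cases, here is a small sketch comparing escaped literals with the r"..." form floated earlier in the thread (and accepted by current Rust). The constant names and the `count_backslashes` helper are illustrative assumptions, not part of any proposal.

```rust
// Escaped and raw spellings of the same regular expression and Windows path.
const REGEX_ESCAPED: &str = "(\\d+):(\\d+)";
const REGEX_RAW: &str = r"(\d+):(\d+)";
const PATH_ESCAPED: &str = "C:\\Program Files\\rust\\bin\\rust.exe";
const PATH_RAW: &str = r"C:\Program Files\rust\bin\rust.exe";

/// Number of backslash characters in the resulting string (not in the source).
fn count_backslashes(s: &str) -> usize {
    s.chars().filter(|&c| c == '\\').count()
}

fn main() {
    // Both spellings produce identical strings.
    assert_eq!(REGEX_ESCAPED, REGEX_RAW);
    assert_eq!(PATH_ESCAPED, PATH_RAW);

    // The path holds four backslashes; the escaped source needs eight.
    assert_eq!(count_backslashes(PATH_RAW), 4);

    println!("{}\n{}", REGEX_RAW, PATH_RAW);
}
```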

ghost commented 11 years ago

Would case 4 become println!(R"(\\)") ?

alexcrichton commented 11 years ago

I suppose under the C++ syntax that's what it would be, which is arguably just as confusing as four backslashes.

Kimundi commented 11 years ago

@kballard I would say it's better than Lua's syntax here.

It has the same advantage of being able to delimit any text.
Only being limited to # is not a problem, you can still find a delimiter sequence for any input.
The default case r"" has very low typing overhead, and looks very similar to a regular string literal, no confusion about meaning.

Looking at @alexcrichton's use cases:

Regular expressions: r"([^:]*):(\d+):(\d+): (\d+):(\d+) (.*)$".match_groups()
Windows paths: r"C:\Program Files\rust\bin\rust.exe".to_path()
format! strings: println!(r"\\");
Blobs of text:

static MARKDOWN: &'static str = r###"
## Scope

This is an introductory tutorial for the Rust programming language. It
covers the fundamentals of the language, including the syntax, the
type system and memory model, generics, and modules. [Additional
tutorials](#what-next) cover specific language features in greater
depth.

This tutorial assumes that the reader is already familiar with one or
more languages in the C family. Understanding of pointers and general
memory management techniques will help.
"###;

alexcrichton commented 11 years ago

Those all look totally reasonable to me.

ghost commented 11 years ago

Me also. I prefer Kimundi's proposed syntax over C++11 syntax. Nicely done.

huonw commented 11 years ago

(All of these mean the token language is no longer regular, right?)

lilyball commented 11 years ago

@Kimundi: Regular expression:

r##"(\w+)   # match word chars
    "[^"]*" # followed by a quoted string
    (\d+)   # followed by digits"##.flag("x").match_groups();

# just looks like a very odd character here.

That said, I am not as averse to this syntax as I was initially. While I think it looks weird, and it would feel weird every time I type it, I would be ok with using it.

@huonw I believe you are correct. Is that particularly important?

pnkfelix commented 11 years ago

The restriction to sequences of # has some bad corner cases, where I can imagine one sitting and manually counting, since the eye does not immediately distinguish and/or match { #####, ###### and ##### } the same way it can with { #five# #six# and #five# }.

I would personally prefer C++11 (or any variant that does not restrict the user-selected token sequence to such an impoverished alphabet), and instead leave restrictions (e.g. to ##*) up to an end-developer policy (with checks for particular restrictions available as a lint).

The theoretician in me wants to say "here's a compromise: the end user sequence is strings drawn from a two element alphabet, for example the regexp #(#|_)*, or perhaps even (#|_)*" (Not 100% sure whether the latter is too broad.) Then I still get to write e.g. { #_#, ##_, #_# } which is easier on my eyes than the above encodings of five and six.

But it is not a big deal to me; it's certainly not as important as just having some choose-your-own delimiter option, even if it did end up being solely drawn from strings of #.

(one last note: I realized after I wrote this that I misrepresented kimundi's proposal slightly, since kimundi's proposal is not a mere restriction of the C++11 proposal, so it's not as if we could start with C++11 and just add a lint. But I think the rest of my note holds up. Especially the last part, where I said it's not a big deal to me. :) )

Kimundi commented 11 years ago

@pnkfelix: All fair points, however I think in practice you'd never need to have more than one or two #: It's only necessary to add more if your raw string literally contains "#, "##, "### etc.

@kballard: Likewise, in that example there would be no need for more than one #:

r#"(\w+)   # match word chars
   "[^"]*" # followed by a quoted string
   (\d+)   # followed by digits"#.flag("x").match_groups();

Personally, I'm wary of the "any string as delimiter" approach: It can more easily lead to inconsistencies and style issues because every literal might use a different one.

Restricting it to one character at least restricts the possible variations to one dimension, the length, and that people will tend to make as short as possible. ;)
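
The rule stated above (more # are only needed when the text contains a quote followed by #) can be written down directly. This is a sketch under that assumption; `hashes_needed` and `to_raw_literal` are hypothetical helper names, not anything in the compiler.

```rust
/// Smallest number of '#' needed so that `text` can sit inside r#"..."#.
/// A quote followed by a run of k hashes in the text forces k + 1 hashes.
fn hashes_needed(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut needed: usize = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'"' {
            // Length of the run of '#' immediately after this quote.
            let run = bytes[i + 1..].iter().take_while(|&&b| b == b'#').count();
            needed = needed.max(run + 1);
            i += run + 1;
        } else {
            i += 1;
        }
    }
    needed
}

/// Wrap `text` in the shortest hash-padded raw literal that can hold it.
fn to_raw_literal(text: &str) -> String {
    let hashes = "#".repeat(hashes_needed(text));
    format!("r{0}\"{1}\"{0}", hashes, text)
}

fn main() {
    assert_eq!(hashes_needed("foo"), 0);
    assert_eq!(hashes_needed("fo\"o"), 1);
    assert_eq!(hashes_needed("a\"#b"), 2);
    // The commented regex from the example above has quotes, but none followed by '#'.
    assert_eq!(hashes_needed("\"[^\"]*\" # followed by a quoted string"), 1);
    println!("{}", to_raw_literal("fo\"o"));
}
```

Since the count depends only on the longest quote-plus-hashes run, typical text with plain quotes never needs more than a single #.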

pnkfelix commented 11 years ago

@huonw's point (that a choose-your-own-delimiter implies non-regular token language) might be important, depending on what our lowest common denominator is for tool support.

E.g. if some IDE only supports regular tokens for its syntax highlighting. (Or a better example: If we don't want to put in the effort necessary to figure out how to handle non-regular languages on all the major IDEs that we hope to support.)

I'll try to bring this up at the weekly meeting on Tuesday, solely to determine whether a regular token language is a hard constraint or not. (That is, I hope to avoid a bikeshed during the meeting...)

ghost commented 11 years ago

@huonw Yep. Raw strings are not embeddable within a regular language as it means that the string terminator must also be regular. A document containing a terminator would be unable to be embedded.

I don't think it is a big problem as they are parsable by any regex engine supporting back references and non-greedy matching. For example: [rR](#*)"(.*?)"\1.

A regex that parses #five# etc is a little more complex but still workable. [rR](#*)([^"]*)\1"(.*?)"\1\2\1.
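
The back-reference in the first regex corresponds to a counter in a hand-written lexer. Below is a minimal sketch of that idea for the hash-padded form; `lex_raw_string` is a hypothetical name, it accepts only a lowercase r prefix, and it is not how any actual lexer is known to be written.

```rust
/// Try to lex a hash-padded raw string literal at the start of `src`.
/// Returns the literal's contents and the number of bytes consumed.
fn lex_raw_string(src: &str) -> Option<(&str, usize)> {
    let rest = src.strip_prefix('r')?;
    // Remember how many hashes opened the literal; this count plays the
    // role of the \1 back-reference in the regex.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let rest = rest[hashes..].strip_prefix('"')?;
    let body_start = 1 + hashes + 1;
    // The terminator is a quote followed by the same number of hashes.
    let terminator = format!("\"{}", "#".repeat(hashes));
    let body_len = rest.find(terminator.as_str())?;
    Some((&rest[..body_len], body_start + body_len + terminator.len()))
}

fn main() {
    assert_eq!(lex_raw_string("r\"foo\""), Some(("foo", 6)));
    assert_eq!(lex_raw_string("r#\"fo\"o\"# rest"), Some(("fo\"o", 9)));
    // An unterminated literal or a non-literal yields None.
    assert_eq!(lex_raw_string("r#\"unterminated\""), None);
    assert_eq!(lex_raw_string("x"), None);
    println!("ok");
}
```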

Kimundi commented 11 years ago

@pnkfelix @huonw: You could also just hack around that: If we pick a syntax that only differs in length, like my proposal, then external tools could hardcode, say, up to five variations. I don't think there are many cases in the wild that embed the string "#####.

Of course, that only "really" works if the failure case is something inconsequential like syntax highlighting failure.

lilyball commented 11 years ago

@Kimundi Given the number of non-regular languages out there (lots of languages have some equivalent of either raw strings or heredocs), I would be surprised if any tools would need hacks like that at all.

Kimundi commented 11 years ago

@kballard Right, just wanted to throw that out there as fallback workaround. :)

Kimundi commented 11 years ago

Because @pnkfelix alluded to it, and I also got a comment along those lines on IRC:

Even though I'd personally be not in favor of allowing it at all, if we'd want to allow arbitrary delimiter strings anyway, then that'd be still compatible with my proposal: Just allow any string not containing " or ending with whitespace between r# and " (The initial # being needed to make the lexer recognize it as a raw string literal).

Would certainly give good opportunities for self documenting literals:

static RUSTCODE: &'static str = 
r## CODE ##"
fn main() {
    // Example: This uses a raw string literal to embed a Windows-style file path directly.
    println(r"C:\Program Files\rust\bin\rust.exe");
}
"## CODE ##;

lilyball commented 11 years ago

@Kimundi I think allowing spaces (or any whitespace) is a mistake. Makes it harder to tell what's intentional and what's a typo in the source.

steveklabnik commented 11 years ago

Ruby also uses ' to not interpret, and " to interpret.

a = 5
puts "Value of a: #{a}"
# => "Value of a: 5"
puts 'Value of a: #{a}'
# => "Value of a: #{a}"

lilyball commented 11 years ago

@steveklabnik That syntax is incompatible with parsing lifetimes. If it weren't, I'd have already submitted a PR for supporting 4-char codes using 'FOUR' syntax.

steveklabnik commented 11 years ago

@kballard awesome, just wanted to make sure that all of the other implementations were covered in what we're looking at.

lilyball commented 11 years ago

According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular).

sp3d commented 11 years ago

I see this as a twofold issue, as 'raw' string literals are really separated into two groups from what I can tell. The use cases described so far are: regexes, which have lots of backslashes; Windows paths, which have lots of backslashes; giant blobs of raw text, which may contain literally anything as often such blobs are generated by other programs or are programs in an unknown-to-rust other language; and format! string directives, which have lots of backslashes.

So for 3/4 cases the only important attribute is a readable way to hold backslashes (which means that regular \-style escaping will not suffice). There are a few good proposals which solve this problem; my favorite is r"foo""bar" syntax where only the " char is handled specially (with doubling as the escape). The listed drawback to this approach is that it "Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML)."

No scheme will pass through verbatim every character in the source sequence for all sequences. The workaround of ensuring that any single character always passes through verbatim except if its context is composed of other characters which comprise the end delimiter is more complicated than an unconditional (character-based rather than sequence-based) escaping scheme and harder to quickly check.

Using r"foo""bar" syntax will also allow, if a user does insert text containing single double quotes, a compile-time failure so that they can fix the string. It's not a 100% solution since someone wanting two adjacent double-quotes (who would only get one back) would not be warned, but it's a very simple syntax which shouldn't take long for users to learn especially if their likely first mistake of using one quote instead of two would cause a compile-time error.
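
The behavior described here can be sketched as a tiny lexer for the doubled-quote scheme. This is only an illustration of the proposal as worded above; `lex_doubled` is a hypothetical name, and the "compile-time failure" is modeled simply as leftover input after the literal ends early.

```rust
/// Lex a literal of the form r"..." where "" stands for one quote.
/// Returns the decoded contents and the number of bytes consumed.
fn lex_doubled(src: &str) -> Option<(String, usize)> {
    let body = src.strip_prefix("r\"")?;
    let mut out = String::new();
    let mut chars = body.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            out.push(c);
        } else if matches!(chars.peek(), Some(&(_, '"'))) {
            // A doubled quote decodes to a single quote.
            chars.next();
            out.push('"');
        } else {
            // A lone quote ends the literal: 2 prefix bytes, i body bytes, 1 quote.
            return Some((out, 2 + i + 1));
        }
    }
    None // ran out of input without a closing quote
}

fn main() {
    // r"foo""bar" decodes to foo"bar and consumes the whole input.
    assert_eq!(lex_doubled("r\"foo\"\"bar\""), Some(("foo\"bar".to_string(), 11)));

    // Pasted text with single quotes ends the literal early; the leftover
    // input is what a compiler would then most likely reject.
    assert_eq!(lex_doubled("r\"say \"hi\"\""), Some(("say ".to_string(), 7)));

    // The silent case: a pasted a""b comes back as a"b with no error.
    assert_eq!(lex_doubled("r\"a\"\"b\""), Some(("a\"b".to_string(), 7)));
    println!("ok");
}
```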

I don't see a strong case for embedding large blobs of text in source, as that practice is poor form in general: editors rarely provide much support for working with arbitrary languages embedded in strings and the approach is increasingly awkward as the blobs grow in size. I would advise against encouraging this antipattern with language workarounds, especially considering that they do not fully solve the problem of escaping (either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step). Using an include! macro to reference separate files to insert data blobs seems a better approach, as it does fully avoid the problems in delimiters/escaping, and the data blobs can then be constructed separately and statically checked for correctness in their native language as part of the build process without having to extract them from Rust code for that purpose. As a language which likes to demonstrate what can be achieved with types I think it would be a shame to see big text blobs being considered idiomatic rust; we should rather discourage stringly-typed data.

ben0x539 commented 11 years ago

I'd really prefer to have format string syntax and regex syntax that simply use another escape character (like printf's and lua's %) over overloading \ to serve that role in way different contexts and then requiring people to select the right string literal syntax for every context. I realise that can't address the use case of hardcoded Windows paths. For embedding output by other programs into rust source code, I think it's reasonable to just pipe them through an adaptor first that properly escapes them if an include!() macro isn't appropriate there.

I'd prefer to keep the amount of different options of string literals (and I suppose the complexity of a correct lexer) as low as possible in this case. :(

lilyball commented 11 years ago

@sp3d

No scheme will pass through verbatim every character in the source sequence for all sequences.

Schemes that use user-controlled delimiters can pass verbatim every possible sequence, merely by modifying the delimiter appropriately.

@ben0x539 We already have fewer string literals than most languages (that is to say, we have one string literal).

sp3d commented 10 years ago

@kballard: right, as I discussed--by allowing a wide variety (such as is the case with user-defined ones) of schemes we can get around the fact somewhat, but that will not obviate the need to either update the scheme or escape the contents when making changes: "either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step".

In the interest of keeping the language simple and elegant I would think a solution involving a finite, small number of valid formats for contained text would be ideal, unless there is an as-yet-unmentioned good reason to be placing and maintaining large blobs of a different language inside Rust code.

lilyball commented 10 years ago

@sp3d "constantly worry about updating delimiters"? You make it sound like the contents of these raw strings change frequently, and with zero predictability. If I'm embedding a section of raw HTML, I'm pretty sure I can come up with a delimiter that's highly unlikely to show up.

The problem with the r"foo""bar" solution is that even for short strings, it really sucks when the string contains many quotes. For example if I need a raw string that contains a snippet of code ["this", "is", "a", "vector", "of", "&str"] then it's pretty bad: r"[""this"", ""is"", ""a"", ""vector"", ""of"", ""&str""]".
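
For comparison, here is the same snippet in the hash-padded form, next to a sketch of the encoder the doubling scheme would require. `to_doubled_literal` is a hypothetical helper written only to reproduce the doubled spelling shown above.

```rust
// The vector snippet, pasted verbatim into a hash-padded raw literal.
const SNIPPET: &str = r#"["this", "is", "a", "vector", "of", "&str"]"#;

/// Encode `text` under the doubled-quote scheme: wrap it in r"..." and
/// double every quote it contains.
fn to_doubled_literal(text: &str) -> String {
    format!("r\"{}\"", text.replace('"', "\"\""))
}

fn main() {
    let doubled = to_doubled_literal(SNIPPET);
    assert_eq!(
        doubled,
        r#"r"[""this"", ""is"", ""a"", ""vector"", ""of"", ""&str""]""#
    );
    // Twelve quotes in the snippet mean twelve extra characters, plus r"...".
    assert_eq!(doubled.len(), SNIPPET.len() + 12 + 3);
    println!("{}", doubled);
}
```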

MaddieM4 commented 10 years ago

We could always do something like r(delim)textdelim, for example, r(;)Some text that is terminated by a semicolon;. Not sure that's very readable, but it definitely seems easy to parse, and doesn't require the use of " characters.

pnkfelix commented 10 years ago

@kballard @Kimundi okay, the team gave the go-ahead to implement Kimundi's proposed r" with hash-tally delimited raw-strings. So now the fun begins; I'd be happy to help shepherd a PR through.

lilyball commented 10 years ago

@pnkfelix Huzzah! I'm quite happy to do the implementation myself

Kimundi commented 10 years ago

I'm currently also looking at the lexer code. If I'm right, this can be done with only local changes to one function. Working at the change atm.

lilyball commented 10 years ago

@Kimundi Yes I'm pretty sure it can be done in next_token_inner().

ben0x539 commented 10 years ago

@Kimundi I have most of a patch already, can we talk/compare notes on irc or so?

alexcrichton commented 10 years ago

Closed by #9674, nice work everyone!

boosh commented 7 years ago

r#""# really was a poor choice of delimiter. Didn't anyone think people might want to quote HTML which potentially contains loads of '"#' substrings? 👎

jonas-schievink commented 7 years ago

@boosh You can use an arbitrary number of # on both sides
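
A short sketch of what that looks like for HTML containing "# (the markup itself is made up for the example): two # on each side are enough, because the text never contains a quote followed by two hashes.

```rust
// The fragment link contains "# so a single-hash literal would end too early.
const LINK: &str = r##"<a href="#top">Back to top</a>"##;

fn main() {
    assert_eq!(LINK, "<a href=\"#top\">Back to top</a>");
    assert!(LINK.contains("\"#"));
    println!("{}", LINK);
}
```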

boosh commented 7 years ago

@jonas-schievink Ah great, thank you! I thought it was strange it hadn't been considered 👍

jibal commented 5 years ago

Excellent choice. For my own language design I've looked at every syntax out there as well as coming up with several of my own, and this is the least verbose and complex, while also allowing any string to be delimited. Fortunately, bad arguments, like the one for doubling ", were rejected. It's not true that "the only important attribute is a readable way to hold backslashes" -- both backslashes and double quotes are special characters in normal string literals. And the claim that "you must constantly worry about updating delimiters" is simply false. You will never in practice encounter a text containing "##.

I actually somewhat prefer Kimundi's second proposal, which is a superset of the one adopted: rX"text"X, where X is either empty or is any sequence starting with # ... of course it can be abused, but so can anything, and there's no need to abuse it, and I believe in giving programmers freedom.

rust-lang / rust

RFC: Syntax for raw string literals #9411