Closed lilyball closed 10 years ago
I think functionality equivalent to the C++11 syntax is best, but ideally not as noisy. We also need to consider how text editor syntax files will handle it, but I don't think it will be too much of a problem.
@thestinger Do you have any suggestions? Functionally equivalent to C++11 requires something of the form `<start token><user-supplied delimiter><delimiter end token><raw text><delimiter start token><user-supplied delimiter><end token>`. In C++11 the `<start token>` is `R"`, the `<delimiter end token>` is `(`, the `<delimiter start token>` is `)`, and `<end token>` is `"`. I don't think you can remove any of these components without breaking functionality, and I don't think you can adjust the values to produce something less noisy.
The only adjustment I can think of would be to remove `<delimiter start/end token>` and instead require that the `<user-supplied delimiter>` includes an appropriate punctuation to end it (and to start the close sequence), but if anything that makes it more confusing, not less.
How about `R<space><user-supplied delimiter>"<raw text>"<user-supplied delimiter>` and `R"<raw text>"` for the case where a simple `"` delimiter is sufficient.
R"c:\some\path\"
R eos"raw text"eos
@stevenashley The lexer will see the `R<space>` and tokenize that as an identifier `R`. Trying to look ahead past spaces is, at best, highly confusing.
Ah, of course. I can't think of a substitute for `<space>` that would simultaneously look nice and parse well. Consider my proposal retracted.
How about this: `r""` syntax, with the option to pad the string on both ends with `#`:
foo == r"foo"
fo"o == r#"fo"o"#
## == r###"##"###
As far as I know we don't allow `#` in an expression context, it's only valid as part of the attribute syntax, so this should work.
Heck, it would even be ok in attributes themselves, I think:
#[foo = r##"test"##];
Alternatively, we could also throw away the `r` token itself and say that any number of `#` followed by `"` starts a raw string literal:
let regex = ##"[/s]+"##;
Or we make both forms valid: `r""` for short raw strings, `##""##` as an alternative to cover every possible string.
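For reference, the `r""` form with optional `#` padding can be sketched in present-day Rust as follows. This is only an illustration of the proposal as stated; the constant names are made up for the example, and the bare `##"..."##` variant without the `r` prefix is not shown because it is not valid Rust.

```rust
// Each constant pairs a raw literal with the value it denotes.
// The constant names are hypothetical and exist only for this sketch.
const PLAIN: &str = r"foo";
const WITH_QUOTE: &str = r#"fo"o"#;
const WITH_HASHES: &str = r###"##"###;
const BACKSLASHES: &str = r"[\s]+";

fn main() {
    // The same values written as ordinary, escaped string literals.
    assert_eq!(PLAIN, "foo");
    assert_eq!(WITH_QUOTE, "fo\"o");
    assert_eq!(WITH_HASHES, "##");
    assert_eq!(BACKSLASHES, "[\\s]+");
    println!("{} {} {} {}", PLAIN, WITH_QUOTE, WITH_HASHES, BACKSLASHES);
}
```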
@Kimundi That looks like a reinvention of Lua's `[==[foo]==]` syntax. It's certainly workable, but it shares the same problems that Lua syntax does (as pointed out by @alexcrichton).
It's probably also worth re-mentioning the various use-cases which have a desire for some sort of syntax that is not what we have today:
`\` character is a real pain. As with regular expressions, in theory the string syntax isn't difficult to read.
`format!` string directives. Right now it's a pain to print a `\` character because you have to type `println!("\\\\")`. As with regular expressions though, this should be easy to read and easy to use (because it may be fairly common). Perhaps this should use a different escape though, which would make this irrelevant.
Those are the use cases that I could think of, others may have more.
Would case 4 become `println!(R"(\\)")`?
I suppose under the C++ syntax that's what it would be, which is arguably just as confusing as four backslashes.
@kballard I would say it's better than Lua's syntax here.
`#` is not a problem, you can still find a delimiter sequence for any input.
`r""` has very low typing overhead, and looks very similar to a regular string literal, no confusion about meaning.
Looking at @alexcrichton's use cases:
r"([^:]*):(\d+):(\d+): (\d+):(\d+) (.*)$".match_groups()
r"C:\Program Files\rust\bin\rust.exe".to_path()
`format!` strings: `println!(r"\\");`
static MARKDOWN: &'static str = r###"
## Scope
This is an introductory tutorial for the Rust programming language. It
covers the fundamentals of the language, including the syntax, the
type system and memory model, generics, and modules. [Additional
tutorials](#what-next) cover specific language features in greater
depth.
This tutorial assumes that the reader is already familiar with one or
more languages in the C family. Understanding of pointers and general
memory management techniques will help.
"###;
Those all look totally reasonable to me.
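As a rough check of the backslash-heavy cases above, the raw forms can be compared against their escaped equivalents in present-day Rust. The function names below are hypothetical, the regex is shortened from the one in the thread, and the Markdown body is a stand-in for the tutorial text.

```rust
// Escaped and raw spellings of the same Windows path.
fn escaped_path() -> &'static str {
    "C:\\Program Files\\rust\\bin\\rust.exe"
}

fn raw_path() -> &'static str {
    r"C:\Program Files\rust\bin\rust.exe"
}

// Escaped and raw spellings of the same (shortened) regex source.
fn escaped_pattern() -> &'static str {
    "([^:]*):(\\d+):(\\d+)"
}

fn raw_pattern() -> &'static str {
    r"([^:]*):(\d+):(\d+)"
}

// A multi-line blob; the leading newline after the opening quote is kept.
fn markdown() -> &'static str {
    r###"
## Scope
Stand-in text for the tutorial excerpt.
"###
}

fn main() {
    assert_eq!(escaped_path(), raw_path());
    assert_eq!(escaped_pattern(), raw_pattern());
    assert!(markdown().starts_with('\n'));
    assert!(markdown().contains("## Scope"));
}
```

Three hashes are used for the blob only to mirror the example in the thread; since the text contains no `"` at all, a plain `r"..."` would also be enough.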
Me also. I prefer Kimundi's proposed syntax over C++11 syntax. Nicely done.
(All of these mean the token language is no longer regular, right?)
@Kimundi: Regular expression:
r##"(\w+) # match word chars
"[^"]*" # followed by a quoted string
(\d+) # followed by digits"##.flag("x").match_groups();
`#` just looks like a very odd character here.
That said, I am not as averse to this syntax as I was initially. While I think it looks weird, and it would feel weird every time I type it, I would be ok with using it.
@huonw I believe you are correct. Is that particularly important?
The restriction to sequences of `#` has some bad corner cases, where I can imagine one sitting and manually counting, since the eye does not immediately distinguish and/or match { `#####`, `######` and `#####` } the same way it can with { `#five#`, `#six#` and `#five#` }.
I would personally prefer C++11 (or any variant that does not restrict the user-selected token sequence to such an impoverished alphabet), and instead leave restrictions (e.g. to `##*`) up to an end-developer policy (with checks for particular restrictions available as a lint).
The theoretician in me wants to say "here's a compromise: the end user sequence is strings drawn from a two element alphabet, for example the regexp `#(#|_)*`, or perhaps even `(#|_)*`" (Not 100% sure whether the latter is too broad.) Then I still get to write e.g. { `#_#`, `##_`, `#_#` } which is easier on my eyes than the above encodings of five and six.
But it is not a big deal to me; it's certainly not as important as just having some choose-your-own delimiter option, even if it did end up being solely drawn from strings of `#`.
(One last note: I realized after I wrote this that I misrepresented Kimundi's proposal slightly, since Kimundi's proposal is not a mere restriction of the C++11 proposal, so it's not as if we could start with C++11 and just add a lint. But I think the rest of my note holds up. Especially the last part, where I said it's not a big deal to me. :) )
@pnkfelix: All fair points, however I think in practice you'd never need to have more than one or two `#`: It's only necessary to add more if your raw string literally contains `"#`, `"##`, `"###` etc.
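The rule described here (more `#` are needed only when the text contains a `"` followed by a run of `#`) can be sketched as a small helper. `min_hashes` and `to_raw_literal` are hypothetical names invented for this illustration, not part of any proposal or of the compiler, and the sketch ignores other constraints a real lexer may impose on raw string contents.

```rust
/// Smallest number of `#` that lets `text` be written as a raw literal:
/// zero if it contains no `"`, otherwise one more than the longest run
/// of `#` that directly follows a `"`.
fn min_hashes(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut needed = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'"' {
            // Count the `#` characters immediately after this quote.
            let run = bytes[i + 1..].iter().take_while(|&&h| h == b'#').count();
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Wraps `text` in the shortest `r#..."..."#...` delimiters that work.
fn to_raw_literal(text: &str) -> String {
    let hashes = "#".repeat(min_hashes(text));
    format!("r{hashes}\"{text}\"{hashes}")
}

fn main() {
    for sample in ["foo", "fo\"o", "say \"#hi\"##", "##"] {
        println!("{} hashes: {}", min_hashes(sample), to_raw_literal(sample));
    }
}
```

Under this rule a text such as `##` with no quote in it needs no padding at all, which is consistent with the expectation that one or two `#` cover most real inputs.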
@kballard: Likewise, in that example there would be no need for more than one `#`:
r#"(\w+) # match word chars
"[^"]*" # followed by a quoted string
(\d+) # followed by digits"#.flag("x").match_groups();
Personally, I'm wary of the "any string as delimiter" approach: It can more easily lead to inconsistencies and style issues because every literal might use a different one.
Restricting it to one character at least restricts the possible variations to one dimension, the length, and that people will tend to make as short as possible. ;)
@huonw's point (that a choose-your-own-delimiter implies non-regular token language) might be important, depending on what our lowest common denominator is for tool support.
E.g. if some IDE only supports regular tokens for its syntax highlighting. (Or a better example: If we don't want to put in the effort necessary to figure out how to handle non-regular languages on all the major IDEs that we hope to support.)
I'll try to bring this up at the weekly meeting on Tuesday, solely to determine whether a regular token language is a hard constraint or not. (That is, I hope to avoid a bikeshed during the meeting...)
@huonw Yep. Raw strings are not embeddable within a regular language as it means that the string terminator must also be regular. A document containing a terminator would be unable to be embedded.
I don't think it is a big problem as they are parsable by any regex engine supporting back references and non-greedy matching. For example: `[rR](#*)"(.*?)"\1`.
A regex that parses `#five#` etc. is a little more complex but still workable: `[rR](#*)([^"]*)\1"(.*?)"\1\2\1`.
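The same matching can also be done without back references by counting the `#` characters by hand, which is roughly what a hand-written lexer would do. The following is a minimal sketch under that assumption; `scan_raw_string` is a hypothetical helper, not the actual rustc lexer code, and it handles only the lowercase `r` prefix with `#` padding. (Rust's own `regex` crate does not support back references, so the patterns above are not usable with it directly.)

```rust
/// Scans a raw string literal at the start of `src`.
/// Returns the literal's contents and the number of bytes consumed,
/// or `None` if `src` does not start with a complete raw string.
fn scan_raw_string(src: &str) -> Option<(&str, usize)> {
    let rest = src.strip_prefix('r')?;
    // Count the opening `#` padding.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let body = rest[hashes..].strip_prefix('"')?;
    // The terminator is a quote followed by the same number of `#`.
    let mut closer = String::from("\"");
    closer.push_str(&"#".repeat(hashes));
    let end = body.find(&closer)?;
    // `r` + hashes + opening quote + contents + terminator.
    let consumed = 1 + hashes + 1 + end + closer.len();
    Some((&body[..end], consumed))
}

fn main() {
    let source = r##"r#"fo"o"# tail"##;
    match scan_raw_string(source) {
        Some((contents, len)) => println!("contents={contents:?} consumed={len}"),
        None => println!("not a raw string"),
    }
}
```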
@pnkfelix @huonw: You could also just hack around that:
If we pick a syntax that only differs in length, like my proposal, then external tools could hardcode, say, up to five variations. I don't think there are many cases in the wild that embed the string `"#####`.
Of course, that only "really" works if the failure case is something inconsequential like syntax highlighting failure.
@Kimundi Given the number of non-regular languages out there (lots of languages have some equivalent of either raw strings or heredocs), I would be surprised if any tools would need hacks like that at all.
@kballard Right, just wanted to throw that out there as fallback workaround. :)
Because @pnkfelix alluded to it, and I also got a comment along those lines on IRC:
Even though I'd personally not be in favor of allowing it at all, if we wanted to allow arbitrary delimiter strings anyway, that would still be compatible with my proposal: just allow any string not containing `"` or ending with whitespace between `r#` and `"` (the initial `#` being needed to make the lexer recognize it as a raw string literal).
Would certainly give good opportunities for self documenting literals:
static RUSTCODE: &'static str =
r## CODE ##"
fn main() {
// Example: This uses a string raw literal to embed an windows-style file path directly.
println(r"C:\Program Files\rust\bin\rust.exe");
}
"## CODE ##;
@Kimundi I think allowing spaces (or any whitespace) is a mistake. Makes it harder to tell what's intentional and what's a typo in the source.
Ruby also uses `'` to not interpret, and `"` to interpret.
a = 5
puts "Value of a: #{a}"
# => "Value of a: 5"
puts 'Value of a: #{a}'
# => "Value of a: #{a}"
@steveklabnik That syntax is incompatible with parsing lifetimes. If it weren't, I'd have already submitted a PR for supporting 4-char codes using `'FOUR'` syntax.
@kballard awesome, just wanted to make sure that all of the other implementations were covered in what we're looking at.
According to the weekly meeting 2013-09-24, the regular language issue is a non-issue (because of a desire to allow comment nesting, which already makes it non-regular).
I see this as a twofold issue, as 'raw' string literals are really separated into two groups from what I can tell. The use cases described so far are: regexes, which have lots of backslashes; Windows paths, which have lots of backslashes; giant blobs of raw text, which may contain literally anything as often such blobs are generated by other programs or are programs in an unknown-to-rust other language; and format! string directives, which have lots of backslashes.
So for 3/4 cases the only important attribute is a readable way to hold backslashes (which means that regular `\`-style escaping will not suffice). There are a few good proposals which solve this problem; my favorite is `r"foo""bar"` syntax where only the `"` char is handled specially (with doubling as the escape). The listed drawback to this approach is that it "Does not reproduce verbatim every character found in the source sequence, which makes it slightly harder/more confusing to read, and more annoying to do things like pasting a raw string into your source file (e.g. raw HTML)."
No scheme will pass through verbatim every character in the source sequence for all sequences. The workaround of ensuring that any single character always passes through verbatim except if its context is composed of other characters which comprise the end delimiter is more complicated than an unconditional (character-based rather than sequence-based) escaping scheme and harder to quickly check.
Using r"foo""bar" syntax will also allow, if a user does insert text containing single double quotes, a compile-time failure so that they can fix the string. It's not a 100% solution since someone wanting two adjacent double-quotes (who would only get one back) would not be warned, but it's a very simple syntax which shouldn't take long for users to learn especially if their likely first mistake of using one quote instead of two would cause a compile-time error.
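To make the doubled-quote scheme concrete, here is a minimal sketch of how the body of such a literal could be decoded. It is not an implemented Rust syntax; `decode_doubled` is a hypothetical helper, and it reports a lone `"` as an error offset to stand in for the compile-time failure described above.

```rust
/// Decodes the text between the outer quotes of a hypothetical
/// `r"foo""bar"` literal: `""` becomes `"`, and a lone `"` is an
/// error reported as its byte offset.
fn decode_doubled(body: &str) -> Result<String, usize> {
    let mut out = String::new();
    let mut chars = body.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c == '"' {
            match chars.peek() {
                // A doubled quote collapses to a single quote.
                Some(&(_, '"')) => {
                    chars.next();
                    out.push('"');
                }
                // A single quote on its own is rejected.
                _ => return Err(i),
            }
        } else {
            out.push(c);
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", decode_doubled("foo\"\"bar"));
    println!("{:?}", decode_doubled("foo\"bar"));
    println!("{:?}", decode_doubled("a\"\"\"\"b"));
}
```

The third call shows the silent case mentioned above: four quotes in the source come back as two, with no warning.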
I don't see a strong case for embedding large blobs of text in source, as that practice is poor form in general: editors rarely provide much support for working with arbitrary languages embedded in strings and the approach is increasingly awkward as the blobs grow in size. I would advise against encouraging this antipattern with language workarounds, especially considering that they do not fully solve the problem of escaping (either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step). Using an include! macro to reference separate files to insert data blobs seems a better approach, as it does fully avoid the problems in delimiters/escaping, and the data blobs can then be constructed separately and statically checked for correctness in their native language as part of the build process without having to extract them from Rust code for that purpose. As a language which likes to demonstrate what can be achieved with types I think it would be a shame to see big text blobs being considered idiomatic rust; we should rather discourage stringly-typed data.
I'd really prefer to have format string syntax and regex syntax that simply use another escape character (like `printf`'s and Lua's `%`) over overloading `\` to serve that role in way different contexts and then requiring people to select the right string literal syntax for every context. I realise that can't address the use case of hardcoded Windows paths. For embedding output by other programs into Rust source code, I think it's reasonable to just pipe them through an adaptor first that properly escapes them if an `include!()` macro isn't appropriate there.
I'd prefer to keep the amount of different options of string literals (and I suppose the complexity of a correct lexer) as low as possible in this case. :(
@sp3d "No scheme will pass through verbatim every character in the source sequence for all sequences."
Schemes that use user-controlled delimiters can pass verbatim every possible sequence, merely by modifying the delimiter appropriately.
@ben0x539 We already have fewer string literals than most languages (that is to say, we have one string literal).
@kballard: right, as I discussed--by allowing a wide variety (such as is the case with user-defined ones) of schemes we can get around the fact somewhat, but that will not obviate the need to either update the scheme or escape the contents when making changes: "either you must constantly worry about updating delimiters, or you must escape occurrences in the blob; either way there must be a manual or automatic processing step".
In the interest of keeping the language simple and elegant I would think a solution involving a finite, small number of valid formats for contained text would be ideal, unless there is an as-yet-unmentioned good reason to be placing and maintaining large blobs of a different language inside Rust code.
@sp3d "constantly worry about updating delimiters"? You make it sound like the contents of these raw strings change frequently, and with zero predictability. If I'm embedding a section of raw HTML, I'm pretty sure I can come up with a delimiter that's highly unlikely to show up.
The problem with the `r"foo""bar"` solution is that even for short strings, it really sucks when the string contains many quotes. For example if I need a raw string that contains a snippet of code `["this", "is", "a", "vector", "of", "&str"]` then it's pretty bad: `r"[""this"", ""is"", ""a"", ""vector"", ""of"", ""&str""]"`.
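The cost of doubling can be seen by generating both spellings of the same snippet. This is only a sketch: `double_quotes` is a hypothetical helper that applies the doubling rule, and the doubled form it prints is the hypothetical syntax under discussion, not valid Rust.

```rust
/// Applies the doubled-quote escaping rule to `text`.
fn double_quotes(text: &str) -> String {
    text.replace('"', "\"\"")
}

fn main() {
    // The snippet itself, written with one `#` of padding.
    let snippet = r#"["this", "is", "a", "vector", "of", "&str"]"#;
    // How the same text would look under each scheme.
    println!("doubled: r\"{}\"", double_quotes(snippet));
    println!("hashed:  r#\"{}\"#", snippet);
}
```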
We could always do something like `r(delim)textdelim`, for example, `r(;)Some text that is terminated by a semicolon;`. Not sure that's very readable, but it definitely seems easy to parse, and doesn't require the use of `"` characters.
@kballard @Kimundi okay, the team gave the go-ahead to implement Kimundi's proposed r" with hash-tally delimited raw-strings. So now the fun begins; I'd be happy to help shepherd a PR through.
@pnkfelix Huzzah! I'm quite happy to do the implementation myself
I'm currently also looking at the lexer code. If I'm right, this can be done with only local changes to one function. Working at the change atm.
@Kimundi Yes I'm pretty sure it can be done in `next_token_inner()`.
@Kimundi I have most of a patch already, can we talk/compare notes on irc or so?
Closed by #9674, nice work everyone!
`r#""#` really was a poor choice of delimiter. Didn't anyone think people might want to quote HTML which potentially contains loads of `"#` substrings? 👎
@boosh You can use an arbitrary number of `#` on both sides.
@jonas-schievink Ah great, thank you! I thought it was strange it hadn't been considered 👍
Excellent choice. For my own language design I've looked at every syntax out there as well as coming up with several of my own, and this is the least verbose and complex, while also allowing any string to be delimited. Fortunately, bad arguments, like the one for doubling `"`, were rejected. It's not true that "the only important attribute is a readable way to hold backslashes" -- both backslashes and double quotes are special characters in normal string literals. And the claim that "you must constantly worry about updating delimiters" is simply false. You will never in practice encounter a text containing `"##`.
I actually somewhat prefer Kimundi's second proposal, which is a superset of the one adopted: `rX"text"X`, where X is either empty or is any sequence starting with `#`... of course it can be abused, but so can anything, and there's no need to abuse it, and I believe in giving programmers freedom.
A raw string literal is a string literal that does not interpret any embedded sequence, meaning no backslash-escapes. A lot of languages (certainly most that I've used) support some syntax for raw string literals. They're useful for embedding any string that wants to have a bunch of backslashes in it (typically because the function the string is passed to wants to interpret them itself), such as regular expressions. Unfortunately, Rust does not have a raw string literal syntax.
There's been a discussion on the mailing list for the past few days about this. I will try to put a quick summary here.
There are two questions at stake. The first is, should Rust have a raw string literal syntax? The second is, if so, what particular syntax should be used? I think the answer to the first is definitely Yes. It's useful enough, and has enough overwhelming precedent in other languages, that we should add it. The question of concrete syntax is the harder one.
The syntaxes that have been proposed so far, along with their Pros and Cons:
C++11 syntax, e.g. `R"delim(raw text)delim"`.
Pros:
Cons:
Python syntax, e.g. `r"foo"`
Pros:
Cons:
`r"foo\""` evaluates to the string `foo\"` (with the literal backslash).
D syntax, e.g. `r"raw text"`, `` `raw text` ``, or `q"(raw text)"`/`q"delim\nraw text\ndelim"`
Pros:
Cons:
C#/SQL/something else, using a simple raw string syntax such as `r"text"` where doubling up the quote inserts a single quote, as in `r"foo""bar"`
Pros:
Cons:
`q{text}`. Unfortunately, most viable delimiters will result in an ambiguous parse.
`%q{text}`. Unfortunately, this also is ambiguous (with the `%` token).
Lua syntax, e.g. `[=[text]=]`
Pros:
Cons:
`println!([[Hello, {}!]], "world")` in an introduction to Rust would be awfully confusing (see previous point about being non-string-like).
Go syntax, e.g. `` `raw text` ``. This is one of the variants of D strings as well.
Pros:
Cons:
`foo` in doc comments.
A new syntax using ASCII Control characters STX and ETX
Pros:
Cons:
A syntax proposed over IRC is `delim"raw text"delim`.
Pros:
Cons:
Some form of Heredoc syntax was also suggested, but heredocs are really primarily concerned with embedding multiline input, not raw input. They also have issues around dealing with indentation and the first/last newline.
During this discussion, only two Rust team members (that I'm aware of) chimed in. Alex Crichton raised issues with the Lua syntax, and threw out the suggestion of Go's syntax, though only as something to consider rather than a recommendation. Felix Klock expressed a preference for C++11 syntax, and more generally stated that he wants a syntax with user-delimited sequences. There was also at least one community member in favor of C++11 syntax.
My own preference at this point is for C++11 syntax as well. At the very least, something similar to C++11 syntax, that shares all of its properties, but there seems to be no value in inventing a new syntax when there's precedent in C++11.