toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Support single-quoted strings to avoid double \ #188

Closed bbangert closed 9 years ago

bbangert commented 11 years ago

The example for TOML shows how bad it gets when writing a Windows path that has \, unfortunately some things that get configured need to let a user enter a regular expression. Regular expressions are filled with back-slashes, and adding the extra backslash constantly is very painful.

To give TOML a fighting chance when someone might need to use a few backslashes in a config file, I propose using single quotes to designate a raw string, ie:

somevar = 'this is a \d+ \w+ type of string'

This would retain the existing double-quote rules to avoid breaking existing usage.
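To make the difference concrete, here is a rough Python sketch of how a parser might treat the two quote styles under this proposal. `decode_basic` and `decode_raw` are hypothetical helper names, and the escape table is a deliberately small subset rather than TOML's full set of escapes.

```python
import re

# Simplified escape table for double-quoted strings (assumption: only a
# handful of escapes are handled in this sketch).
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def decode_basic(body: str) -> str:
    """Decode the text between double quotes, processing backslash escapes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body) or body[i + 1] not in _ESCAPES:
                raise ValueError(f"invalid escape at offset {i}")
            out.append(_ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def decode_raw(body: str) -> str:
    """Decode the text between single quotes: nothing is interpreted."""
    return body


# What the user has to type today versus under the proposal.
basic_source = r"this is a \\d+ \\w+ type of string"
raw_source = r"this is a \d+ \w+ type of string"

# Both spellings yield the same regular expression.
assert decode_basic(basic_source) == decode_raw(raw_source)
pattern = re.compile(decode_raw(raw_source))
print(pattern.search("this is a 42 answer type of string") is not None)
```

The raw variant needs no decoding logic at all, which is the whole appeal: the config author types the regex exactly as the regex engine will see it.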

rafrombrc commented 11 years ago

I'll (perhaps unfairly) pile on here. Ben and I (among others) are working on a project (https://github.com/mozilla-services/heka) which is currently using TOML as the config language. Unfortunately, our config is in many cases going to contain a large number of regexes. The need to escape backslashes in TOML strings makes this very painful. We'd love to keep using TOML, we prefer it in most ways to the alternatives, but if we can't do so w/o needing to escape every backslash then we'll have to ditch it for YAML.

Thanks for your consideration!

hit9 commented 11 years ago

:+1: Just like the raw string (or unicode string) in Python. It's required especially when the string is a regular expression.

88Alex commented 11 years ago

The need to escape backslashes in TOML strings makes this very painful.

I completely agree. I really think raw strings are a great idea.

pnathan commented 11 years ago

:+1:

Raw string support would be very nice. Pretty sure it'd be easy to parse as well: easier than interpreting \ characters. :)

rafrombrc commented 11 years ago

We've got it working in a fork of this repo: https://github.com/bbangert/toml/commit/787b5d888542c02b1f249efa62214da28b995b6b. Having it working in a particular implementation is not the same as having it accepted as part of the spec, however, so we'll hold off on a pull request until and unless we get a green light that this is a welcome change.

redhotvengeance commented 11 years ago

:+1:

laurent22 commented 11 years ago

+1 that would be a great addition to the spec.

ambv commented 11 years ago

+1 for the use case but wouldn't it be clearer to specify it like this:

[section]
key1 = "normal strings"
key2 = r"raw strings like in Python"

This way a user of the configuration language doesn't have to remember which tick is which. To top it all off, I suggest specifying that backslashes are not special at all and that escaping quotation marks should be like this:

[section]
key3 = r"this is a very ""raw"" string, if you know what I mean."
key4 = r"C:\Users\ambv\"  # this would work

matrixik commented 11 years ago

Then maybe just allow using backticks (`) for strings instead of single and double quotes?

redhotvengeance commented 11 years ago

@matrixik I feel like using backticks instead of quotes would be breaking the expectations of most (all) users that quotes == strings.

But I could see using backticks for the proposed "raw string" (and perhaps that's what you meant).

88Alex commented 11 years ago

Backticks make sense. But in my opinion, single quotes are better. And the r"raw string" doesn't make sense to me, and it would be really hard to parse.

ricardobeat commented 11 years ago

+1 for backticks.

Why not unquoted strings? I have wondered about this since day one.

88Alex commented 11 years ago

Unquoted strings would drive parsers crazy. rawString1 = 'This is a raw string' is so much easier to parse than rawString2 = This is a raw string.

ricardobeat commented 11 years ago

@88Alex not really, since you would just detect the type (already being done) and stop on the first line break. It would introduce some ambiguity though.

88Alex commented 11 years ago

@ricardobeat On second thought, maybe you're right. However, this might severely confuse some programmers who expect strings to be contained by quotation marks.

88Alex commented 11 years ago

Also, does str = This is a string evaluate to "This is a string" or " This is a string"? (Note the space at the beginning.)

ricardobeat commented 11 years ago

@88Alex I'd go for an implicit trim().

any progress/news on this issue?

88Alex commented 11 years ago

@ricardobeat Then what do you do if you want a space at the beginning of your string?

ricardobeat commented 11 years ago

@88Alex quote it?

Anyway, I think we're detracting from the discussion. Fixing the escape behaviour is more important right now.

88Alex commented 11 years ago

@ricardobeat That defeats the whole point of unquoted strings.

Anyway, to get back on topic, I'm really in favour of single-quoted raw strings. Backticks are just unnatural.

pnathan commented 11 years ago

Again, +1.

Single-ticks are used for unescaped/uninterpolated strings in Perl, Python, and Bash. Possibly more languages than those, but those 3 make up a great deal of code by themselves. Argument via legacy is not a great argument, but it does mean that people will expect certain sorts of behavior from single ticks.

sorbits commented 11 years ago

+1 (/cc #80)

jswank commented 10 years ago

+1

This is the second most commented issue in this project, and one I've encountered in supporting TOML for configuration files. In addition to the use cases mentioned above and in the closed issue #80, I respectfully disagree with the comment that regexps are "too rare in a config file": many widely used projects rely on painless configuration of regexps. For instance, nginx, apache httpd, postfix, HAproxy, Varnish, etc.

Is using TOML to configure similar applications simply not a use case? Or does the perceived cost of supporting raw strings in the spec outweigh the added complexity of every config file requiring strings with backslashes?

bbangert commented 10 years ago

We've continued to use my fork which supports this as we need it for heka. Would love to see it get added so we can stop using my fork (which seems to be a few commits behind trunk now).

flowchartsman commented 10 years ago

This sure would be nice. I'm also working on a project that uses lots of regexes, and I'd just as soon not double-escape them all. Maybe use backticks? Backticks are good. I like backticks.

BurntSushi commented 10 years ago

I'd also like to weigh in and support this. I don't have any strong opinions on syntax.

BurntSushi commented 10 years ago

I am going to submit a PR tomorrow. I think it might be worth combining raw strings and multi line strings into one addition. Here's why.

I would definitely like to have r"..." be the syntax for raw strings. Both Rust and Python support this syntax as well, so it should be familiar. It also feels more consistent with the rest of TOML.

With that said, it seems like the major complaint of this approach is that double quotes need to be escaped somehow. Using "" would work, but seems less than ideal to me.

One way around this is to support a raw mode with multiline strings. So for example, r"""this is a "string" with quotes and \n in it""" would work as you'd expect. (And indeed, this is precisely how you'd accomplish this feat in Python. Or use single quotes.)

redhotvengeance commented 10 years ago

@BurntSushi Out of curiosity, why do you prefer r"" to single quotes? Aesthetically, single quotes seems more consistent with TOML than r"" (to me, at least).

BurntSushi commented 10 years ago

Basically, I'd rather not have to remember which quotation mark to use. r"..." seems much more explicit to me. I think that's about all I've got. (Is there precedent for single quoted strings being "raw" strings elsewhere? PHP rings a bell, but it's been a while.)

jprichardson commented 10 years ago

I'm in favor of single quotes, or at least of using a character that isn't alphanumeric. r feels misplaced and looks like a typo. Something like C#'s @ is decent. But overall, I'd much prefer single quotes. That's how PHP does it and it makes a lot of sense.

@mojombo what's your take?

mojombo commented 10 years ago

Coming from a Ruby background, where single quoted strings allow unescaped backslashes, I'd prefer the single quote syntax. The orphaned r out in front of a string feels weird and imbalanced to me, but that's probably just from lack of time spent with Python or other languages that use that type of adornment. Single quotes have the nice property of allowing unescaped double quotes too.

flowchartsman commented 10 years ago

The reason I suggested backticks is that they're easier to parse and less error-prone than multi-character, state-dependent sequences, and also less likely to appear inside a string than either type of quote, so there's less to escape, especially in longer strings. A contrived example:

`To quote someone: 'This is a backtick:\`'`

As opposed to:

'To quote someone: \'This is a backtick:`\''

It may only be one character difference here, but imagine longer, or even multiline strings. That, and it sort of gels with the way Go does strings, and I would love to see this become the new standard config file format for Go (especially given the nice reflection-enabled driver @BurntSushi wrote).

Plus I already have Go code written for it ;)

BurntSushi commented 10 years ago

I definitely prefer backticks over single quotes, if only because backticks are less frequent.

Consensus seems difficult. Let's take a step back and think about our goals:

flowchartsman commented 10 years ago

I don't like context-dependent delimiters myself. For one, there's too much choice and more coding involved in parsing them, especially if you want to handle brace pairs the way Perl does with all the nesting and whatnot. This approach is also kind of error prone as the docs note:

$s = q{ if($a eq "}") ... };

Causes an error because there's no opening bracket inside the string, which throws off the nesting. Yuck, no thanks.

On the outset, they seem useful, but in practice I find I usually just pick a character like '#' that I don't expect to find in the string, and this can differ from string to string and, worse, from developer to developer. See you later, consistency. I'd much rather just pick one standard, somewhat uncommon character and be done with it.

My vote would be for anything (including newlines in the config file) to be included in the raw string, regardless of which character is chosen.

BurntSushi commented 10 years ago

I don't really understand why you keep bringing up Perl. You don't need to worry about opening/closing because the user adjusts the delimiters:

# All would be valid
r"valid"
r#"val"id"#
r##"val"#id"##

This is the same with Lua's multiline strings:

[[valid]]
[=[val"id]=]
[==[val]=]id]==]

Parsing this seems rather straight-forward in practice with simple look-ahead. Hell, I think a regex with backreferences could even do the trick! r(#*)"(.*)"\1.
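As a quick check of that claim, here is the same pattern tried out in Python. This is only a sketch: `parse_raw` is a made-up helper, it assumes the raw-string token has already been isolated from the rest of the line (hence `fullmatch`), and the greedy `.*` means it is not a drop-in tokenizer.

```python
import re

# The pattern from the comment above; DOTALL lets "." span newlines.
RAW_STRING = re.compile(r'r(#*)"(.*)"\1', re.DOTALL)


def parse_raw(token: str):
    """Return the contents of a complete raw-string token, or None."""
    m = RAW_STRING.fullmatch(token)
    return m.group(2) if m else None


# The three valid examples from above, plus one with a missing closing '#'.
for token in ['r"valid"', 'r#"val"id"#', 'r##"val"#id"##', 'r#"unbalanced"']:
    print(token, "->", parse_raw(token))
```

The backreference `\1` is what forces the closing run of `#` to match the opening run, so the last token is rejected.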

flowchartsman commented 10 years ago

Nor do I understand why you love flexible delimiters so much, but my intention is not to start an argument. I bring up Perl because it is the language I have seen this sort of thing in the most, and because it's an extreme in allowing all sorts of different ways to delimit, and I've seen it confuse a lot of developers.

You're absolutely right that it's not so hard to parse, but I guess the opinion I was trying to express is that it IS nonetheless somewhat more complex to parse and that this would be (unnecessarily in my opinion) extended to anything that wanted to serialize it or provide context-sensitive highlighting for it. But that's just my two cents; I am in the "single standard character" camp.

BurntSushi commented 10 years ago

Serialization is a good point. I think parsing it is very easy, but serialization would require the encoder to incrementally check if a delimiter is ever used. It's not difficult, but definitely a little annoying.
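For illustration, the encoder-side check might look roughly like the following Python sketch. `encode_raw` is a hypothetical name, and the r#"..."# syntax is the one proposed in this thread, not anything in the TOML spec.

```python
def encode_raw(value: str) -> str:
    """Wrap value in r#"..."# using the fewest '#' that avoid a collision."""
    hashes = 0
    # The string would end early if the closing delimiter appeared inside
    # it, so keep adding '#' until the delimiter is absent from the value.
    while '"' + "#" * hashes in value:
        hashes += 1
    fence = "#" * hashes
    return f'r{fence}"{value}"{fence}'


print(encode_raw("valid"))
print(encode_raw('val"id'))
print(encode_raw('val"#id'))
```

The loop always terminates because the value is finite, but the encoder does have to scan the value once per candidate delimiter, which is the annoyance being described.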

I think we have two choices before us.

  1. Throw our hands up, pick a delimiter and say, "Raw strings are quoted with ?. Raw strings may not contain ? but may contain any other UTF-8 encoded character. There are no escapes; what you see is what you get."
  2. Be a bit more principled and allow the delimiters to change, e.g., r"..." or r#"..."# or r##"..."##, ... This trades simplicity for flexibility.

If we go with (1), then I think we have to dismiss " and ' as delimiters, since they are frequently used. I guess I'd support ` as a delimiter.

I think we just need a ruling so we can move on. @mojombo?

sorbits commented 10 years ago

You can allow a pair of single-quotes to be used as escape sequence (as suggested in issue #80), since another single-quoted string should never follow.

Allowing \' is a slippery slope, because you may want to embed that as a literal sequence as well.

Letting the user pick their own delimiters seems like a bad idea since we have already trained our brains to recognize 'single' and "double" quoted strings, not r|custom quoted| strings, where our brain cannot use simple pattern matching.

On 27 Jun 2014, at 13:53, Andrew Gallant wrote:

Serialization is a good point. I think parsing it is very easy, but serialization would require the encoder to incrementally check if a delimiter is ever used. It's not difficult, but definitely a little annoying.

I think we have two choices before us.

  1. Throw our hands up, pick a delimiter and say, "Raw strings are quoted with ?. Raw strings may not contain ? but may contain any other UTF-8 character. What you see is what you get."
  2. Be a bit more principled and allow the delimiters to change. This trades simplicity for flexibility.

If we go with (1), then I think we have to dismiss " and ' as delimiters, since they are frequently used. I guess I'd support ` as a delimiter.



BurntSushi commented 10 years ago

Good point about using '' for escaping, but I'd still rather use ` since it seems to be a more rarely used character, particularly in regexes (which is, I think, the primary use case for raw strings).

OK, so modify my first option to:

Throw our hands up, pick a delimiter and say, "Raw strings are quoted with ?. Raw strings may contain ? by using ?? but may contain any other UTF-8 encoded character. There are no other escapes; what you see is what you get."
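A rough Python sketch of that rule, with the delimiter left as a parameter since the thread has not settled on one. `decode_doubled` and `encode_doubled` are hypothetical names, and the functions assume the quoted token has already been cut out of the line.

```python
def decode_doubled(token: str, quote: str = "'") -> str:
    """Decode a raw string where the only escape is a doubled delimiter."""
    if len(token) < 2 or token[0] != quote or token[-1] != quote:
        raise ValueError("not a quoted string")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == quote:
            # A lone delimiter inside the body would have ended the string.
            if body[i + 1 : i + 2] != quote:
                raise ValueError(f"unescaped {quote} at offset {i + 1}")
            out.append(quote)
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def encode_doubled(value: str, quote: str = "'") -> str:
    """Inverse of decode_doubled: double every delimiter, leave the rest."""
    return quote + value.replace(quote, quote * 2) + quote


print(decode_doubled(r"'it''s a \d+ \w+ string'"))
print(encode_doubled("it's"))
```

Serialization becomes a single `replace`, and a backslash is never special, so a string may end with one.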

not r|custom quoted| strings, where our brain cannot use simple pattern matching.

Almost all raw strings would be r"..." or r#"..."#.

I think I can drop the context quotes in favor of `.

johanfange commented 10 years ago

Please don't underestimate the value of facilitating Windows paths. No need to alienate users and split the community.

Single quotes are familiar, and are also easy to type on European keyboards. It seems like the obvious choice.

This is something I'd be happy to present my non-programmer users with:

[logging]
logfile = 'C:\Temp\log.txt'

BurntSushi commented 10 years ago

@johanfange None of the suggested options for raw strings make writing Windows paths difficult. Specifically, \ is just a \ in raw strings.

mojombo commented 10 years ago

My preference is to use single quotes like Ruby. They are treated very simply:

Single quotes only support two escape sequences.

  • \' – single quote
  • \\ – single backslash

Except for these two escape sequences, everything else between single quotes is treated literally.

This makes it easy to use strings that have backslashes or double quotes in them without a bunch of fuss. You can get a literal \' with \\\'.
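Here is a minimal Python sketch of the rule as described, to show what it does to a few inputs. `decode_ruby_style` is a hypothetical name and takes only the text between the single quotes; it is not taken from any TOML implementation.

```python
def decode_ruby_style(body: str) -> str:
    """Decode single-quoted contents; only backslash-quote and
    backslash-backslash are escapes, everything else is literal."""
    out = []
    i = 0
    while i < len(body):
        nxt = body[i + 1 : i + 2]
        if body[i] == "\\" and nxt in ("'", "\\"):
            out.append(nxt)  # keep the escaped character, drop the backslash
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


for source in [r"\d+ \w+", r"\\\'", r"C:\Temp\log.txt", r"\\server101\Temp"]:
    print(source, "->", decode_ruby_style(source))
```

Character classes and ordinary Windows paths pass through unchanged, while any doubled backslash collapses to a single one.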

johanfange commented 10 years ago

@mojombo That's pretty nice, but unfortunately it breaks UNC-paths (network shares) on Windows, meaning Windows programmers get tons of headaches. A better choice could be to let '' produce '.

[logging]
logfile = '\\server101\Temp\log.txt'

@BurntSushi Granted. My point was more that this is relevant to non-programmer end-users.

As for back tick on a European keyboard:

Back-tick? Yes, the key left of your backspace. ´? No, you must press shift too!

sorbits commented 10 years ago

I’m not too fond of the ruby convention either. Occasionally I have a single slash in a literal string but require two, so I add another, but that fails, and then I remember that I need to write 3 slashes to produce 2. Other times I need to end a string with a slash…

I prefer my literal strings to be as literal as possible.

On 27 Jun 2014, at 17:31, johanfange wrote:

@mojombo That's pretty nice, but unfortunately it breaks UNC-paths (network shares) on Windows, meaning Windows programmers get tons of headaches.

[logging]
logfile = '\\server101\Temp\log.txt'

@BurntSushi Granted. My point was more that this is relevant to non-programmer end-users.

As for back tick on a European keyboard:

Back-tick? Yes, the key left of your backspace. No, that's ´, you must press shift too!



BurntSushi commented 10 years ago

This makes it easy to use strings that have backslashes or double quotes in them without a bunch of fuss. You can get a literal \' with \\\'.

But you need to use \\ to write a \. This doesn't help with writing regexes. Writing Perl character classes becomes pretty awful: \\d, \\w, etc.

johanfange commented 10 years ago

@BurntSushi No, you could just write '\d' or '\w', since \d is not an escape sequence. However, writing a literal regex matching a\b, i.e. a\\b, now requires writing 'a\\\\b'. Certainly non-obvious!

BurntSushi commented 10 years ago

@johanfange Ah, right.

So I guess '' with \' and \\ would work, except it makes Windows paths unfortunate to write.

What about the other suggestion, '' with double ' as an escape?

mojombo commented 10 years ago

@johanfange Good point. It's possible to eliminate the \\ escape and let backslash be a normal character in all cases except when used to escape a single quote as \'. This works great, except when you need to end a string with a backslash. Then you have '\Server101\Temp\' and now you have an escaped single quote on the end and all hope is lost. So let's count that as a non-viable option.
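The trailing-backslash failure can be reproduced with a small scanner. This is a Python sketch under the rule just described (backslash-quote is the only escape); `find_closing_quote` is a hypothetical helper, not code from any TOML parser.

```python
def find_closing_quote(source: str, start: int) -> int:
    """Return the index of the quote closing the string opened at start.

    Assumes backslash-quote is the only escape and that a backslash is
    otherwise an ordinary character. Returns -1 if the string never closes.
    """
    i = start + 1
    while i < len(source):
        if source[i] == "\\" and source[i + 1 : i + 2] == "'":
            i += 2  # skip the escaped quote
        elif source[i] == "'":
            return i
        else:
            i += 1
    return -1  # ran off the end of the input


ok = r"logfile = 'C:\Temp\log.txt'"
bad = r"share = '\\Server101\Temp\'"
print(find_closing_quote(ok, ok.index("'")))
print(find_closing_quote(bad, bad.index("'")))
```

In the second line the final quote is consumed as an escaped quote, so the scanner never finds an end to the string.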

Using '' as the escape will solve the problem, but it feels weird to me aesthetically, especially when escapes are done differently in double quoted strings.

BurntSushi commented 10 years ago

It feels weird to me too, but a consensus where everyone is happy seems impossible. I'd still prefer HEREDOC-like delimiters, then there's no escaping needed, ever. But I guess I'd settle for single quotes with '' escapes at this point.

flowchartsman commented 10 years ago

The github markdown uses ``` to delineate raw blocks, with the logic that, even if you might want a pair of quote-like characters in your string to represent some kind of empty string, you're very rarely going to use three in a row. I believe they chose backticks instead of quotes for a similar reason: they're comparatively rare.

I think at this point it should be a heredoc syntax where the last newline is eaten or backticks. Single quotes are a mistake, I feel.