Multi-line string literals (blocks of lines)

Tronic commented 5 years ago

I would suggest -- instead of, or in addition to Python-style """ literals -- using indented block syntax for multi-line string literals. E.g.

proc foo =
  let str = ":
    Hello
      World!
  stdout.write str

Where str is defined equivalent to

let str = "Hello\l  World!\l"

This syntax avoids the indentation problem with string literals that .unindent attempts to address. Also, for clarity, all string content appears within the block, not on the opening or closing lines as is the case with """.

The literal terminates as soon as the block ends (i.e. a non-empty line indented less is found), avoiding the need for """ at the end. This also avoids the need to escape double quotes that belong to the string.

Whitespace at the end of any line and empty lines at the end would be omitted (and could be added via escape sequences in the rare cases where needed). Whitespace-only lines in the middle would become simply \l (no matter if there are spaces or not). This removes any ambiguity with source code formatting and makes the intention explicit.

This suggestion proposes that a string block be indented by exactly two spaces (relative to the line with ": in it). Any further leading spaces would become string content.

This could still be used within parentheses or another expression, provided that the continuation of that expression is indented less than the string content.
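The rules described so far (fixed two-space margin, trailing whitespace dropped, LF after each line, block ends at the first less-indented non-empty line) can be sketched in a few lines. This is only an illustration in Python, not the lexer change itself; the helper name `read_string_block` and the list-of-lines input are assumptions made for the sketch, and escape sequences are ignored here.

```python
def read_string_block(lines, opener_index):
    """Return the text of the indented block opened on lines[opener_index].

    Hypothetical helper for illustration only; not part of the proposal.
    """
    opener = lines[opener_index]
    base_indent = len(opener) - len(opener.lstrip(" "))
    margin = base_indent + 2  # content sits exactly two spaces past the opener line

    content = []
    for line in lines[opener_index + 1:]:
        stripped = line.rstrip()  # trailing whitespace is never part of the string
        if stripped == "":
            content.append("")  # whitespace-only line: contributes a bare LF
            continue
        indent = len(stripped) - len(stripped.lstrip(" "))
        if indent < margin:
            break  # first non-empty line indented less ends the block
        content.append(stripped[margin:])  # extra indentation stays in the string

    while content and content[-1] == "":
        content.pop()  # empty lines at the end of the block are omitted
    return "".join(text + "\n" for text in content)


source = [
    'proc foo =',
    '  let str = ":',
    '    Hello',
    '      World!',
    '  stdout.write str',
]
block_text = read_string_block(source, 1)
print(repr(block_text))
```

For the `foo` example the sketch yields the same string as the equivalent literal given above, with `\l` written as `\n` in Python.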

Tronic commented 5 years ago

This is based on standard practices with text file formatting (removal of extra whitespace and adding LF after each line).

Adding \r explicitly at the end of each line completes the CR-LF sequence for Internet protocols (not even Windows needs it in text files anymore).

Any line within the block may be terminated by a backslash. This is useful for splitting otherwise overly long lines on multiple source code lines without adding LFs to the string, and on the last line to prevent the final newline.
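A rough Python sketch of how these two conventions could combine, assuming the block lines have already had their margin and trailing whitespace removed. The function name `join_block_lines` is hypothetical, and the sketch is deliberately simplified: it only understands the `\r` escape and a trailing backslash, and does not handle an escaped backslash.

```python
def join_block_lines(content_lines):
    # Simplification (an assumption of this sketch): only the \r escape and a
    # trailing backslash are handled; an escaped backslash is not recognised.
    pieces = []
    for text in content_lines:
        text = text.replace("\\r", "\r")
        if text.endswith("\\"):
            pieces.append(text[:-1])    # continuation: drop the backslash, add no LF
        else:
            pieces.append(text + "\n")  # every other line ends with LF
    return "".join(pieces)


# Explicit \r at the end of a line completes CR-LF.
response = join_block_lines([
    "HTTP/1.1 200 OK\\r",
    "content-length: 13\\r",
    "\\r",
    "Hello World!",
])

# A trailing backslash joins source lines and suppresses the final newline.
long_line = join_block_lines(["split across \\", "two source lines\\"])

print(repr(response))
print(repr(long_line))
```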

awr1 commented 5 years ago

This makes the heredoc situation in Nim too complicated IMO. """something goes here""".unindent is a simple and satisfactory solution that needs no further enhancement, plus the fact that "oh, you can use escape sequences now!" makes things weirder to me. If you have too many concerns about lengthy string literals, honestly it's better to just not use heredocs and just staticRead() from a textfile instead.

Tronic commented 5 years ago

The """ hack of Python is problematic precisely because it mixes source code formatting with string contents. Having PyDocs or .unindent "handle" this is far from satisfactory.

import strutils

# Correct output but messed up source code formatting
for i in 1..2:
  stdout.write("""<li>
  Item
</li>
""")

# Incorrect output (Item not indented)
for i in 1..2:
  stdout.write("""
    <li>
      Item
    </li>
    """.unindent)

# Proposed string literal: clean source code that matches output
for i in 1..2:
  stdout.write(
    ":
      <li>
        Item
      </li>
  )

Fixing this in Python would be quite problematic at this time, but Nim as a new language based on indented blocks definitely /should/ get it right.

Tronic commented 5 years ago

@awr1 Escape sequences and whitespace handling are mentioned for completeness. This proposal requires less of them than the current string literals do. Reading from external files is not really a solution. The need for longer string literals (beyond docstrings) is clear and that's why """ literals exist in the first place; their implementation just sucks.

awr1 commented 5 years ago

Then IMO the behavior of unindent() is probably incorrect, it should eliminate enough whitespace for up to the first non-whitespace character in the string (recording the number of whitespace characters as some variable x) and repeat that operation for every line in the string, eliminating only the first x whitespace characters.
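One possible reading of that behaviour, sketched in Python rather than Nim: take the indentation width x of the first non-blank line and remove at most x leading whitespace characters from every line. The name `unindent_by_first_line` is made up for this sketch, and treating "the first non-whitespace character" as "the first non-blank line" is an interpretation, not something the comment spells out.

```python
def unindent_by_first_line(text):
    lines = text.split("\n")
    # Width of the indentation in front of the first non-blank line.
    first = next((line for line in lines if line.strip()), "")
    x = len(first) - len(first.lstrip(" \t"))

    def drop_leading(line):
        # Remove at most x leading whitespace characters, never any content.
        n = 0
        while n < x and n < len(line) and line[n] in " \t":
            n += 1
        return line[n:]

    return "\n".join(drop_leading(line) for line in lines)


html = "\n    <li>\n      Item\n    </li>\n    "
result = unindent_by_first_line(html)
print(repr(result))
```

With this rule the relative indentation of `Item` survives, which is what the second example in the thread loses.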

awr1 commented 5 years ago

A new function could be probably added to strutils, or you could add a defaulted boolean option to unindent() to avoid breaking API compat. I agree that this problem should be fixed, but the core language should not have to change for it.

krux02 commented 5 years ago

Generally I like the idea. I never really liked triple string literals as they are messy. Yet I don't like to change the language for this minor annoyance if the workarounds that don't need a language change haven't been fully explored. Scala's solution to this problem is stripMargin

val speech = """Four score and
               |seven years ago""".stripMargin
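For readers who do not know Scala, the effect of stripMargin can be approximated as follows. This is a Python sketch that only mimics the default behaviour (leading blanks followed by a `|` are removed from each line); the function `strip_margin` is written for this illustration and is not a real library call.

```python
import re


def strip_margin(text, margin_char="|"):
    # On every line, remove leading spaces/tabs followed by the margin character.
    pattern = re.compile(r"^[ \t]*" + re.escape(margin_char), re.MULTILINE)
    return pattern.sub("", text)


speech = strip_margin("""Four score and
               |seven years ago""")
print(repr(speech))
```

Lines without a margin character, like the first one here, are left untouched.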

Another big problem is that I have no idea how to tell my editor (and GitHub and all the other editors out there) that ": is the start of an indentation-based string literal block.

Tronic commented 5 years ago

I made a quick proof of concept with minimal changes to the lexer. It needs some further work even if accepted into the language (like a separate lexer token type for this literal).

Araq commented 5 years ago

IMHO the syntax should be:


const foo = '''
  string literal here that
  needs no closing quotes

but it's far too late for this. Yet another way to write string literals is the last thing we need. We would need to patch nimpretty and every Nim syntax highlighter out there. And without highlighting support this feature seems to be quite dangerous.

Tronic commented 5 years ago

@krux02 Most editors seem to ship Nim mode already, and would probably update their handling promptly if the language was changed.

Meanwhile, this certainly is a problem, because many editors and the GitHub syntax highlighter treat everything that follows as a string until the next " appears somewhere else. Even with the current language syntax (with any language out there, really), they should stop processing a single-quoted string at the first newline.

Indentation is not so much a problem; one extra tab press at most, because standard auto-indent behaves well with this literal.

Library solutions cannot work properly because once the string is formed, information about source code indentation is no longer available. Adding another special character to denote margin isn't really helpful. Also, such solutions cannot avoid the need to escape quote marks within the literal, like the string block does.
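A small illustration of the information-loss argument, using Python's `textwrap.dedent` as a stand-in for any library post-processing step (it is not the Nim function under discussion). The intended string starts every line with two spaces, but once the literal exists, those two spaces cannot be told apart from source indentation.

```python
import textwrap

# Goal: the string "  a\n  b\n", i.e. both lines start with two spaces.
# Written inside code that is itself indented four spaces, the literal is:
literal = "      a\n      b\n"

# The library only sees six leading spaces per line; it cannot know that
# four of them were source formatting and two were content.
dedented = textwrap.dedent(literal)
print(repr(dedented))
```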

Tronic commented 5 years ago

In any case, fixing this sort of issue is much better to do at Nim 0.21, a language used by a handful of projects, rather than after 1.0. Using ": as the token also does not affect existing software (although I would like to see """ deprecated and eventually removed entirely -- far prior to 1.0 release). First I considered """: or similar, but that would break existing software. Also, ":, if put on its own line, provides visual cue to where the left margin of string content goes (given that a string block must be indented exactly two spaces, which is already the recommended indentation for Nim).

awr1 commented 5 years ago

Can this issue be moved to RFCs?

Tronic commented 5 years ago

Regarding the symbol used to start it, ": directly communicates that it is string and a block but has the disadvantage of being mishandled by existing tools. Something that is not considered to be a start of string would be less invasive, e.g. $: would probably communicate the same thing in Nim context but the content would be seen as code in syntax highlighters, and the colon might trigger smart indentation in some tools (in particular, those based on Python rules).

I am definitely open to this sort of suggestion, although I believe that in the long run the support of current tools should not really be a consideration. The benefit of ": is that it instantly triggers any coder to notice that something unconventional is happening, while with $: that might not be as apparent, and the content being a string would be not at all apparent to non-Nim coders.

juancarlospaco commented 5 years ago

YAML already has a very well known and documented construct for this, so why not just use that?

I think it is awesome that you can use literal JSON in Nim code directly, so maybe copy that feature of YAML too. YAML is an open format, and already supported by tons of software.

YAML syntax can be very friendly as the start of a block because it uses :> or :|. It could live in the sugar module; after all, that's what sugar is supposed to do.

let variable0 = :>
    YAML like literals.

let variable1 = :|
    YAML like literals.

https://en.wikipedia.org/wiki/YAML#Indented_delimiting :thinking:

SolitudeSF commented 5 years ago

:| and :> could clash with user-defined operators, while ": can't. But I don't see why this should be a language change if all it does is break every syntax highlighter.

juancarlospaco commented 5 years ago

sugar.`:>`

then :grey_question:

I agree that I dont feel a huge need for this. 🤷‍♀️

Tronic commented 5 years ago

FWIW, a comment at the end avoids problems with current highlighters without changing anything else (a simple hack - not part of RFC):

    await client.send ":
      HTTP/1.1 200 OK\r
      content-type: text/plain\r
      content-length: 13\r
      \r
      Hello World!
    #"

juancarlospaco commented 5 years ago

"a comment at the end avoids problems ... a hack"

:thinking:

SolitudeSF commented 5 years ago

@Tronic Which highlighters? GitHub doesn't highlight it correctly anyway. In my editor it's this

which is a correct representation of the current syntax, since " strings can't be multi-line. And no, #" is not a solution even if it worked.

Tronic commented 5 years ago

@SolitudeSF I use this in VSCode. Obviously tools need to be fixed, and that really shouldn't be a big issue. After all, they already manage to handle the mix of different quotation formats & comment parsing, incl. Nim-specific syntax and escape sequences.

juancarlospaco commented 5 years ago

For stuff like this I just use staticRead 🤷‍♀️

SolitudeSF commented 5 years ago

I don't see how this can be trivially fixed, since most editors use regex-based highlighting, which can't be indentation-aware.

juancarlospaco commented 5 years ago

Too bad you can not do the strformat formatted multi-line literal fmt""" """ in there. :crying_cat_face:

krux02 commented 5 years ago

@Tronic If editor support can't be provided, I can only reject this feature. What value does it have when virtually no editor will support it, or if it takes years until the editors have a solution for it? Also, I am the one who maintains the Emacs integration at this point; it is not as if Emacs will magically grow support for this feature.

awr1 commented 5 years ago

I'll admit I was wrong about unindent() not needing any change, but I would much prefer unindent() to be fixed. It honestly feels way too late in the game for a grammar change like this, especially one that may not be reliably workable with certain editor syntax highlighting engines.

Tronic commented 5 years ago

@SolitudeSF Regex cannot match indent?

\n([ \t]*)[^\n]*":\n(\1  [^\n]*\n|[ \t]*\n)*

matches this string block. Use backward lookup or editor's custom handling of captures, if necessary. Every serious editor implements some sort of recursive matching in addition to basic regex to be able to do parenthesis matching, to handle HTML closing tags etc.
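The regex can be tried out directly. The snippet below runs it unchanged with Python's `re` module, which supports the `\1` backreference; the sample source text is the `foo` example from the top of the thread. Note that the pattern starts with `\n`, so as written it would not match an opener on the very first line of a file.

```python
import re

# The pattern from the comment above, unchanged.
BLOCK = re.compile(r'\n([ \t]*)[^\n]*":\n(\1  [^\n]*\n|[ \t]*\n)*')

source = (
    'proc foo =\n'
    '  let str = ":\n'
    '    Hello\n'
    '      World!\n'
    '  stdout.write str\n'
)

match = BLOCK.search(source)
print(repr(match.group(0)))  # opener line plus every line of the block
print(repr(match.group(1)))  # indentation captured from the opener line
```

The match stops before `stdout.write str`, because that line is neither indented two spaces past the captured indentation nor blank.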

If nimpretty is a concern, I am sure I can quickly patch that as well.

GULPF commented 5 years ago

If this can be properly highlighted with a tmLanguage syntax definition (what VSCode and many other editors use), I would be interested to see how. I think it's impossible but I don't know for sure. I tested the YAML tmLanguage syntax definition and it seems pretty broken for strings.

Tronic commented 5 years ago

This sort of approach seems to work (tried in VSCode):

"begin": "( *)(\":)$",
"while": "^\\1  ",

I'll have a proper look later.

Clyybber commented 5 years ago

@Tronic AFAICT your regex cannot deal with arbitrary indentation. In Nim indenting with all numbers of spaces is allowed. If there exists a regex that can work with arbitrary indentation then I will support this feature.

Tronic commented 5 years ago

VSCode highlighter updated to support r": and ": literals. It seems to be working but needs more testing.

@Clyybber Surrounding code may be indented by an arbitrary number of spaces. String block contents must be indented by exactly two spaces, relative to the leading line, as discussed in this thread. This is to allow indentation to appear within string content, so any indentation on top of those two spaces is included in the string.

The highlighter marks string content and block indent with separate classes, so that in principle one could style and make the two-space margin visible by CSS effects (not that I recommend doing so).

Clyybber commented 5 years ago

@Tronic I don't think we should enforce those to be indented by exactly two spaces. Instead make the first line dictate the indentation, or make the line with the least indentation inside the string block dictate the indentation.
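The two alternatives differ once the first line is not the least indented one. A minimal Python sketch of both rules, with hypothetical function names and the block given as a list of raw source lines (error handling for the first-line rule is an assumption of the sketch, since the comment does not say what should happen):

```python
def indent_of(line):
    return len(line) - len(line.lstrip(" "))


def strip_by_first_line(lines):
    # Rule 1: the first line of the block dictates the margin.
    margin = indent_of(lines[0])
    out = []
    for line in lines:
        if not line.strip():
            out.append("")
        elif indent_of(line) < margin:
            raise ValueError("line indented less than the first line")
        else:
            out.append(line[margin:])
    return out


def strip_by_least_indent(lines):
    # Rule 2: the least indented non-blank line dictates the margin.
    margin = min(indent_of(line) for line in lines if line.strip())
    return [line[margin:] if line.strip() else "" for line in lines]


block = ["    Hello", "      World!"]
reversed_block = ["      World!", "    Hello"]

same_a = strip_by_first_line(block)
same_b = strip_by_least_indent(block)
least = strip_by_least_indent(reversed_block)
```

For `block` both rules agree; for `reversed_block` the first-line rule has to reject the second line (or define some other behaviour), while the least-indent rule keeps the two extra spaces in front of `World!`.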

timotheecour commented 5 years ago

@Tronic relevant discussion: https://forum.nim-lang.org/t/471#23415 (Does Nimrod have a heredoc syntax?) this RFC would have to compare its merits against heredoc.

pros of heredoc

(as used in D, see https://forum.nim-lang.org/t/471#23415):

visually clear (and easy to grep) where string ends
copy pasting a string doesn't require re-indenting it; but indenting at same level as code (followed by .unindent) is an option if user prefers to keep their string at block indent
works in all cases (a suitable identifier always exists that prevents a clash with given string); see note in https://forum.nim-lang.org/t/471#23415 regarding a non-ambiguous way to terminate the heredoc string that can represent any string even if it doesn't end with \n
works better with editors (see below, at least github and sublimetext is ok)

let s = q"EOS
This is a multi-line
heredoc string; no need to re-indentEOS"
echo s

produces: This is a multi-line\nheredoc string; no need to re-indent
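To make the delimiter rule concrete, here is a rough Python sketch of how such a literal could be read: find `q"IDENT` followed by a newline, then take everything up to the next `IDENT"`. This follows the variant shown in the example above (closing identifier allowed directly after the last character); the function name and the exact identifier rule are assumptions of the sketch, and it is not D's actual lexer.

```python
import re


def parse_heredoc(source):
    # Opener: q"IDENT followed by a newline; IDENT is chosen by the author.
    opener = re.search(r'q"([A-Za-z_][A-Za-z0-9_]*)\n', source)
    if opener is None:
        raise ValueError("no heredoc opener found")
    closer = opener.group(1) + '"'
    end = source.find(closer, opener.end())
    if end == -1:
        raise ValueError("unterminated heredoc")
    return source[opener.end():end]


snippet = (
    'let s = q"EOS\n'
    'This is a multi-line\n'
    'heredoc string; no need to re-indentEOS"\n'
    'echo s\n'
)
text = parse_heredoc(snippet)
print(repr(text))
```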

Araq commented 5 years ago

If the argument is "you can always come up with a delimiter that isn't used" then Nim's triple quotes work just as well:


const
  s = """
foobar
UNUSED_DELIM
baz
""".replace("UNUSED_DELIM", "\"\"\"")

Requires no language change and is easier to implement for highlighters as it doesn't involve a regex with backtracking (which is NP complete iirc?)
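The same trick carries over to Python's triple-quoted strings, shown here only as an illustration of the idea. One difference to keep in mind: Nim drops a newline that immediately follows the opening """, whereas Python keeps it, so the Python result starts with a newline.

```python
# Placeholder is swapped for the real delimiter after the literal is formed.
s = """
foobar
UNUSED_DELIM
baz
""".replace("UNUSED_DELIM", '"""')

print(repr(s))
```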

krux02 commented 5 years ago

This is how raw string literals work in C++11, where R"V0G0N( and )V0G0N" act as delimiters.

const char * vogon_poem = R"V0G0N(
             O freddled gruntbuggly thy micturations are to me
                 As plured gabbleblochits on a lurgid bee.
              Groop, I implore thee my foonting turlingdromes.   
           And hooptiously drangle me with crinkly bindlewurdles,
Or I will rend thee in the gobberwarts with my blurlecruncheon, see if I don't.
                (by Prostetnic Vogon Jeltz; see p. 56/57)
)V0G0N";

Not only does it allow specifying arbitrary delimiters that won't clash with the content, it would also allow writing editor extensions that detect such string blocks for syntax highlighting. Then you can have SQL strings, Python strings, etc., all with correct syntax highlighting. Currently Nim has call string literals, for example SQL"""select elephant from africa""". This already works partially, but it won't work that well for embedded Python strings, as """ is a very common Python token.

timotheecour commented 5 years ago

yes, that's C++'s version of D's heredoc string I mentioned above in https://github.com/nim-lang/RFCs/issues/161#issuecomment-523687596 . Ability to copy paste code without messing with replace to fixup a delimiter to escape (https://github.com/nim-lang/RFCs/issues/161#issuecomment-523795949) is nice. Yes, it's one more thing to learn though.

Tronic commented 5 years ago

@krux02 Theoretically it can overcome the delimiter appearing in content problem. In practice everyone just uses it as another form of """ and complains that R"(...)" looks uglier than the same thing in some other language. As @Araq pointed out, one should not be required to invent unique identifiers. Also, this sort of literal completely fails to address the indentation problem.

Indented-block literals make a clear separation between source code formatting (indent of the block) and string content (any characters within the block). This way clean source code formatting can be preserved without introducing extra whitespace into the string.

For me it is actually really hard to understand how, in the 2010s, people still design formats with issues that were widely understood and fixed in the 1990s, if not decades earlier. I presume that the argument has always been that "we cannot fix this because of compatibility" and that "it would take years". As I have demonstrated in this thread, fixing it both in the Nim compiler and in popular text editors took only a few hours of work, and frankly I've already spent far more than that here, arguing for it.

Araq commented 5 years ago

"As I have demonstrated in this thread, fixing it both in the Nim compiler and in popular text editors took only a few hours of work"

Well we need to check that. I'm not convinced that popular text editors can be "fixed".

krux02 commented 4 years ago

@Tronic I might have changed my opinion about this kind of string literals. They are very valuable for emit and asm statements. They are also very useful for my very own project here.

Clyybber commented 4 years ago

With the introduction of strutils.dedent this can now be closed:

import strutils
proc foo =
  let str = dedent """
    Hello
      World!
  """
  stdout.write str

foo()

will print

Hello
  World!

AmjadHD commented 2 years ago

So, I have to import strutils for this. IMO triple string literals should be dedented by default in 2.0 (like Julia).

Varriount commented 2 years ago

@AmjadHD I would suggest writing another proposal for that. I'm even optimistic that it would be accepted, since it's unlikely to cause too much breakage.

nim-lang / RFCs

Multi-line string literals (blocks of lines) #161
