kdl-org / kdl

the kdl document language specifications
https://kdl.dev

Dedented multiline strings for 2.0 #320

Closed tailhook closed 8 months ago

tailhook commented 1 year ago

Motivation

I think KDL is a great language for future CI systems like GitHub Actions, much better than YAML. But as it turns out, YAML is the only language that allows embedding scripts in a nice way.

I think this is something KDL can easily fix. While it's currently possible to postprocess KDL output to strip leading whitespace after parsing, the ways of doing that are quite heavyweight or don't make stripping optional:

  1. Strip in all strings (can't be turned off)
  2. Strip only in specific places (i.e. just for "scripts", but then there are also descriptions, etc.). The app itself will have to tackle these places one by one, and break backwards compatibility along the way
  3. Strip strings with a specially marked type: (dedent)"some multiline string". This makes types a bit weird, especially if types have significant value for this specific KDL document.

Other Languages

Previous discussion is quite short, although it mentions:

It's one of those things that looks good at first but when you actually hit the sharp edges, it can hurt a lot, especially in a language like KDL that isn't indentation-sensitive.

I've found four non-indentation-sensitive languages that have this concept:

  1. Nix (the language of Nix package manager and NixOS)
  2. Swift
  3. Julia
  4. Ruby

All of them dedent to the least-indented line. There is a slight difference between Nix on one side and Julia and Swift on the other: Nix doesn't take into account the line that has the closing quote, which Swift and Julia do. And there is a difference between Julia and Swift: Swift strips the trailing newline and Julia doesn't. Not sure about Ruby.

Proposal

  1. Add i"quoted" string types (where i stands for indented, or indentation-sensitive).
  2. Allow using the same technique as in raw strings: i###"Something that has "quotes" inside"###
  3. Dedent to the least-indented line; the line containing the closing quote is significant (as in Swift and Julia)
  4. Lines containing only whitespace are not included in the indentation size calculation
  5. Include the end-of-line character in the last line (as in Julia)
  6. Only the indentation in the original file is stripped, not anything following \n. So i"line1\n line2" is not dedented.
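
Points 3 to 5 can be sketched in Python roughly as follows. This is only an illustration of the rules as listed, not a reference implementation: the name `dedent_proposed` is hypothetical, indentation is counted in characters (spaces and tabs are not distinguished), only LF newlines are handled, and escape processing (point 6) is not modeled.

```python
def dedent_proposed(body: str) -> str:
    """Dedent the raw text found between the quotes of an i"..." string.

    `body` is everything between the opening and the closing quote,
    exactly as written in the file (hypothetical helper).
    """
    if "\n" not in body:
        return body  # single-line strings are left untouched

    lines = body.split("\n")
    first, content, closing = lines[0], lines[1:-1], lines[-1]
    if first.strip() or closing.strip():
        raise ValueError("expected a newline after the opening quote "
                         "and only whitespace before the closing quote")

    def indent_of(line: str) -> int:
        return len(line) - len(line.lstrip(" \t"))

    # Point 4: whitespace-only lines do not take part in the calculation.
    widths = [indent_of(line) for line in content if line.strip()]
    # Point 3: the line holding the closing quote does.
    widths.append(len(closing))
    margin = min(widths)

    # Point 5: every content line, including the last, keeps its newline.
    return "".join(line[margin:] + "\n" for line in content)


print(repr(dedent_proposed('\n  echo "hello world"\n  ')))
# prints 'echo "hello world"\n'
```

Because the closing-quote line takes part in the minimum, indenting the content deeper than the closing quote leaves the extra indentation in the result.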

Examples

Basic example:

script i###"
  echo "hello world"
  "###

Multiple strings example (you should not design your KDL so that this is common, but occasionally it's fine):

script description=i###"
  This is the best script ever
  "### script_body=i###"
  for i in seq 0 10
    echo "Hello world"
  end
  "###

Or probably this style is better:

script \
  description=i###"
    This is the best script ever
    "### \
  script_body=i###"
    for i in seq 0 10
      echo "Hello world"
    end
    "###

(Although description and script_body would be better as child nodes if this kind of usage is expected.)

Just for completeness, here is how you can include some whitespace at the end of text, or exclude closing end of line:

text_ending_in_space i###"
  something\n   \   
  "###

Is equivalent to:

text_ending_in_space "something\n   "

Alternatives

Status Quo

Doing nothing is an option. But that means that applications have to choose whether they want to dedent strings, and users have no control over that. This is discussed in the motivation part, but I'd like to add a few examples.

If an application decides to dedent, users can't override that even using some escaping techniques, like:

script "  line1\n  line2"

Here there is no whitespace to dedent under this spec (even if the string is marked as i"..."), but the app will still strip that indentation.

This can also be confusing like this:

group {
  script i###"
    print("""hello\n  world""")
  "###
}

Will produce an indented print, because the least-indented line after unescaping is world":

  print("""hello
world""")

(If that code is Python, it would crash with "unexpected indent")

If applications decide to dedent based on a special type marker, users will have to opt in, and that's quite long:

script (dedent)"  line1\n  line2"

And in this case types are harder to use for something that is essentially a type:

script (bash)"..."
script (python)"..."

Nix Semantics

An alternative is to use Nix's semantics, which are the same as Python's textwrap.dedent function and aren't influenced by the line with the closing quote. This allows you to indent the quotes differently:

script i###"
  echo "hello world"
"###

script \
  description=i###"
    This is the best script ever
  "### \
  script_body=i###"
    for i in seq 0 10
      echo "Hello world"
    end
  "###

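For comparison, Python's `textwrap.dedent` can be used to see this behaviour directly: the margin comes only from lines that have content, so the position of the closing quote does not matter. The variable names below are illustrative, and the newline that follows the opening quote is dropped by hand because `textwrap.dedent` itself keeps it.

```python
import textwrap

# Raw text between the quotes: the first body has the closing quote at
# column 0, the second has it indented by two spaces.
closing_at_margin = '\n  echo "hello world"\n'
closing_indented = '\n  echo "hello world"\n  '

results = []
for body in (closing_at_margin, closing_indented):
    # textwrap.dedent ignores whitespace-only lines when computing the margin.
    # The [1:] drops the newline that follows the opening quote.
    results.append(textwrap.dedent(body)[1:])

print(results[0] == results[1])  # prints True
print(repr(results[0]))          # prints 'echo "hello world"\n'
```
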
The biggest issue with these semantics is that it's impossible to produce an indented string. E.g. let's imagine we have some KDL-based domain-specific language to output text formats, and some of the outputs are indentation-sensitive. Here is the KDL:

output i"
  fruits:
  "
output i"
    apple:
      price: 10
    orange:
      price: 15
  "

Under the specification, the concatenation of the arguments of these two output nodes will be properly indented YAML:

fruits:
  apple:
    price: 10
  orange:
    price: 15

It's hard to reproduce the same if we use Nix semantics. Perhaps you have to use an escaped string for that:

output i"
  fruits:
"
output "\n  \
  apple:\n    \
    price: 10\n  \
  orange:\n    \
    price: 15
"

Do not Account Escapes

Skip (6) from the proposal. So the difference is that:

script i###"
  echo "hello\n  world"
  "###

Under the spec produces:

echo "hello
  world"

But with this proposal it's this instead:

echo "hello
world"
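
The difference between the two orderings can be reproduced with a small Python sketch. It uses `textwrap.dedent` as a stand-in for the dedent step (the closing-quote line is not modeled), and the `unescape` helper is hypothetical and only understands `\n`.

```python
import textwrap


def unescape(text: str) -> str:
    # Hypothetical helper: only the \n escape is handled, which is
    # enough for this illustration.
    return text.replace("\\n", "\n")


# The single source line between the quotes, indented by two spaces.
source = '  echo "hello\\n  world"\n'

# Proposal (point 6): strip the source indentation first, then process escapes.
dedent_then_unescape = unescape(textwrap.dedent(source))

# Alternative: process escapes first, then dedent the resulting text.
unescape_then_dedent = textwrap.dedent(unescape(source))

print(dedent_then_unescape)
print(unescape_then_dedent)
```

The first print keeps the two spaces before `world"`; the second one loses them, because after unescaping they look like ordinary indentation.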

It may also be confusing sometimes:

group {
  script i###"
    echo "hello\n  world"
  "###
}

Will produce an indented string under this proposal:

  echo "hello
world"

This is because the least-indented line is world" after unescaping.

Treat i"-strings as raw

I'm not sure if it's good. For some cases (scripts) it makes a lot of sense, to avoid double escaping. But escapes in scripts aren't so common. And being able to insert some nice unicode escape codes, or do special things with newlines and continuations in string might be useful.

Another good thing is that raw indented strings can be introduced later if i"-strings are not raw. Opt-out will be much harder.

Add ir" or ri" Strings

This can be done later. I'm not sure it's important enough for now.

Alternative Markers

Maybe use d"..." (i.e. dedented), or t".." (i.e. trimmed).

Other approaches:

Using backticks or triple quotes is possible too. But the r"..." precedent gives us a hint to use a character prefix (although it complicates the grammar a bit).

Ending Thoughts

Sorry for bringing this up again. But the previous discussion I've found is insufficient. And I think it is worth considering adding this to 2.0.

Sorry for the long text :)

eugenesvk commented 1 year ago

Thanks for a thoughtful proposal, just wrote part of it in response to https://github.com/kdl-org/kdl/discussions/173, but then realized there must already be an issue for that :)

I'd suggest sticking with quote prefixes like i"" (or any of your other suggestions), similar to r"", rather than having ''' or ```, as that extra word mnemonic makes it easier to remember

zkat commented 8 months ago

I'd like this in KDL 2.0. Not sure yet which specific implementation, though.

tabatkins commented 8 months ago

Having just done some work involving specifying large ASCII-art strings in a kdl document, yeah, I think this is reasonable to include in the lang.

A lot of good work was spent on the JS dedent proposal, and I agree with how they've solved a number of the corner cases. In particular, they only count literal whitespace as indentation, so you can escape a whitespace character to make it, and any subsequent whitespace, semantically part of the line rather than the indentation.

zkat commented 8 months ago

I really like the JS implementation, tbh. I'd be pretty happy with that one. Seems pretty clear and straightforward. Still gonna be a bit of a pain to define in the grammar, but w/e. @tailhook are you alright with going with JS semantics here?

Additionally, I believe this should be applied to ALL strings, without any special sigil. Example:

jobs foo {
    run "
        echo foo
           echo bar
      echo baz
    "
}

would give you a string value that looks like:

  echo foo
     echo bar
echo baz

tabatkins commented 8 months ago

Ooh, applying it to all is interesting. And absolutely workaround-able, if you do need all the indent; just start one line, any line, with an escaped space, so it'll drop the removable indent to nothing. It's a rare case, so putting it behind an arcane-but-easy workaround is acceptable, I think.

tabatkins commented 8 months ago

We'd have to spec that it only applies to multiline strings tho - don't want to strip leading spaces from a single-line string, especially if it's only whitespace.

And figure out what to do if the first line has non-ws content; JS dedent errors in that case. Maybe it just turns off the dedent as well? That way you can just paste a multiline block directly into your KDL when you don't care about it visually lining up, and you'll get exactly what you pasted; if you linebreak the content down and then indent it all, you'll instead get dedenting.

zkat commented 8 months ago

@tabatkins the JS spec requires that you have a newline character immediately after the opening `, and nothing else. We could make it a syntax error to have multiline strings that violate that invariant.

So this would become a syntax error:

node "<SP><LF>
  hello<LF>
"

Basically, if you don't have the closing " on the same line, the opening one MUST be immediately followed by a newline, per spec.
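
A minimal Python sketch of that check, assuming the raw text between the quotes is available as a string. The helper name `check_opening` is hypothetical, and only LF newlines are considered.

```python
def check_opening(body: str) -> None:
    """Reject multi-line string bodies that do not start with a newline.

    `body` is the raw text between the opening and closing quote
    (hypothetical helper, LF newlines only).
    """
    if "\n" in body and not body.startswith("\n"):
        raise SyntaxError(
            "multi-line string: the opening quote must be "
            "immediately followed by a newline"
        )


check_opening("single line")    # fine: closing quote is on the same line
check_opening("\n  hello\n")    # fine: newline right after the opening quote

try:
    check_opening(" \n  hello\n")  # the <SP><LF> case from the example
except SyntaxError as err:
    print("rejected:", err)
```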

tabatkins commented 8 months ago

That's possible, and honestly probably better. It means that, whether you trigger the dedent or not, your stuff is gonna line up with itself properly. Enforcing "make your shit look at least minimally readable" is good when we can get away with it ^_^

tailhook commented 8 months ago

My first thought on always-dedented strings is that code generation will be harder. There are use cases for generating some KDL documents via text templates, say generating KDL configs using Helm. I'm not sure how big an issue it will be in practice, though. (I think it is a big issue for YAML.)

Other than that, sounds good.

zkat commented 8 months ago

If I'm thinking of the same problem you're thinking about, a simple \n should work around that:

node "
  \nfoo
"

would be:


foo

zkat commented 8 months ago

This has landed in the kdl-v2 branch. Please feel free to review the implementation/wording and file a new issue if there's anything that should be fixed.

larsgw commented 6 months ago

Do raw strings need the dedent behavior as well? It seems a bit counter-intuitive to me.

tabatkins commented 6 months ago

Yes? Raw strings are just normal strings with escapes turned off; I don't see why that would affect whether they make sense multiline or not.

(If you don't want your raw string (or any string) dedented, just... don't indent the closing quote.)

larsgw commented 6 months ago

Raw strings are just normal strings with escapes turned off; I don't see why that would affect whether they make sense multiline or not.

Right, but now in the (admittedly unlikely) case you have text ending with a newline and trailing whitespace, you need to "escape" that trailing whitespace to avoid the dedent process.

(If you don't want your raw string (or any string) dedented, just... don't indent the closing quote.)

~At the moment, the grammar says any string must either be a single line, or follow the dedent requirements (newline ... newline whitespace).~

tabatkins commented 6 months ago

Right, but now in the (admittedly unlikely) case you have text ending with a newline and trailing whitespace, you need to "escape" that trailing whitespace to avoid the dedent process.

You don't, tho:

node prop=#"
    first line
    last line actually has some whitespace
        v-- ending here

    "#

larsgw commented 6 months ago

I guess so. What I meant is that there's still some post-processing going on.

tabatkins commented 6 months ago

I'm not sure what you mean by that.

Yes, the example I gave dedents the string. If you didn't want any dedent, you'd write:

node prop=#"
    all of these lines start with some spaces
    last line has more whitespace
        v-- ending here

"#

(aka just move the "# to the left margin)

larsgw commented 6 months ago

Hm, that looks fine too. Not sure what I was going for then, sorry.