python / cpython

The Python programming language
https://www.python.org

Compile time textwrap.dedent() equivalent for str or bytes literals #81087

Open gpshead opened 5 years ago

gpshead commented 5 years ago
BPO 36906
Nosy @rhettinger, @terryjreedy, @gpshead, @stevendaprano, @methane, @serhiy-storchaka, @MojoVampire, @Carreau, @pablogsal, @remilapeyre, @Marco-Sulla, @iforapsy, @jtojnar
PRs
  • python/cpython#13445
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/gpshead'
    closed_at = None
    created_at =
    labels = ['interpreter-core', 'type-feature', '3.10']
    title = 'Compile time textwrap.dedent() equivalent for str or bytes literals'
    updated_at =
    user = 'https://github.com/gpshead'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'iforapsy'
    assignee = 'gregory.p.smith'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation =
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36906
    keywords = ['patch']
    message_count = 46.0
    messages = ['342373', '342407', '342420', '342429', '342477', '342488', '342569', '342600', '342909', '342914', '342915', '342916', '342917', '342918', '342927', '342928', '342931', '342938', '342962', '342965', '342968', '342972', '343961', '343991', '350162', '356153', '356154', '356155', '356160', '356162', '356182', '356193', '356198', '356203', '356204', '365688', '381540', '381544', '381545', '381546', '381584', '381599', '381602', '381605', '382456', '398270']
    nosy_count = 14.0
    nosy_names = ['rhettinger', 'terry.reedy', 'gregory.p.smith', 'steven.daprano', 'methane', 'serhiy.storchaka', 'josh.r', 'mbussonn', 'pablogsal', 'remi.lapeyre', 'Marco Sulla', 'iforapsy', 'Miguel Amaral', 'jtojnar']
    pr_nums = ['13445']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36906'
    versions = ['Python 3.10']
    ```

    gpshead commented 5 years ago

    A common Python pattern is to keep everything indented so the code looks pretty, yet when the triple-quoted multiline string in question must not have leading whitespace, wrapping it in textwrap.dedent("""long multiline constant""") at runtime is the usual workaround.

    Rather than doing this computation at runtime, it would make sense to do it at compile time. A natural suggestion would be a new letter prefix for multiline string literals that triggers the dedent.

    Probably not worth "wasting" a letter on this, so I'll understand if we reject the idea, but it'd be nice to have rather than importing textwrap and calling it all over the place just for this purpose.
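
    For comparison, the existing runtime pattern looks like this (a minimal sketch using textwrap.dedent; the function and names are illustrative, not from this issue):

    import textwrap

    def knights_lyrics(name):
        # Dedented at runtime, every time the function is called.
        return textwrap.dedent(f"""\
            We're knights of the round table
            We dance whene'er we're able
            We dine well here in {name}
            """)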

    There are many workarounds but an actual syntax would enable writing code that looked like this:

    class Castle:
        def __init__(self, name, lyrics=None):
            if not lyrics:
                lyrics = df"""\
                We're knights of the round table
                We dance whene'er we're able
                We do routines and scenes
                With footwork impeccable.
                We dine well here in {name}
                We eat ham and jam and spam a lot.
                """
            self._name = name
            self._lyrics = lyrics

    This would avoid storing a larger, always-in-memory string literal in the code object that has to be converted to the desired dedented form at runtime via a textwrap.dedent() call. I chose "d" as the letter to mean dedent; I don't have a strong preference if we ever do make this a feature.

    rhettinger commented 5 years ago

    I agree that this is a recurring need and would be nice to have.

    stevendaprano commented 5 years ago

    +1

    There's a long thread on something similar here:

    https://mail.python.org/pipermail/python-ideas/2018-March/049564.html

    Carrying over into the following month:

    https://mail.python.org/pipermail/python-ideas/2018-April/049582.html

    Here's an even older thread:

    https://mail.python.org/pipermail/python-ideas/2010-November/008589.html

    In the more recent thread, I suggested that we give strings a dedent method. When called on a literal, the peephole optimizer may do the dedent at compile time. Whether it does or not is a "quality of implementation" factor.

    The idea is to avoid the combinational explosion of yet another string prefix:

    urd'...'  # unicode raw string dedent

    while still making string dedents easily discoverable; and with a sufficiently good interpreter, string literals will be dedented at compile time, avoiding any runtime cost:

    https://mail.python.org/pipermail/python-ideas/2018-March/049578.html
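
    For reference, a rough pure-Python approximation of what such a dedent would compute (a sketch only; it is simpler than textwrap.dedent, which also deals with mixed tabs and spaces):

    def dedent(text):
        # Strip the shortest common leading whitespace from all non-blank lines.
        lines = text.splitlines(keepends=True)
        margin = min(
            (len(line) - len(line.lstrip()) for line in lines if line.strip()),
            default=0,
        )
        return "".join(line[margin:] if line.strip() else line for line in lines)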

    gpshead commented 5 years ago

    Oh good, I thought this had come up before. Your method idea that could be optimized on literals makes a lot of sense, and is notably more readable than yet another letter prefix.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Hi, I have been looking to get more acquainted with the peephole optimizer. Is it okay if I work on this?

    gpshead commented 5 years ago

    I'd say go for it. We can't guarantee we'll accept the feature yet, but I think the .dedent() method with an optimization pass approach is worthwhile making a proof of concept of regardless.

    stevendaprano commented 5 years ago

    For the record, I just came across this proposed feature for Java:

    https://openjdk.java.net/jeps/8222530

    Add text blocks to the Java language. A text block is a multi-line 
    string literal that avoids the need for most escape sequences, 
    automatically formats the string in predictable ways, and gives the 
    developer control over format when desired.

    It seems to be similar to Python triple-quoted strings except that the compiler automatically dedents the string using a "re-indentation algorithm". (Which sounds to me similar, if not identical, to the one used by textwrap.)

    The JEP proposal says:

    A study of Google's large internal repository of Java source code 
    showed that 98% of string literals, once converted to text blocks 
    and formatted appropriately, would require removal of incidental 
    white space. If Java introduced multi-line string solution without 
    support for automatically removing incidental white space, then many 
    developers would write a method to remove it themselves and/or lobby 
    for the String class to include a removal method.

    which matches my own experience: *most* of my indented triple-quoted strings start with incidental whitespace that I don't care about. But not quite all, so I think backwards compatibility requires that, *by default*, triple-quoted strings are not dedented.

    Note that there are a couple of major differences between the JEP proposal and this one:

    The JEP also mentions considering multi-line string literals as Swift and Rust do them:

    https://github.com/apple/swift-evolution/blob/master/proposals/0168-multi-line-string-literals.md

    https://stackoverflow.com/questions/29483365/what-is-the-syntax-for-a-multiline-string-literal

    I mention these for completeness, not to suggest them as alternatives.

    gpshead commented 5 years ago

    Thanks, it's actually good to see this being a feature accepted in other languages.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Hi @steven.daprano, @gregory.p.smith. I added the first version of my PR for review.

    One issue with it is that in:

    def f():
        return "   foo".dedent()

    f will have both "   foo" and "foo" in its constants even though the first is no longer used. Removing it would require looping over the code once more while recording the constants seen in a set, and I was not sure whether that was acceptable.

    serhiy-storchaka commented 5 years ago

    Perform the optimization at the AST level, not in the peepholer.

    stevendaprano commented 5 years ago

    One issue with it is that in:

    def f():
        return "   foo".dedent()

    f will have both "   foo" and "foo" in its constants even if the first is not used anymore.

    That seems to be what happens with other folded constants:

    >>> def f():
    ...     return 99.0 + 0.9
    ...
    >>> f.__code__.co_consts
    (None, 99.0, 0.9, 99.9)

    so I guess that this is okay for a first draft. One difference is that strings tend to be much larger than floats, so this will waste more memory. We ought to consider removing unused constants at some point.

    (But not me, sorry, I don't have enough C.)

    Removing it requires looping over the code once more while marking the constants seen in a set and I was not sure if this was ok.

    That should probably be a new issue.

    stevendaprano commented 5 years ago

    Serhiy's message crossed with mine -- you should probably listen to him over me :-)

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Perform the optimization at the AST level, not in the peepholer.

    Thanks, this makes more sense.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Serhiy's message crossed with mine.

    And mine crossed with yours, sorry. I will update my PR shortly.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 5 years ago

    Thanks @serhiy.storchaka, it's far easier to do here. I pushed the patch to the attached PR. Is there a reason the other optimisations in the Peephole optimizer are not done in the AST?

    serhiy-storchaka commented 5 years ago

    The optimization that can be done in the AST is done in the AST.

    serhiy-storchaka commented 5 years ago

    While the string method works pretty well, I do not think this is the best way. If 98% of multiline strings need deindenting, it is better to do it by default. For the 2% that do not need deindentation, it can be prohibited by adding a backslash followed by a newline at the first position (except the start of the string). For example:

    smile = '''\
    
     XX
     XX      X
              X
        XXX   X
              X
     XX      X
     XX
    
    \\
    '''

    Yes, this is a breaking change. But we have from __future__ imports and FutureWarning. The plan may be:

    • 3.9: Implement from __future__ import deindent.
    • 3.11: Emit a FutureWarning for multiline literals that would be changed by dedenting if from __future__ import deindent is not specified.
    • 3.13: Make it the default behavior.

    stevendaprano commented 5 years ago

    While the string method works pretty well, I do not think this is the best way.

    Regardless of what we do for literals, a dedent() method will help for non-literals, so I think that this feature should go in even if we intend to change the default behaviour in the future:

    • 3.9: Implement from __future__ import deindent.
    • 3.11: Emit a FutureWarning for multiline literals that would be changed by dedenting if from __future__ import deindent is not specified.
    • 3.13: Make it the default behavior.

    And that gives us plenty of time to decide whether or not making it the default, rather than an explicit choice, is the right thing to do.

    gpshead commented 5 years ago

    Agreed, I'm in favor of going forward with this .dedent() optimization approach today.

    If we were to attempt a default indented multi-line str and bytes literal behavior change in the future (a much harder decision to make as it is a breaking change), that is its own issue and probably PEP worthy.

    eadbd4d3-cbb4-42b8-8420-9f80dcde2bcc commented 5 years ago

    I've tried PR 13445 a bit; I find this way nicer than textwrap.dedent(...), though I wonder if f-string readability (and expected behavior?) might suffer a tiny bit from the order of formatting the f-string vs. dedenting.

    In the following it is clear that dedent is after formatting:

    >>> dedent(f" {stuff}")

    It might be unclear for the following, especially if .dedent() gets sold as zero-overhead at compile time.

    >>> f" {stuff}".dedent()

    Could it be made clear (and tested; I don't believe it is now) that, with the peephole optimiser, dedent applies after formatting?

    Alternative modifications/suggestions/notes:

    stevendaprano commented 5 years ago

    It might be unclear for the following especially if .dedent() get sold as zero-overhead at compile time.

    Oh, please, please, please PLEASE let's not over-sell this! There is no promise that dedent will be zero-overhead: it is a method, like any other method, which is called at runtime. Some implementations *might* sometimes be able to optimize that at compile-time, just as some implementations *might* sometimes be able to optimize away long complex arithmetic expressions and do them at compile time.

    Such constant-folding optimizations can only occur with literals, since arbitrary expressions aren't known at compile-time. F-strings aren't string literals; they are executable code and can run things like this:

    f"{'abc' if random.random() \> 0.5 else 'xyz'}"

    So we don't know how many spaces each line begins with until after the f-string is evaluated:

    f"""{m:5d}
    {n:5d}"""
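
    With concrete values, for example, how much leading whitespace each line ends up with depends on the data:

    m, n = 7, 12345
    print(f"""{m:5d}
    {n:5d}""")
    # prints "    7" then "12345": the leading spaces depend on the values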

    Unless we over-sell the peephole optimization part, there shouldn't be anything more confusing about dedent than this:

    x, X = 'spam', 'eggs'
    f"{x}".upper()
    # returns 'SPAM' not 'eggs'

    Could it be made clearer with the peephole optimiser (and tested, I don't believe it is now), that dedent applies after-formatting ?

    We should certainly make it clear that dedent applies after formatting.

    Personally, I think we should soft-sell on the compile-time optimization until such time that the Steering Council decides it should be a mandatory language feature.

    Alternative modifications/suggestions/notes:

    • I can also see how having dedent applied **before** formatting with an f-string could be useful or less surprising (a d"" prefix could do that... just wondering what your actual goal is).

    I don't see how it will make any difference in the common case. And the idea here is to avoid yet another string prefix.

    • Is this supposed to deprecate textwrap.dedent?

    I don't think so, but eventually it might.

    eadbd4d3-cbb4-42b8-8420-9f80dcde2bcc commented 5 years ago

    Oh, please, please, please PLEASE let's not over-sell this!

    Sorry, I didn't want to give you a heart attack. The optimisation has been mentioned, and you never know what people will get excited about.

    Such constant-folding ...

    Well, here we might get that, but I kind of want to see how this is taught or explained; what I want to avoid is tutorials or examples saying that .dedent() is "as if you hadn't put spaces in front".

    I don't think so, but eventually it might.

    Ok, thanks.

    Again, just being cautious; and I see this is targeted at 3.9, so there is plenty of time. I believe this will be a net improvement for many codebases.

    methane commented 5 years ago

    Can we dedent docstrings too?

    Is there any string s for which inspect.cleandoc(s) != inspect.cleandoc(s.dedent())?
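
    One way to probe that question today is to use textwrap.dedent as a stand-in for the proposed method (an assumption; this only compares the two existing stdlib behaviours):

    import inspect
    import textwrap

    def same_after_cleandoc(s):
        # Does a prior dedent pass ever change what cleandoc produces?
        return inspect.cleandoc(s) == inspect.cleandoc(textwrap.dedent(s))

    print(same_after_cleandoc("  First\n      Second\n"))  # True for this sample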

    gpshead commented 5 years ago

    I think dedenting docstring content by default would be a great thing to do. But that's a separate issue, it isn't quite the same as .dedent() due to the first line. I filed https://bugs.python.org/issue37102 to track that.

    rhettinger commented 5 years ago

    We should consider dedicated syntax for compile-time dedenting:

    d"""\
      This would be left aligned
    
          but this would only have four spaces
    
      And this would be left-justified.
    """   # Am not sure what to do about this last line
    gpshead commented 4 years ago

    Another option not using a new letter: A triple-backtick token.

    def foo():
        value = ```
            long multi line string i don't want indented.
            ```

    A discuss thread was started so I reconnected it with this issue. See https://discuss.python.org/t/trimmed-multiline-string/2600/8

    serhiy-storchaka commented 4 years ago

    I think it would be better to use backtick quotes for f-strings instead of the f prefix. This would stress the special nature of f-strings (they are not literals, but expressions). But there was strong opposition to using backticks anywhere in Python syntax.

    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    If I can say my two cents:

    1. I would have preferred the default behaviour of multi-line strings to be dedenting. But breaking old code, even for a small percentage of code, IMHO is never a good idea. Py2->Py3 should have proved that.

    2. Yes, a new prefix really is a waste of a letter, because it's significant only for multiline strings. Anyway, if this solution is accepted, I propose t for trim.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 4 years ago

    Is there a reason folks are supporting a textwrap.dedent-like behavior over the generally cleaner inspect.cleandoc behavior? The main advantage of the latter is that it handles:

    '''First
    Second
    Third
    '''

    just fine (removing the common indentation from Second/Third), and produces identical results with:

    '''
    First
    Second
    Third
    '''

    where textwrap.dedent behavior would leave the first string unmodified (because it removes the largest common indentation, and First has no leading indentation) and would dedent the second but leave a leading newline in place (where cleandoc removes it); that newline can only be avoided by using the typically discouraged line continuation character to make it:

    '''\
    First
    Second
    Third
    '''

    With cleandoc behavior, the choice of whether the text begins and ends on the same line as the triple quotes doesn't matter, and most use cases seem like they'd benefit from that flexibility.
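
    A small runnable comparison of the two stdlib behaviours being contrasted here (outputs noted in the comments):

    import inspect, textwrap

    s1 = "First\n    Second\n    Third\n"        # text starts right after the opening quotes
    s2 = "\n    First\n    Second\n    Third\n"  # text starts on the following line

    print(inspect.cleandoc(s1))   # First / Second / Third
    print(inspect.cleandoc(s2))   # identical: First / Second / Third
    print(textwrap.dedent(s1))    # unchanged: "First" has no leading indentation
    print(textwrap.dedent(s2))    # dedented, but the leading blank line remains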

    gpshead commented 4 years ago

    .cleandoc is _probably_ more of what people want than .dedent? I hadn't bothered to even try to pick between the two yet.

    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    Anyway there's something strange in string escaping and inspect.cleandoc():

    >>> a = """
    ... \nciao
    ...     bello
    ... \ ciao
    ... """
    >>> print(inspect.cleandoc(a))
    ciao
        bello
    \ ciao
    >>> print("\ ciao")
    \ ciao

    I expected:

    >>> print(inspect.cleandoc(a))
    ciao
    bello
     ciao
    >>> print("\ ciao")
     ciao

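    (For what it's worth, "\ " is not a recognized escape sequence, so the backslash is kept in the string, and cleandoc never processes escapes; a quick check:)

    >>> "\ ciao" == "\\ ciao"
    True
    >>> len("\ ciao")
    6
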
    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    Excuse me for the spam, but against making it the default behavior I have a simple consideration: what would a person who reads the code, but doesn't know Python, expect?

    IMHO they expect the string to be *exactly* as it's written. The fact that it will be dedented is a bit surprising.

    For readability and for not breaking old code, I continue to be in favor of a letter before the multiline string. Maybe d, for dedent, is more appropriate than t, since it does more than just trim the string.

    But probably there's a better solution than the letter.

    serhiy-storchaka commented 4 years ago

    The user expects what they read in the documentation or what they learn in other programming languages. If we update the documentation, their expectations will change.

    As for other programming languages, Bash has an option for stripping all leading tab characters from a here document, and in Julia triple-quoted strings are dedented (https://docs.julialang.org/en/v1/manual/strings/#Triple-Quoted-String-Literals-1). Since Julia is a competitor of Python in science applications, I think that a significant fraction of Python users would expect Python triple-quoted strings to be dedented too, especially if they are dedented by help() and other tools.

    serhiy-storchaka commented 4 years ago

    The Julia syntax looks well thought out, so I suggest borrowing it.

    6740fb1f-e34b-433e-a958-b77f9c4427e8 commented 4 years ago

    Since when does Python emulate other languages?

    Who cares what other languages do? Python uses raise instead of throw, even though throw is much more popular in the most-used languages, only because raise makes more sense in English.

    And IMHO a newbie who sees a multiline string in the code does not read the documentation. It's evident that it is a multiline string, and they expect it to act as in English or any other written language: the text is exactly the one they read.

    On the contrary, if (s)he reads

    d""" Marco Sulla """

    maybe they think "this must be something different", and read the docs.

    dfccbadb-471a-4cdc-bcd7-93fbe865a193 commented 4 years ago

    A related issue (which I believe has no topic in this forum yet) is substituting an expression that results in a multiline string into a multiline f-string while matching its indentation. If a new type of string prefix is made to auto-dedent, maybe the substitutions should match the local indentation. Some related Stack Overflow posts:

    https://stackoverflow.com/questions/36739667/python-templates-for-generating-python-code-with-proper-multiline-indentation

    https://stackoverflow.com/a/57189263/2976410

    I.e. ideally we would have:

    def make_g_code():
      nl='\n'
      return d"""\
        def g():
          {nl.join(something(i) for i in range(n))}
          return something_else
        """

    This still has issues. The newline needs to be put into a variable, for instance (backslashes are not allowed inside f-string expressions). And when using such a template to generate code in other languages, a great many of them use braces to delimit blocks, and those need to be escaped inside f-strings.

    An implementation that works with spaces only (does not suit my case where mixed indentation is possible) is here:

    http://code.activestate.com/recipes/578835-string-templates-with-adaptive-indenting/

    Please let me know if this is the wrong place to comment on this issue.

    gpshead commented 3 years ago

    (assigning to me as I want to help remi.lapeyre's .dedent() method PR move forward)

    terryjreedy commented 3 years ago

    I am opposed to more prefix characters.

    Are any string literal optimizations done now at compile time, like for 'abc'.upper? It does not quite make sense to me to optimize just .dedent. But without the optimization, adding str.dedent without removing textwrap.dedent does not add a whole lot. (But it does add something, so I am mildly positive on the idea.)
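
    One way to check is to disassemble a method call on a literal; in current CPython the call is left for runtime rather than folded (a quick check, not part of the proposal):

    import dis

    # The string constant is loaded and .upper() is looked up and called at
    # runtime; method calls on literals are not constant-folded today.
    dis.dis(compile("'abc'.upper()", "<check>", "eval"))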

    terryjreedy commented 3 years ago

    I read to the end of the patch and found astfold_dedent.

    A small point in favor of textwrap.dedent is that it immediately announces that the string will be dedented.

    stevendaprano commented 3 years ago

    A string prefix would be a large language change that would need to go through Python-Ideas, a PEP and Steering Council approval. Let's not go there :-)

    A new string method is a comparatively small new feature that probably won't need a PEP or Steering Council approval.

    A compile-time optimization is an implementation feature, not a language feature. Whether devs want to follow up with more string optimizations like 'spam'.upper() is entirely up to them.

    Backwards compatibility implies that textwrap.dedent is not going away any time soon, but it should probably become a thin wrapper around str.dedent at some point.

    gpshead commented 3 years ago

    I expect several phases here:

    (1) add a .dedent() method to str (and bytes?) - behaviors to consider mirroring are textwrap.dedent() and inspect.cleandoc(). Given their utility and similarities, it makes sense to offer both behaviors; the behavior could be selected by a kwarg passed to the method (a rough sketch appears after the list of phases below).

    https://docs.python.org/3/library/textwrap.html#textwrap.dedent

    https://docs.python.org/3/library/inspect.html#inspect.cleandoc

    (2a) Ponder the d" prefix - but in general I expect sentiment to be against yet another letter prefix. They aren't pretty. This would need a PEP. Someone would need to champion it.

    (2b) Ponder making cleandoc dedenting automatic for docstrings. This would be enabled on a per-file basis via a from __future__ import docstring_dedent or similar, as Serhiy suggested. No prefix letter required. Several releases later we could consider making it the default. This will also need a PEP.

    (3) Optimizations when .dedent() is called on a constant? Nice to have, But I suggest we land (1) first as its own base implementation PR. Then consider the follow-ons in parallel.

    I believe the current patch contains (1)+(3) right now. If so, we should simplify it to (1) and make (3) an immediate followup, as saving the runtime cost and data space is worthwhile.

    Ultimate end state: probably 1+2b+3, but even 1+3 or 1+2b is a nice win.
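
    To make phase (1) concrete, here is a rough pure-Python sketch of the two candidate behaviors (the free-function form, the cleandoc keyword name, and its default are assumptions, not decided API):

    import inspect
    import textwrap

    def dedent(text, *, cleandoc=False):
        # Sketch only: the real proposal is a str/bytes method implemented in C.
        return inspect.cleandoc(text) if cleandoc else textwrap.dedent(text)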

    methane commented 3 years ago

    (1) add a .dedent() method to str (and bytes?) - behaviors to consider mirroring are textwrap.dedent() and inspect.cleandoc(). Given their utility and similarities, it makes sense to offer both behaviors; behavior could be selected by a kwarg passed to the method.

    I don't think we need two algorithms here. I'm +1 on adding a str.dedent() which mirrors only inspect.cleandoc().

    serhiy-storchaka commented 3 years ago

    Multiline string literals were added recently in Java (https://openjdk.java.net/jeps/378). The semantics are similar to Julia's: auto-dedent, with the first blank line ignored.

    Multiline string literals have similar semantics in Swift (https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html).

    methane commented 3 years ago

    I don't think we need two algorithms here. I'm +1 on adding a str.dedent() which mirrors only inspect.cleandoc().

    I withdraw this. If we add str.dedent(), it must not be optimized for triple-quoted literals.

    Auto dedenting is very nice to have. It can be different from inspect.cleandoc().

    We may be able to cleandoc() automatically, even without from __future__. This can be different from str.dedent() or auto-dedenting.

    We already have a separate issue for docstrings. And auto-dedenting will need a PEP. How about focusing on str.dedent() and changing the issue title?

    9382c1a1-d822-4ea2-ba53-cfe65cb23f7d commented 3 years ago

    One benefit of using a compile time feature over a runtime method is that the former allows for more predictable dedenting by first dedenting and only then interpolating variables.

    For example, the following code does not dedent the test string at all:

    import textwrap
    
    foo = "\n".join(["aaa", "bbb"])
    
    test = textwrap.dedent(
        f"""
        block xxx:
            {foo}
        """
    )

    It would be much nicer if we had syntactical dedents based on the leftmost non-whitespace column in the string, as supported by Nix language, for example.

    test = (
        df"""
        block xxx:
        {textwrap.indent(foo, ' '*4)}
        """
    )

    It would be even nicer if the interpolated strings could be indented based on the column the interpolation occurs in but that can at least be done at runtime with only minor inconvenience.
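
    A runtime approximation of the "dedent first, then interpolate" ordering is possible today by formatting after dedenting, using str.format instead of an f-string (a sketch; the indent width is chosen to match the insertion column):

    import textwrap

    foo = "\n".join(["aaa", "bbb"])

    # Dedent the template first (the braces survive dedenting), then substitute,
    # indenting the multi-line value to match the column it is inserted at.
    template = textwrap.dedent("""\
        block xxx:
            {foo}
        """)
    test = template.format(foo=textwrap.indent(foo, " " * 4).lstrip())
    print(test)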

    stevendaprano commented 3 years ago

    This feature (or one very like it) has been requested again, this time with an "i" prefix:

    https://discuss.python.org/t/indented-multi-line-string-literals/9846/1

    julianfortune commented 1 year ago

    What do folks think about taking an approach similar to Scala's stripMargin:

    >>> my_str = """
    ...    |Foo
    ...    |     Bar
    ...    |  Baz
    ...    """.stripmargin()
    Foo
        Bar
      Baz

    Seems it could be achieved with only the addition of a new method to str, no?
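
    A rough pure-Python sketch of such a method, written as a free function (the name and the marker handling are assumptions modeled on Scala's stripMargin):

    def stripmargin(text, marker="|"):
        # Drop everything up to and including the margin marker on lines that
        # start with optional whitespace followed by the marker; keep other lines.
        out = []
        for line in text.splitlines(keepends=True):
            head, sep, tail = line.partition(marker)
            out.append(tail if sep and not head.strip() else line)
        return "".join(out)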

    stevendaprano commented 1 year ago

    How is stripmargin different from dedent?

    julianfortune commented 1 year ago

    @stevendaprano my apologies, I did not realize dedent took shared whitespace into account; it seems dedent completely fills the need, although it is a little clunky that it is a separate function rather than a method of str.

    stevendaprano commented 1 year ago

    The proposal is to make dedent a string method, and then allow interpreters to optimise it into a compile-time operation (perhaps in the peephole optimiser), not to use a function like this:

    import textwrap

    def func(x):
        textwrap.dedent("""blah blah blah
        multiple lines
        """)

    That won't work because it won't be seen as a docstring.

    This issue hasn't been touched for 18 months. It looks like it is languishing, possibly over the question of whether the method should implement textwrap.dedent or inspect.cleandoc behavior.