python / cpython

The Python programming language
https://www.python.org

Make str.join auto-convert inputs to strings. #87701

Closed rhettinger closed 3 years ago

rhettinger commented 3 years ago
BPO 43535
Nosy @rhettinger, @terryjreedy, @gpshead, @ericvsmith, @tiran, @gst, @serhiy-storchaka, @vedgar, @pablogsal, @tirkarthi, @isidentical, @EmilStenstrom, @jack1142, @kamilturek, @ajoino

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['interpreter-core', 'type-feature', '3.10']
title = 'Make str.join auto-convert inputs to strings.'
updated_at =
user = 'https://github.com/rhettinger'
```

bugs.python.org fields:

```python
activity =
actor = 'rhettinger'
assignee = 'none'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Interpreter Core']
creation =
creator = 'rhettinger'
dependencies = []
files = []
hgrepos = []
issue_num = 43535
keywords = []
message_count = 21.0
messages = ['388983', '388985', '388992', '388994', '389006', '389048', '389106', '389135', '389144', '389148', '389170', '389171', '389176', '389180', '389181', '389190', '389192', '389213', '389228', '389241', '389410']
nosy_count = 16.0
nosy_names = ['rhettinger', 'terry.reedy', 'gregory.p.smith', 'eric.smith', 'christian.heimes', 'mrabarnett', 'gstarck', 'serhiy.storchaka', 'veky', 'pablogsal', 'xtreak', 'BTaskaya', 'EmilStenstrom', 'jack1142', 'kamilturek', 'ajoino']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue43535'
versions = ['Python 3.10']
```

rhettinger commented 3 years ago

Rather than just erroring-out, it would be nice if str.join converted inputs to strings when needed.

Currently:

    data = [10, 20, 30, 40, 50]
    s = ', '.join(map(str, data))

Proposed:

    s = ', '.join(data)

That would simplify a common idiom. That is a nice win for beginners, and it makes code more readable.

The join() method is unfriendly in a number of ways. This would make it a bit nicer.

There is likely to be a performance win as well. The existing idiom with map() roughly runs like this:

 * Get iterator over: map(str, data)
 * Without length knowledge, build-up a list of strings
   periodically resizing and recopying data (1st pass)
 * Loop over the list strings to compute the combined size
   (2nd pass)
 * Allocate a buffer for the target size
 * Loop over the list strings (3rd pass), copying each
   into the buffer and wrap the result in a string object.
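
The passes listed above can be mimicked in pure Python, as a rough illustration only. The name `join_three_pass` is made up for this sketch, and a list of characters stands in for the C-level buffer; CPython's real implementation is written in C and differs in detail.

```python
def join_three_pass(sep, iterable):
    """Hypothetical pure-Python model of the passes listed above."""
    # 1st pass: materialize the iterator into a list, with no length hint.
    strings = list(iterable)
    for i, item in enumerate(strings):
        if not isinstance(item, str):
            raise TypeError(
                f"sequence item {i}: expected str instance, "
                f"{type(item).__name__} found"
            )

    # 2nd pass: compute the combined size, separators included.
    total = sum(len(s) for s in strings) + len(sep) * max(len(strings) - 1, 0)

    # Allocate a buffer for the target size.
    buf = [""] * total

    # 3rd pass: copy each string (and separator) into the buffer.
    pos = 0
    for i, s in enumerate(strings):
        if i:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(s)] = s
        pos += len(s)

    # Wrap the filled buffer in a string object.
    return "".join(buf)


data = [10, 20, 30, 40, 50]
result = join_three_pass(", ", map(str, data))
```

Passing `data` directly, without `map(str, ...)`, raises `TypeError` in this model, just as the current `str.join` does.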

But, it could run like this:

AFAICT, the proposal is mostly backwards compatible; the only change is that code that currently errors out will succeed.

For bytes.join() and bytearray.join(), the only auto-conversion that makes sense is from ints to bytes so that you could write:

    b' '.join(data)

instead of the current:

    b' '.join([bytes([x]) for x in data])

fe5a23f9-4d47-49f8-9fb5-d6fbad5d9e38 commented 3 years ago

I can't find it now, but I seem to remember making this same proposal (except the part for bytes) quite a few years ago, and you being the most vocal opponent. What changed? Of course, I'm still for it.

(Your second list has an extra fourth item. But it's clear what you wanted to say.)

ericvsmith commented 3 years ago

I'm +0.5. Every time this bites me, I apply the same solution, so you're probably right that str.join should just do the work itself. And it's no doubt more performant that way, anyway.

And I've probably got some code that's just waiting for the current behavior to raise an error on me if passed the wrong inputs, even if I'd prefer it to succeed.

I should be +1, but I have a nagging "refuse to guess" feeling. But it doesn't seem like much of a guess: there's no other logical thing I could mean by this code. I'm unlikely to want it to raise an exception, or do any other conversion to a str.

serhiy-storchaka commented 3 years ago

It was proposed by newbies several times before. It was rejected because it would let errors go unnoticed. Python is dynamically but strongly typed, and that is its advantage.

I am -1.

fe5a23f9-4d47-49f8-9fb5-d6fbad5d9e38 commented 3 years ago

Does strong typing mean you should write

    if bool(condition): ...

or

    for element in iter(sequence): ...

or (more similar to this)

    my_set.symmetric_difference_update(set(some_iterable))

?

As Eric has said, if there's only one possible thing you could have meant, "strong typing" is just bureaucracy.

rhettinger commented 3 years ago

> What changed?

It comes up almost every week that I teach a Python course. Eventually, I've come to see the light :-)

Also, I worked through the steps and found an efficiency gain for new code with no detriment to existing code.

Lastly, I used to worry a lot about join() also being defined for bytes() and bytearray(). But after working through the use cases, I can see that we get an even bigger win. People seem to have a hard time figuring out how to convert a single integer to a byte. The expression "bytes([x])" isn't at all intuitive; it doesn't look nice in a list comprehension, and is incomprehensible when used with map() and lambda.
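
A short sketch of the pitfall described above, using made-up sample values: `bytes(x)` with an integer argument produces `x` zero bytes, not a single byte with value `x`, which is why the `bytes([x])` spelling is needed.

```python
x = 65

zero_filled = bytes(x)                # 65 zero bytes, not b'A'
single = bytes([x])                   # b'A'
also_single = x.to_bytes(1, "big")    # b'A', a more explicit spelling

data = [72, 105]

# The current idiom, as a list comprehension ...
joined = b" ".join([bytes([n]) for n in data])

# ... and the map()/lambda spelling called incomprehensible above.
joined_with_map = b" ".join(map(lambda n: bytes([n]), data))
```

Both spellings give `b'H i'` here, but neither reads as naturally as `', '.join(map(str, data))` does for strings.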

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 3 years ago

I'm also -1, for the same reason as Serhiy gave. However, if it was opt-in, then I'd be OK with it.

terryjreedy commented 3 years ago

I am sympathetic to the 'hiding bugs' argument in general, but what bugs would this proposal hide? What bugs does print hide by auto-converting non-strings to strings?

I recently had the same thought as Raymond's: "it would be nice if str.join converted inputs to strings when needed."

I have always known that print() is slower in IDLE than in a console. A recent SO question https://stackoverflow.com/questions/66286367/why-is-my-function-faster-than-pythons-print-function-in-idle showed that it could be 20X slower and asked why? It turns out that while

    print(*values, sep=sep, end=end, file=file)
    # is equivalent to
    file.write(sep.join(map(str, values)) + end)

print must be implemented as the C equivalent of something like

    first = True
    for val in values:
        if first:
            first = False
        else:
            file.write(sep)
        file.write(str(val))
    file.write(end)

When sys.stdout is a screen buffer, the multiple writes effectively implement a join. But in IDLE, each write(s) results in a separate socket.send(s.encode()) and socket.receive().decode() + text.insert(s, tag). I discovered that removing nearly all the overhead from the very slow example with sep.join and end.join made the example only trivially slower on IDLE (5%) than the standard REPL. In bpo-43283 I added the option of speedups using .join and .format to the IDLE doc, but this workaround would be much more usable if map(str, x) were not needed.
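
The difference in write calls can be observed with a small file-like object. `CountingWriter` is a hypothetical helper invented for this sketch; it is not part of IDLE or the standard library.

```python
class CountingWriter:
    """Hypothetical file-like object that records every write() call."""

    def __init__(self):
        self.chunks = []

    def write(self, s):
        self.chunks.append(s)
        return len(s)


values = [1, 2.5, "three", None]

# print() issues a separate write for each value, separator, and end.
per_item = CountingWriter()
print(*values, sep=", ", end="\n", file=per_item)

# Joining first produces the same text with a single write.
joined = CountingWriter()
joined.write(", ".join(map(str, values)) + "\n")

same_text = "".join(per_item.chunks) == "".join(joined.chunks)
```

When each write is expensive, as described for IDLE above, collapsing many writes into one is where the speedup comes from.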

fe5a23f9-4d47-49f8-9fb5-d6fbad5d9e38 commented 3 years ago

Matthew: can you then answer the same question I asked Serhiy?

The example usually given when advocating strong typing is whether 2 + '3' should be '23' or 5. Our uneasiness with it doesn't stem from coercions between int and str, but from the fact that + has two distinct meanings.

Of course, binary operators are always like that, even if it's not obvious, since there's always a tension created by difference of types of the left and right operand. Even if it's obvious that 2 - '3' should coerce the second argument to int since str doesn't define -, this can't be a general rule because e.g. set does (what about 2 - {3}?).

But method calls (and many protocols) are _not_ of that kind. As I said above, my_set ^ some_list makes us uneasy (even though list doesn't implement ^), but my_set.symmetric_difference(some_list) doesn't, simply because there is no ambiguity: there is only one thing we could have meant.

The same can be said about "for x in not_an_iterator", or "if not_a_bool".
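
A minimal sketch of the operator/method distinction drawn above, with sample values chosen for illustration: the `^` operator refuses a list operand, while the named method accepts any iterable.

```python
my_set = {1, 2, 3}
some_list = [3, 4]

# The binary operator requires both operands to be sets.
try:
    my_set ^ some_list
except TypeError as exc:
    operator_error = exc
else:
    operator_error = None

# The method accepts any iterable and converts it implicitly.
method_result = my_set.symmetric_difference(some_list)
```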

serhiy-storchaka commented 3 years ago

Vedran, that is not what strong typing means. Strong typing means that '2'+3 is an error instead of '23' or 5. str.join() expects an iterable of strings. If some item is not a string, it is a sign of a programming error. I prefer to get an exception rather than a silent conversion of an unexpected value to the string 'None', '[]' or '<Foo object at 0x12345678>'.

So if you want such a feature, it should be a separate method or function.

But there is another consideration. Of 721 uses of the join() method (excluding os.path.join()) in the stdlib, only 10 need forceful stringification with map(str, ...). For tests it is 842 to 20, and for Doc/venv/ it is 1388 to 30. I am sure the same ratio holds for any other large volume of code. So that feature would actually see very little use: about 1-2% of the uses of str.join().
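
For reference, the percentages implied by the counts quoted above can be worked out directly. The dictionary keys are labels chosen for this sketch.

```python
# (total join() uses, uses needing map(str, ...)), as quoted above
counts = {
    "stdlib": (721, 10),
    "tests": (842, 20),
    "Doc/venv/": (1388, 30),
}

percentages = {
    name: round(100 * needing_str / total, 1)
    for name, (total, needing_str) in counts.items()
}
```

This gives about 1.4%, 2.4% and 2.2% respectively, in line with the rough 1-2% figure.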

Especially for Raymond: map(str, ...) is a good opportunity to teach about iterators and to introduce itertools.

rhettinger commented 3 years ago

> Of 721 uses of the join() method (excluding os.path.join()) in the stdlib, only 10 need forceful stringification with map(str, ...)

Thanks for looking at real-world code. I'm surprised that the standard library stats aren't representative of my experience, perhaps because I tend to write numeric code and do more output formatting than is used internally.

rhettinger commented 3 years ago

FWIW, I'm running a user poll on Twitter and have asked people to state their rationale:

https://twitter.com/raymondh/status/1373315362062626823

Take it with a grain of salt. Poll totals don't reflect how much thought each person put into their vote.

171aa4e6-57d0-49a1-89f2-40d333078274 commented 3 years ago

Since the proposal is fully backwards compatible I don’t think preferring the old version is a reason against this nicer API. After all, people that like the current version can continue using it as they do today.

Teaching Python to beginners is a great way to find the warts of a language (I’ve done it too). In the beginning people struggle with arrays and if-blocks, and having to go into how map and the str constructor work together to get a comma separated list of ints is just too much. Beginners are an important group of programmers that this proposal will clearly benefit.

I’m sure there will be some “None”-strings that will slip through this, but I think the upside far outweighs the downside in this case.

Big +1 from me.

terryjreedy commented 3 years ago

I read all the responses as of this timestamp. They left me more persuaded that joining objects with a string (or bytes) is explicit enough that the objects *must* be coerced to strings.

A problem with coercion in "1 + '2'" is that there is no 'must'. The desired answer could be either 3 or '12', and neither can be converted to the other, so don't guess.

The desired answer for "1 + .5" is much more obviously 1.5 rather than either 1 or 2, plus the former avoids information loss and leaves the option available of rounding or converting however one wants.

One tweet answered my question about masking a bug. Suppose 'words' is intended to be an iterable of strings.

    >>> words = ['This', 'is', 'a', 'list', 'of', 7, 'words']  # Buggy
    >>> print(*words)  # Auto-coercion masks the bug.
    This is a list of 7 words
    >>> '-'.join(words)  # Current .join does not.
    Traceback (most recent call last):
      File "<pyshell#8>", line 1, in <module>
        '-'.join(words)
    TypeError: sequence item 5: expected str instance, int found

With the proposed change, detection of the bug is delayed, as is already the case with print. How much do we care about this possibility? One possible answer is to add a new method, such as 'joins' or builtin function 'join'.

Given the variety of opinions, I think a PEP and SC decision would be needed.

gpshead commented 3 years ago

-10. I agree with Serhiy. Automatic type conversion is rarely a feature. It leads to silent bugs when people pass the wrong things. Be explicit.

We are intentionally not one of those "everything is really a string" languages like Perl or JavaScript.

This core API behavior change is big enough to need a PEP and steering council approval.

171aa4e6-57d0-49a1-89f2-40d333078274 commented 3 years ago

Terry, Gregory: The suggestion is not to change what 1 + "2" does; I fully agree that it behaves as it should. The suggestion is to change what ",".join([1, "2"]) does. There's no doubt that the intended result is "1,2". That's why it's possible to coerce.

About the example with a list with mixed types: If the reason that example is buggy is "this list should only have strings", a better way to enforce that is to add type annotations that enforce it.

gpshead commented 3 years ago

There is a lot of doubt. That should clearly raise an exception because this function is intended to only operate on strings.

Trivial types examples like that gloss over the actual problem.

    data_from_some_computations = [b"foo", b"bar"]  # probably returned by a function

    # ... later on, some other place in the code ...

    colon_sep_data = ":".join(data_from_some_computations)

I guarantee you that 99.999% of the time everyone wants an exception there instead of their colon_sep_data containing "b'foo':b'bar'".

Implicit conversions always lead to hard to pin down bugs. An exception raised at the source of the problem is very easy to debug in comparison.
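
To make the concern concrete, here is what the two behaviors look like today. The explicit `map(str, ...)` call stands in for the proposed implicit conversion; that equivalence is an assumption about how the proposal would behave, not something specified in this thread.

```python
data_from_some_computations = [b"foo", b"bar"]

# Current behavior: the mistake surfaces immediately as a TypeError.
try:
    ":".join(data_from_some_computations)
except TypeError as exc:
    current_behavior = str(exc)

# With stringification, the repr of each bytes object leaks into the result.
coerced = ":".join(map(str, data_from_some_computations))
```

The coerced result is `"b'foo':b'bar'"` rather than `"foo:bar"`. Running Python with the `-b` option makes `str()` on a bytes object emit a `BytesWarning`, which can help find such conversions.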

fe5a23f9-4d47-49f8-9fb5-d6fbad5d9e38 commented 3 years ago

Yes, I know what strong typing means, and can you please read again what I've written? It was exactly about "In the face of ambiguity, refuse the temptation to guess.", because binary operators are inherently ambiguous when given differently typed operands. Methods are not: the method _name_ itself is resolved according to self's type, it seems obvious to me that the arguments should too. Otherwise "explicit fanatics" would probably want to write list.append(things, more) instead of things.append(more).

The only reason we're having this conversation is that when it was introduced, join was a function, not a method. If it were a method from the start, we would've never even questioned its stringification of the iterable elements (and of course it would do that from the start, cf. set or dict update methods).

Gregory: yes, bytes elements are a problem, but that's a completely orthogonal problem (probably best left for linters). The easiest way to see it: do you object to (the current behavior of)

    >>> s = {2, 7}
    >>> s.update(b'Veky')

? :-)

5e07cdc2-1da9-4eeb-a0e4-66ac79d2f40f commented 3 years ago

FWIW -1 from me too.

That should be solved by creating a new function IMO :

    def joinstr(sep, *seq):
        return sep.join(str(i) for i in seq)

tiran commented 3 years ago

I'm also -1 and would prefer something like Grégory's proposal instead.

39a33bf4-ab68-4c66-b7bd-a84a88baa036 commented 3 years ago

For what my opinion is worth, I agree with Grégory's suggestion because the ',' part of ','.join(...) is almost as unintuitive as the problems Raymond's suggestions are trying to fix.

I was going to suggest a builtin to work on both str and bytes, like join(sep=None, strtype=str, *strings) but that interface looks pretty bad...

I think joinstr/joinbytes according to Grégory's suggestion (perhaps as classmethods of str/bytes?) would make the most sense.
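
One possible shape for such a pair of helpers, written as plain functions. Both are hypothetical sketches: `joinbytes` does not exist anywhere in this thread beyond its name, and this `joinstr` takes a single iterable rather than the `*seq` signature suggested earlier.

```python
def joinstr(sep, iterable):
    """Hypothetical helper: join items after converting each with str()."""
    return sep.join(map(str, iterable))


def joinbytes(sep, iterable):
    """Hypothetical helper: join items, turning ints into single bytes."""
    return sep.join(
        item if isinstance(item, (bytes, bytearray)) else bytes([item])
        for item in iterable
    )


text = joinstr(", ", [10, 20, 30])
raw = joinbytes(b" ", [72, b"ey", 105])
```

Because the conversion is opt-in, `str.join` and `bytes.join` keep raising `TypeError` for unexpected item types.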

terryjreedy commented 1 year ago

I am now not as convinced by my own pro arguments above as I was then. print is exceptional in that it is meant to be quick and possibly dirty output, perhaps for debugging. str.join is a regular method that calculates a string and I see the point about strong typing better than I did before. https://github.com/python/cpython/issues/87701#issuecomment-1093906804 answered my request for a bug example.