Clarify status of O(1) indexing semantics of str objects

ncoghlan commented 10 years ago

BPO	21667
Nosy	@gvanrossum, @ncoghlan, @pitrou, @vstinner, @Rosuav, @jimjjewett, @serhiy-storchaka
Files	issue21667_clarify_str_specification.rst: Add indexing note, remove inaccurate references to "characters"

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-feature', 'docs'] title = 'Clarify status of O(1) indexing semantics of str objects' updated_at = user = 'https://github.com/ncoghlan' ``` bugs.python.org fields: ```python activity = actor = 'Jim.Jewett' assignee = 'docs@python' closed = True closed_date = closer = 'ncoghlan' components = ['Documentation'] creation = creator = 'ncoghlan' dependencies = [] files = ['35489'] hgrepos = [] issue_num = 21667 keywords = [] message_count = 20.0 messages = ['219793', '219794', '219795', '219796', '219797', '219798', '219799', '219800', '219801', '219802', '219805', '219809', '219810', '219811', '219812', '219824', '219936', '219937', '220208', '220211'] nosy_count = 9.0 nosy_names = ['gvanrossum', 'ncoghlan', 'pitrou', 'vstinner', 'docs@python', 'python-dev', 'Rosuav', 'Jim.Jewett', 'serhiy.storchaka'] pr_nums = [] priority = 'normal' resolution = 'later' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue21667' versions = ['Python 3.4', 'Python 3.5'] ```

ncoghlan commented 10 years ago

Based on the recent python-dev thread, I propose the following "CPython implementation detail" note in the "Strings" entry of https://docs.python.org/3/reference/datamodel.html#objects-values-and-types

"CPython currently guarantees O(1) access to arbitrary code points when indexing and slicing a string. Python implementations are required to index and slice strings as arrays of code points, but are not required to guarantee O(1) access to arbitrary locations within the string. This allows implementations to use variable width encodings for their internal string representation."

vstinner commented 10 years ago

str[a:b] returns a substring (characters), not an array of code points (numbers).

ncoghlan commented 10 years ago

Guido, I think we need your call on whether or not to add a note about string indexing algorithmic complexity to the language reference, and to approve the exact wording of such a note (my proposed wording is in my initial comment on this issue).

ncoghlan commented 10 years ago

No, Python doesn't expose Unicode characters in its data model at all, except in those cases where a code point happens to correspond directly with a character. A length 1 str instance represents a Unicode code point, not a Unicode character.

ncoghlan commented 10 years ago

Although, you're right, that section of the data model docs misuses the word "character" to mean something other than what it means in the Unicode spec :(

vstinner commented 10 years ago

"Python implementations are required to ..."

By the way, Python \< 3.3 doesn't implement this requirement :-)

ncoghlan commented 10 years ago

Saying that ord() and chr() switch between characters and code points is just plain wrong, since characters may be represented as multiple code points.

We may also want to explicitly note that the Unicode normalisation is implementation dependendent, and that CPython doesn't normalise implicitly except where specifically noted (i.e. during lexical analysis).

ncoghlan commented 10 years ago

Right, narrow builds have long been broken - that's a large part of why this is now the requirement :)

ncoghlan commented 10 years ago

Patch attached that also addresses the characters vs code points confusion.

ncoghlan commented 10 years ago

I ducked the Unicode normalisation question for now, since that's a *different* can of worms :)

pitrou commented 10 years ago

Two things:

I don't think it's very helpful to use the term "code point" without explaining or introducing it ("character" at least can be understood intuitively)
The mention of slicing is ambiguous: is slicing suppoded to be O(1)? how is indexing related to slicing?

ncoghlan commented 10 years ago

If someone doesn't understand what "Unicode code point" means, that's going to be the least of their problems when it comes to implementing a conformant Python implementation. We could link to http://unicode.org/glossary/#code_point, but that doesn't really add much beyond "value from 0 to 0x10FFFF". If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc).

The main advantage of using the more formal "code point" over the informal "character" is that it discourages people from assuming they know what they are (with the usual mistaken assumption being that Unicode code points correspond directly to glyphs the way ASCII and Extended ASCII printable characters correspond to their glyphs). The rest of the paragraph then provides the mechanical details of the meaningful interpretations of them in Python (as length 1 strings and as numbers in a particular range) and the operations for translating between those two formats (chr and ord).

Fair point about the slicing - it may be better to just talk about indexing.

pitrou commented 10 years ago

Not sure what implementing a conformant Python implementation has to do with this; the language specification should be readable by any interested programmers, IMO.

If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc).

That makes all the less useful to use the "proper term" instead of the more intuitive alternative :-)

serhiy-storchaka commented 10 years ago

Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc.

vstinner commented 10 years ago

Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc.

That would be awesome :-) Please open a new issue for that. We can use for example these data: https://wiki.python.org/moin/TimeComplexity

Such data should be close to the definition of the type, or even close to the method. Maybe directly in the documentation of each method?

gvanrossum commented 10 years ago

I don't want the O(1) property explicitly denounced in the reference manual. It's fine if the manual is silent on this -- maybe someone can prove that it isn't a problem based on benchmarks of an alternate implementation, but until then, I'm skeptical -- after all we have a bunch of important APIs (indexing, slicing, find()/index(), the re module) that use integer indexes, and some important algorithms/patterns are based off this behavior.

E.g. searching a string for something, returning the position where it's found, and then continuing the search from that position. Even if the basic search uses find() or regex matching, there's still a position being returned and accepted, and if it took O(N) time to find that position in the representation, the whole algorithm could degenerate from O(N) to O(N**2).

I am fine with the changes related to code points.

For the pedants amongst us, surrogates are also code points. A surrogate pair is two code points that encode a single code point. Fortunately we don't have to deal with those any more outside codecs.

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 10 years ago

New changeset 6ffb6909c439 by Nick Coghlan in branch '3.4': Issue bpo-21667: Clarify string data model description http://hg.python.org/cpython/rev/6ffb6909c439

New changeset 7c120e77d6f7 by Nick Coghlan in branch 'default': Merge issue bpo-21667 from 3.4 http://hg.python.org/cpython/rev/7c120e77d6f7

ncoghlan commented 10 years ago

I've merged the character->code point clarifications, without the implementation detail section.

For the time being, that leaves "doesn't provide O(1) indexing of strings" as the kind of discrepancy that often makes an appearance in "differences from the CPython reference implementation" section that many alternative implementations include.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 10 years ago

I think the new wording is an improvement, but keeping the changes minimal left it in an awkward in-between state.

Proposal:

A string is a sequence of Unicode code points. Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``. The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (as :class:`bytes` in the given text encoding) suitable for transport and communication with non-Python utilities. :meth:`bytes.decode` decodes such byte sequences into text strings.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 10 years ago

And even my rewrite showed path dependency; a slight further improvement is to re-order encoding ahead of bytes. I also added a paragraph that I hope answers the speed issue.

Proposal:

A string is a sequence of Unicode code points. Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``. The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (in the given text encoding) as a :class:`bytes` object suitable for transport and communication with non-Python utilities. :meth:`bytes.decode` decodes such byte sequences into text strings.

.. impl-detail:: There are no methods exposing the internal representation of code points within a string. While the C-API provides some additional constraints on CPython, other implementations are free to use any representation that treats code points (as opposed to either code units or some normalized form of characters) as the unit of measure.

python / cpython

Clarify status of O(1) indexing semantics of str objects #65866