Decide how to handle str/unicode

python / typing

Python static typing home. Hosts the documentation and a user help forum.

https://typing.readthedocs.io/

Decide how to handle str/unicode #208

Closed gvanrossum closed 4 years ago

gvanrossum commented 8 years ago

There's a long discussion on this topic in the mypy tracker: https://github.com/python/mypy/issues/1141

I'm surfacing it here because I can never remember whether that discussion is here, or in the typeshed repo, or in the mypy tracker.

(Adding str, bytes, unicode, Text, basestring as additional search keywords.)

gvanrossum commented 8 years ago

A new proposal: use the existing triad bytes-str-unicode.

gvanrossum commented 8 years ago

Let me try to explain the new proposal with more care.

Gradual Byting

I am most interested in solving this issue for straddling code; my assumption is that most of the interest in type annotations for Python 2 has to do with that. (This is the case at Dropbox, and everyone who has enough Python 2 code to want to annotate it probably should be thinking about porting to Python 3 anyway. :-)

In the proposal, str has a position similar to the one that Any has in the type system as a whole -- i.e. assuming we have three variables b, s, t, with respective types bytes, str, (typing.)Text, then: b is compatible with s, s is compatible with b and t, t is compatible with s, but b and t are not compatible with each other. IOW the relationships between the three types are not expressible using subtyping relationships only. (It's actually a little more complicated, I'll spell out the actual rules I'm proposing below.)

Before we get to that, I'd like to discuss the use cases for the proposal. In straddling code we often have the problem that some Python 2 code is vague about whether it works on bytes or text or both. The corresponding Python 3 code may work only on bytes, or only on Text, or on both as long as they are used consistently (i.e. AnyStr), or possibly on Union[bytes, Text]. To find such cases, maybe we could just type-check the code twice, once in Python 2 mode and once in Python 3 mode. If it type-checks cleanly in both modes, it should run correctly in both Python versions too (insofar as type-checking cleanly can ever say anything about running without errors :-).

However, when we have a large code base, it is usually a struggle to make it type-check cleanly even in one mode, and typically we start with Python 2. So if we have code that runs correctly using Python 2 and type-checks cleanly in Python 2 mode, and we want to port it to Python 3, requiring it to type-check cleanly in Python 3 mode is setting the bar very high (as high as expecting it to run correctly using Python 3).

Therefore I am proposing a gradual approach. Similar to the way we start by type-checking an untyped program (which by definition should type-check cleanly, since all types are Any -- even though in practice there are some holes in that theory), I propose to start with a Python 2 program that uses str for all string types, and type-checks cleanly that way, and gradually change the program to replace each occurrence of str with either bytes or Text (or one of the rarer alternatives like AnyStr or Union[str, Text]). That way we can gradually tighten the type annotations, keeping the code type-check clean as we go.

Just like, when I define a function with def f(x: Any), I can call f(1), f('') and f([0]) and it's all the same to the type checker, and in f's body I can use x+1, x() or x[0], the idea here is that a function defined with def g(s: str) can be called as g(''), g(b'') or g(u''), and in g's body I can use s+b'xxx', s+'yyy' or s+u'zzz'.

The actual details are a bit subtle. I'm proposing (in builtins; recall that this is for Python 2 only):

class bytes with (mostly) the methods currently present on str, with arguments of type bytes and returning bytes (as appropriate).
class str(bytes) with overloaded methods that return str if the other argument is a str, returning bytes for bytes (more or less).
class unicode unchanged from its current definition, keeping typing.Text as a pure alias for it.

The subclassing relationship between bytes and str makes str acceptable where bytes is required. In mypy we can add a "promotion" from bytes to str to enable compatibility in the other direction. Mypy (in Python 2 mode) has an existing promotion from str to unicode that accepts str where unicode is required. I don't actually propose to make unicode acceptable where str is required (this is a deviation from the "str is like Any" idea). Because promotions are not transitive (unlike subclassing), bytes is not acceptable where unicode is required, nor the other way around.
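The rules in the paragraph above can be sketched as a toy model. This is only an illustration under stated assumptions, not mypy's implementation: the names `SUBCLASS_OF`, `PROMOTIONS` and `is_acceptable` are hypothetical, and types are represented as plain strings.

```python
# Toy model of the proposed Python 2 rules; not how mypy implements them.
SUBCLASS_OF = {"str": "bytes"}                     # class str(bytes)
PROMOTIONS = {"bytes": "str", "str": "unicode"}    # single-step promotions

def is_acceptable(actual, expected):
    """Return True if a value of type `actual` may be used where `expected` is required."""
    if actual == expected:
        return True
    if SUBCLASS_OF.get(actual) == expected:
        return True
    # A promotion is applied at most once; promotions are never chained,
    # so bytes -> str -> unicode does not make bytes acceptable as unicode.
    return PROMOTIONS.get(actual) == expected
```

Because the promotion lookup is a single step, `bytes` never reaches `unicode`, and nothing makes `unicode` acceptable where `str` or `bytes` is required.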

There is still a lot more to explain. I want to show in detail what happens in various cases, and why I think that is right. I need to explain the concept of "text in spirit" to motivate why I am okay with the difference between these rules and the actual workings of Python 2. I want to go over some examples involving container types (since that's where the "AsciiBytes" proposal went astray). And I need to give some guidelines for stub authors and changes to existing stubs. (E.g. I think that Python 2 getattr() may have to be changed to accept unicode.)

[But that will have to wait until tomorrow.]

JukkaL commented 8 years ago

t [Text] is compatible with s [str]

This contradicts this part of the proposal:

I don't actually propose to make unicode acceptable where str is required

Also, the example with def g(s: str) suggests that it can be called as g(u''), with a unicode argument. This should be clarified and made consistent across the proposal, as otherwise things get confusing.

Because promotions are not transitive (unlike subclassing)

Mypy actually considers the promotions int -> float and float -> complex transitive, and int can be promoted to complex. We could change the language to something like "these promotions are not transitive" or we could perhaps treat the int -> complex promotion as a separate promotion.

Other notes:

I'd assume that str methods would return unicode if the other argument is unicode. Currently this is left unspecified. It could be useful to have a table of the result types of s1 + s2 for all combinations of str, bytes and unicode (9 cases).
AnyStr would have to range over str, bytes and unicode. This means that we may want to give different meanings to IO[str] and IO[bytes], for example.
List[Any] is compatible with List[int] and vice versa in mypy (though PEP 484/483 seems to be silent on this), but should List[str] be compatible with List[bytes], and vice versa? I'd argue that List[str] and List[bytes] should be incompatible, similar to how List[int] and List[float] are incompatible, but I don't have a strong opinion on this.

gvanrossum commented 8 years ago

This contradicts with this part of the proposal

That's why I wrote "It's actually a little more complicated". I am having a hard time summarizing the proposal briefly and also writing it up in detail without contradictions between the two. In case of conflict the detailed version should win and the summary should be seen as a hint at most. Maybe we'll have to use more vagueness in the summary to avoid confusing experts who know the terminology.

the example with def g(s: str) suggests that it can be called as g(u'')

More imprecision in the summary. :-( It really can't, unless g() is implemented in C in a certain way, e.g. getattr(x, u"foo"). But for a Python function this is wrong. Actually, for a Python function, the other way around is also wrong. But nevertheless the promotion allows it. Just like the promotion from int to float is technically wrong in Python 2, as shown here:

def f(a):
    # type: (float) -> float
    return a/2
assert f(3) == 1.5  # Fails, it returns 1
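For contrast, here is a small Python 3 sketch of the same function. It assumes Python 3 semantics, where `/` is true division even for two ints, and uses `//` to reproduce the Python 2 result. The names `half` and `half_floor` are made up for this illustration.

```python
def half(a: float) -> float:
    # In Python 3, / is true division even when both operands are ints.
    return a / 2

def half_floor(a: int) -> int:
    # // gives the result that Python 2's / produced for two ints.
    return a // 2

assert half(3) == 1.5
assert half_floor(3) == 1
```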

I will try to spec out the true compatibility as a bunch of tables.

str methods would return unicode if the other argument is unicode

Yes. There are already some overloads like that. The bigger difference will be that these overloads won't exist for bytes+unicode.

AnyStr would have to range over str, bytes and unicode. This means that we may want to give different meanings to IO[str] and IO[bytes], for example.

Yes. In fact IO[unicode] would only be obtainable by calling io.open().

should List[str] be compatible with List[bytes], and vice versa?

I think not (so we agree here). This will lead to some of the same issues as I ran into when trying to implement the AsciiBytes idea, but the issues will be much less common.

[In the next installment I will try to construct the tables of compatibilities. I will also talk about the concept of "text in spirit".]

gvanrossum commented 8 years ago

Text in Spirit

(This is still pretty messy. But I promised I would explain the concept.)

I'll sometimes say that some variable in Python 2 is "text in spirit". For example, in getattr(x, name), name is "text in spirit". In this case I mean two things with this: first, that in Python 3 the name argument to getattr() has type str, not bytes. Second, that even in Python 2, the name is an identifier, and even though you can write getattr(x, '\xff\x01'), that would be useless.

Note that text encoded as bytes is not "text in spirit". The requirement is that the corresponding Python 3 API uses str, and the Python 2 API supports bytes or unicode, though not necessarily all bytes or all unicode -- e.g. getattr() only accepts a unicode name if it contains only ASCII characters, even though it doesn't make that requirement when the argument is str.

Basically the point of "text in spirit" is to make the argument that an API should not use bytes even though it may accept non-ASCII str instances. But I have to do more exploration before I decide how important this concept is.

JukkaL commented 8 years ago

Here's another table that could be useful -- if we define def f(s: s1) -> None: ..., is a call with an argument of type s2 valid, when s1 and s2 range over str, bytes and unicode.

gvanrossum commented 8 years ago

Compatibility Tables

[UPDATE: made function calls primary, per Jukka's suggestion below]

Let's start by stating the compatibility between expressions of types bytes, str, Text and functions with arguments of those types. Each row corresponds to a declared argument type; each column corresponds to the type of an expression passed in for that argument.

Argument type	xb: bytes	xs: str	xt: Text
arg_b: bytes	Yes (same)	Yes (str <: bytes)	No
arg_s: str	Yes (promotion)	Yes (same)	???
arg_t: Text	No	Yes (promotion)	Yes (same)

The above table also describes compatibility of expressions with variables (assuming the type checker, like mypy currently, doesn't just change the type of the variable). Note that I'm not decided yet whether to allow passing a Text value to a str argument, but I'm inclined to put "No" there, even though that breaks the illusion of "str as the Any of string types" (IOW gradual byting :-).

Next let's describe the return type for expressions of the form x + y where x and y can each be of type bytes, str, or Text.

x	yb: bytes	ys: str	yt: Text
xb: bytes	bytes	bytes	ERROR
xs: str	bytes	str	Text
xt: Text	ERROR	Text	Text

Note that this table is more regular and I'm pretty confident about it.
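The addition table can be written down as a lookup. This is a sketch only: `ADD_RESULT` and `add_result_type` are hypothetical names, types are plain strings, and `None` stands for the ERROR cells.

```python
# Result type of x + y under the proposed Python 2 rules (None means ERROR).
ADD_RESULT = {
    ("bytes", "bytes"): "bytes",
    ("bytes", "str"): "bytes",
    ("bytes", "Text"): None,
    ("str", "bytes"): "bytes",
    ("str", "str"): "str",
    ("str", "Text"): "Text",
    ("Text", "bytes"): None,
    ("Text", "str"): "Text",
    ("Text", "Text"): "Text",
}

def add_result_type(left, right):
    """Return the type of left + right, or raise TypeError for an ERROR cell."""
    result = ADD_RESULT[(left, right)]
    if result is None:
        raise TypeError("cannot add %s and %s" % (left, right))
    return result
```

The table is symmetric, and str acts as the neutral element: adding str to another type yields that other type.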

gvanrossum commented 8 years ago

Irregularities

`encode()` and `decode()`

For bizarre reasons, in Python 2 both str and unicode support both encode() and decode(). This makes no sense, e.g. u'abc'.decode('utf8') is equivalent to u'abc'.encode('ascii').decode('utf8'), and 'abc'.encode('utf8') really means 'abc'.decode('ascii').encode('utf8').

I propose to rationalize this to the extent possible, as follows:

bytes only supports .decode(), and it returns unicode
unicode only supports .encode(), and it returns bytes (not str!)
str supports .encode(), returning bytes, and .decode(), returning unicode

This would mean complete removal of unicode.decode() from the stubs, since it basically always means some terrible misunderstanding happened. For variables declared as bytes, it would likewise remove the encode() method, whose use would point to a similar (but opposite) misunderstanding. For str we remain generous (since using str means the code probably hasn't received enough attention from the straddling police).
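For comparison, the bytes and unicode rows of this rationalized split match what Python 3 enforces at run time. The check below assumes it is run under Python 3, where `str` is the text type.

```python
# Python 3: bytes can only be decoded, text can only be encoded.
assert hasattr(bytes, "decode") and not hasattr(bytes, "encode")
assert hasattr(str, "encode") and not hasattr(str, "decode")

# Round trip: text -> bytes -> text.
encoded = u"abc".encode("utf8")
assert isinstance(encoded, bytes)
assert encoded.decode("utf8") == u"abc"
```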

`str()` and `repr()`

The return types of bytes.__str__() and bytes.__repr__() are still str, because that's how they are constrained by object. (FWIW these are examples of methods returning "text in spirit" strings.)

JukkaL commented 8 years ago

A reason we might prefer to use a call instead of an assignment as a basis for the table is that some type checkers like to infer a new type from assignment, and thus arbitrary assignments are considered correct -- they just redefine the type of a variable.

gvanrossum commented 8 years ago

OK, edited the text.

JukkaL commented 8 years ago

If we don't do the Text -> str promotion (which seems reasonable), then the original "gradual byting" story may need tweaking, as the first step would be to annotate with str and Text only (not just str, because unicode literals wouldn't be compatible with it), and the gradual byting migration would migrate some str types to bytes (or maybe unicode). Also, the gradual migration may involve changing some '' literals to b'' literals.

Example first phase annotation where we'd need Text:

def utf8_len(x: Text) -> int:
    return len(x.encode('utf8'))

utf8_len(u'\u1234')

Here we'd need to use Text unless we include the Text -> str promotion.

vlasovskikh commented 8 years ago

@gvanrossum @JukkaL I would like to join your discussion.

Here is a summary of the current ASCII types and gradual byting proposals as I understand them.

Rationale

(This piece is here to make sure we’re solving the same problem.)

The changes in text and binary data handling in Python 3 are one of the major reasons people cannot run their Python 2 code on Python 3. These changes were:

Disabled implicit str to unicode and vice versa conversions using the ASCII encoding
Some missing methods or methods with different semantics for the text and binary classes

The most viable approach to porting to Python 3 is via porting to Python 2+3.

Therefore we need a way to make this transition in stages: from a Python 2 program with implicit conversions, to a Python 2 program with fewer implicit conversions, to a Python 2+3 program that runs on both versions but still contains some implicit conversions, and finally to a Python 3 program.

ASCII Types

The idea is to introduce two new types, one new type compatibility rule and special type inference rules for binary and text literals.

The new types are:

class typing.ASCIIBytes(bytes): ...
class typing.ASCIIText(typing.Text): ...

The ASCIIBytes type is compatible with ASCIIText for Python 2.

If a text or binary literal contains only ASCII characters then type checkers should infer the corresponding ASCII types instead of regular text / binary types.
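The literal inference rule can be sketched as follows. This is an illustration only: `infer_literal_type` is a hypothetical helper, the ASCII types were only proposed and are not part of `typing`, and the code runs on Python 3 values (so "text" here means Python 3 `str`).

```python
def infer_literal_type(value):
    """Return the name of the type the ASCII-types proposal would infer for a literal."""
    if isinstance(value, bytes):
        # Iterating over bytes in Python 3 yields ints in range(256).
        return "ASCIIBytes" if all(b < 128 for b in value) else "bytes"
    if isinstance(value, str):
        return "ASCIIText" if all(ord(c) < 128 for c in value) else "Text"
    raise TypeError("not a text or binary literal: %r" % (value,))
```

For example, `infer_literal_type(b"abc")` gives `"ASCIIBytes"`, while `infer_literal_type(u"\u1234")` gives `"Text"`.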

Pros

People can be very precise about their Python 2 types and implicit conversions. This precision is much needed for forbidding implicit conversions eventually while porting from Python 2 to Python 2+3 to Python 3.

Cons

List[ASCIIBytes] is not compatible with List[str], which causes lots of errors according to Guido’s experiments. As a workaround, we might make ASCIIBytes compatible with any TypeVar that is constrained by ASCIIText or Text despite its variance. The cost of this workaround is more false negatives for unsafe modifications of invariant collections of strings.
People will have to write ASCII types for their ASCII-preserving functions (e.g. for a user-defined equivalent of str.upper()).

Gradual Byting

A proposal by Guido described here.

The idea is to distinguish the str in type hints from bytes and Text (unicode in Python 2). Then there are new types compatibility rules:

bytes is compatible with str
str is compatible with bytes
Text is maybe compatible with str (??? undecided yet)
str is compatible with Text

Questions

How the issue with List[bytes] being not a subtype of List[str] is resolved here? (see the similar issue in the ASCII types proposal)
If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?
What should be the type of functions like getattr() that do implicit text to bytes conversion?

What do you think of the workaround for invariant collections of ASCIIBytes?

Could you please answer my questions about the gradual byting proposal?

gvanrossum commented 8 years ago

Writing my answer...

Re: Rationale

Agreed, although your way of describing the changes makes it sound like Python 3 is a step back from Python 2 in this respect; I believe the opposite (PY3 is better than PY2).

The key part is that some things around strings changed and we want to provide a gradual way to convert PY2 via straddling (2+3) to (eventually) PY3.

The proposal is mostly concerned with how to write string types for the straddling case (in a way that works in PY2 and is idiomatic PY3).

Re: ASCII Types

I'm not sure you characterized the proposal the same way as I heard it (as retold by @ddfisher), but you're pretty close.

IIRC the proposal actually made ASCIIText compatible with all bytes, and ASCIIBytes compatible with all Text, in Python 2. So e.g. getattr(obj, name) requires name to be str in Python 3, but in Python 2 it accepts all bytes (as an alias for str) and ASCIIText, but throws UnicodeEncodeError for Text instances containing non-ASCII characters. "Morally" I think it's text, not bytes, since in Python 3 it only accepts text strings, and the best annotation for straddling code is name: str.

As an example that goes in the other direction, in Python 2, s.encode('utf8') works for all Text and for ASCIIBytes, but throws UnicodeDecodeError if s is a str bytes object containing non-ASCII bytes. In Python 3 only text strings have this method. Again s is "morally" text, but we have to type it differently (I'd propose s: Text).

I'd like to point out that in both these examples (taken from the behavior of builtins, there are many more like it) declaring the argument or variable as ASCIIBytes or ASCIIText is inappropriate, since while ASCII characters/bytes are accepted in either persuasion (text or bytes), the full range of one or the other base types (bytes or Text) is still accepted.

Another problem I have with the ASCII proposals is that it emphasizes literals too much. Yes, in examples we often like to write things like getattr(obj, 'xxx') and then it works out nicely that the argument is an ASCIIBytes. But in real code it's much more likely that you're computing a name from some other information and then pass it to getattr(obj, name). That computed name is much more likely to have the (inferred or explicitly declared) type str, or if it came from a unicode-aware computation it could have the type Text. In the latter case, inferring (i.e., proving) through such a computation that the value is in fact ASCIIText is usually too hard.

Re: Gradual Byting

(NOTE: The nickname "gradual byting" may actually be a bad pun, as in the end the compatibility rules are more complicated than those for Any in gradual typing.)

How the issue with List[bytes] being not a subtype of List[str] is resolved here?

(You probably meant supertype, since I propose to make bytes a supertype of str in PY2.)

It's not completely resolved, but I think it's more reasonable to ask people to distinguish between bytes and str in their annotations (and thinking!) than to start introducing the ASCII types (which have no use in PY3 code).

The main reason I find this less of a problem than the corresponding problem with List[str] vs. List[ASCIIBytes] is that we actually have bytes literals in PY2. (It's a little tricky to keep track of them in the parser, but not impossible, and I plan to do it.) So if you meant a list of bytes, use [b'xx', b'yy', b'zz']. Also, making the distinction carefully is useful going forward to pure PY3 code, while distinguishing between ASCII and non-ASCII is artificial (only PY2 cares about it).

If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?

This situation is inherently confusing, because in PY2, sometimes str+unicode works, and sometimes it doesn't. As long as you don't know, you may be better off with Any.

What should be the type of functions like getattr() that do implicit text to bytes conversion?

I think we have two choices: str or Text. Both are "morally" text but in PY2, str has preference for 8-bit characters while Text prefers Unicode characters. Since this is a C function that internally works with 8-bit characters (in PY2) I think it should be typed as str and from this I concede that making Text compatible with str may actually be the right thing to do.

So in the end maybe "gradual byting" is correct. Ideally for straddling code, you should run mypy twice, once with --py2 and once without. The recommendation would then be to strive for the following, in straddling code (or code striving to become straddling):

Use bytes where PY2 has strings that must be bytes in PY3
Use str where PY3 has strings and it's complicated in PY2
Use Text where PY2 uses Unicode

Mypy in --py2 mode would have to learn that bytes <-> str and str <-> Text but not bytes <-> Text, and it would have to assign the correct type based on the form of literal:

b'x' -> bytes
'x' -> str (unless from __future__ import unicode_literals; then Text)
u'x' -> Text (an alias for unicode)

BTW I find from __future__ import unicode_literals an anti-pattern that does more harm than good, and I now recommend against it.
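The literal rules listed above can be sketched as a function over the source text of a literal. `literal_type` is a hypothetical helper, not part of mypy, and it only looks at the literal's prefix.

```python
def literal_type(source, unicode_literals=False):
    """Map the source text of a Python 2 string literal to its proposed type name."""
    body = source.lstrip("bBuUrR")                   # drop any prefix characters
    prefix = source[:len(source) - len(body)].lower()
    if "b" in prefix:
        return "bytes"
    if "u" in prefix or unicode_literals:
        return "Text"
    return "str"
```

So `literal_type("b'x'")` is `"bytes"`, `literal_type("'x'")` is `"str"`, and `literal_type("'x'", unicode_literals=True)` is `"Text"`.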

vlasovskikh commented 8 years ago

The full gradual byting idea (bytes <-> str <-> Text, but not bytes <-> Text) sounds logical and easy to explain. I particularly like the following about it:

You can start with just using str type hints in your Python 2 code base, you'll get no false positives
You can gradually introduce new more precise types bytes and Text and get better type checking, while keeping str in places where it's complicated

(The type checking rules for generic types like List[str / bytes / Text] are still a bit unclear. I guess the idea is to pretend that List[str] is compatible with List[bytes] and List[Text] and vice versa.)

What I don't like is that gradual byting gives up on checking for implicit ASCII conversions. Basically we'll be able to check the two distinctly marked bytes and Text subsets of the program and won't be able to tell anything about implicit ASCII conversions since str is compatible with both bytes and Text.

But it seems that it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2 programs. @gvanrossum @JukkaL Do you agree with it?

I would like to experiment with the idea of gradual byting in PyCharm for the next few days to see if there are any concerns with it. I'll report about my findings later this week.

gvanrossum commented 8 years ago

type checking rules for generic types like List[str / bytes / Text] are still a bit unclear

My current inclination is not to do anything special about these, because List is invariant. Although if str was really analogous to Any here it would indeed work. I think experiments will have to decide whether it's needed.

it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2

Yes, that's the most important use case we have for mypy at Dropbox -- we want our code to become more Python 3 ready.

Mixing bytes and Text is a type error. But it's harder to argue that mixing ASCII and non-ASCII is a type error. I don't like to treat strings containing only ASCII characters as a subtype of bytes or Text, because dynamic sources of characters (other than literals in the source code) don't typically tell you whether they can ever return non-ASCII characters.

As an analogy, let's say we wanted to treat non-negative integers as a subtype of int. If we define a "type" as a set of values, this is certainly a reasonable thought, and non-negative integers are closed under addition and multiplication (just like ASCII strings are closed under concatenation and slicing). But there are few input functions that return only non-negative integers -- int() in particular can definitely return a negative int. So it's hard to enforce the non-negativity of integers being processed by a program without explicit range checks or complicated proofs that a certain algorithm preserves that property.

I feel it's similar for the ASCII-ness of bytes and Text -- a function that reads a string from a file or socket (or from e.g. os.environ) has no particular reason to believe that the file will only contain ASCII characters.

I'm looking forward to the outcome of your experiments. In the meantime I will also try to look into a more complete set of changes to typeshed and mypy.

vlasovskikh commented 8 years ago

I'm done with my experiments with the idea of gradual byting.

I've created a proof-of-concept implementation of gradual byting in PyCharm by modifying __builtin__.pyi from Typeshed and tweaking our type inference engine and type checker. Then I tried to port some real-life code from PY2 to PY2+3 using the modified IDE.

Original Gradual Byting

What I've learned from my experiment with the original gradual byting proposal is that the type checker doesn't help in porting PY2 code to PY2+3. I mean if you already use Text and bytes alongside str then yes, it helps you to some extent.

But overall you get no guidance on how to proceed with porting your code. The type checker doesn't tell you if there are any variables or functions with no type hints. It doesn't promote the use of Text and bytes instead of str. It doesn't catch most of the text/binary data compatibility errors.

Guided Gradual Byting

During my further experiments I came up with the following guided process for making PY2 text/binary data handling more PY3-like. The original gradual byting is a step in this process.

Summary of Proposed Changes

Gradual byting
- Make Text <-> str <-> bytes, but Text and bytes are not compatible
- Infer u'foo' -> Text, 'foo' -> str, b'foo' -> bytes
Introduce typing.NativeStr as an alias for str that says explicitly that it's a native string
Recommend new type checking options
- Warn about implicit Any for declarations
- Warn about str in type hints (use Text / bytes / NativeStr instead)
- Strict str checks (disables gradual byting promotions, use cast() if you're sure)
Recommend extra type checking options for not 100%-typehinted code
- Warn about implicit Any for expressions
- Warn about str literals

Idea

(Note: It includes some pictures, please see the comment on the GitHub page).

Hypothesis: Most of text/binary data in PY2 programs can be converted to either Unicode data or 8-bit data in PY2+3. Native str strings (the ones that are strictly 8-bit in PY2 and Unicode in PY3, i.e. mixed in PY2+3) are the minority. Handling native strings causes many problems while porting from PY2 to PY2+3. Type checkers should make you aware of these problems and provide some help in reducing the amount of native strings.

In PY3 you have a clear separation of text and binary data: they are not compatible with each other. In PY2 things are complicated because of the implicit conversion between text and binary data using the ASCII encoding.

If you want to make your PY2 code PY2+3 compatible (straddling), you have to make it more PY3-like in respect of text/binary data separation. A good way to proceed is to start putting type hints into your code so that a type checker would be able to check your code for correctness.

The steps of the proposed approach to porting are described below. You may start with no or some type hints in your code. You may proceed module-by-module or modify the whole program at once.

1. Add type hints for all declarations in your code

Use the "Warn about implicit Any for declarations" type checker option to get notified of all the places where type hints are missing.
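To make the idea of this option concrete, here is a minimal sketch of such a check built on the standard `ast` module. It is not PyCharm's or mypy's implementation: `unannotated_functions` is a hypothetical helper, it only looks at positional parameters and the return type, and it assumes Python 3 annotation syntax rather than Python 2 type comments.

```python
import ast

def unannotated_functions(source):
    """Return the names of functions missing a parameter or return annotation."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            params = node.args.args
            if node.returns is None or any(p.annotation is None for p in params):
                missing.append(node.name)
    return missing
```

Each reported function would otherwise be checked as if its unannotated parts were Any.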

For text and binary data you have the following options:

Use Text when data is Unicode in PY2 and PY3
Use bytes when data is 8-bit string in PY2 and PY3
Use AnyStr when your code works with both Unicode and 8-bit strings in PY2 and PY3 as long as all the function arguments are of the same type (or use Union[Text, bytes] if it doesn't matter)
Use NativeStr when data is Unicode in PY3 and 8-bit string in PY2 (see the footnotes about NativeStr)
Use str otherwise (if things are complicated)
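A short sketch of what the first three options might look like in annotated code. The function names are made up for illustration, and the annotations use Python 3 syntax like the other examples in this thread; actual Python 2 code would need type comments instead.

```python
from typing import AnyStr, Text

def greeting(name: Text) -> Text:
    # Unicode in both PY2 and PY3.
    return u"Hello, " + name

def checksum(payload: bytes) -> int:
    # 8-bit data in both PY2 and PY3; bytearray yields ints on both versions.
    return sum(bytearray(payload)) % 256

def double(s: AnyStr) -> AnyStr:
    # Works for text or bytes, as long as input and output are the same kind.
    return s + s
```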

2. Remove all `str` entries in type hints in favor of `Text`/`bytes`/`NativeStr`/etc.

Use the "Warn about str in type hints" type checker option to get notified of all the remaining occurrences of str in type hints.

Go through all the places you marked with str as complicated and figure out which of the text/binary types is actually appropriate here.

The purpose of this step is to make the native string subset as small as possible since a) native string operations are the hardest to port; b) your code will look more PY3-like with mostly Text and bytes subsets.
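A check along the lines of the "Warn about str in type hints" option could be sketched with the standard `ast` module. This is an assumption-laden illustration, not an existing checker feature: `str_hints` is a hypothetical helper that inspects only positional parameters and return annotations written in Python 3 syntax.

```python
import ast

def str_hints(source):
    """Return (function name, line number) for each annotation that mentions str."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        hints = [p.annotation for p in node.args.args] + [node.returns]
        for hint in hints:
            if hint is None:
                continue
            # Walk the annotation so nested uses such as List[str] are found too.
            if any(isinstance(n, ast.Name) and n.id == "str" for n in ast.walk(hint)):
                hits.append((node.name, hint.lineno))
    return hits
```

Each hit marks a place where Text, bytes or a native-string alias still has to be chosen.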

3. Enable strict `str` checking

Use the "Strict str checks" type checker option to enable strict separation of Text, str, and bytes data in your code.

The remaining type checker warnings at this step show the most tricky parts of your text/binary data handling code that has to be carefully written to become PY2 and PY3 compatible. It may involve:

Re-categorizing your values into Text, NativeStr, and bytes again
Doing things differently for PY2 and PY3 inside if PY2 conditions
Explicitly casting types using typing.cast(<type>, <value>) if you're sure what you are doing and you need a way to make the type checker happy
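The last two items might be combined as in the sketch below. The `PY2` flag and the `to_native` helper are illustrative names, and the choice of UTF-8 is an assumption; `typing.cast` has no run-time effect and only informs the type checker.

```python
import sys
from typing import Text, cast

PY2 = sys.version_info[0] == 2

def to_native(value: Text) -> str:
    """Convert text to the native str type of the running Python version."""
    if PY2:
        # Python 2: native str is 8-bit, so encode explicitly (UTF-8 assumed)
        # and tell a strict checker that the bytes result is meant as str.
        return cast(str, value.encode("utf-8"))
    # Python 3: native str is already text.
    return value
```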

Extra options

These type checker options might be helpful during PY2 to PY2+3 porting if your code is not 100% type hinted:

Warn about implicit Any for expressions as well
Warn about str literals

I found these extra options very useful when you have more modules to port or you use third-party libraries with no type hints / stubs.

Footnotes

PyCharm doesn't support the stubs from Typeshed yet, it's still a work in progress.

The idea of an option for warnings about declarations with no type hints comes from the --noImplicitAny option of the TypeScript compiler. It is used heavily in the TypeScript community for testing the TypeScript stubs of untyped JavaScript libraries.

typing.NativeStr is a new type needed to help people get rid of ambiguous str (is it really a native string or is it a marker that things are complicated?). It could be an alias to str. There should be an option to warn about any str and unicode entries in type hints.
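Since `typing.NativeStr` was only proposed in this thread and is not part of the `typing` module, the same effect can be approximated with a user-defined alias. In this sketch both the alias and the `header_name` function are hypothetical.

```python
# Hypothetical alias: typing.NativeStr does not exist, so define it locally.
NativeStr = str

def header_name(raw: NativeStr) -> NativeStr:
    # A native string: 8-bit str in PY2, text str in PY3.
    return raw.title()
```

To a type checker the alias is identical to str, so its only purpose is to record that the native string type was chosen deliberately.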

vlasovskikh commented 8 years ago

@gvanrossum @JukkaL I'm looking forward to your feedback.

gvanrossum commented 8 years ago

Sorry, I'm tied up at the core python sprint this week. I hope to have time next week!

JukkaL commented 8 years ago

@vlasovskikh Thanks for the detailed write-up! Your approach sounds mostly reasonable. If @gvanrossum agrees, hopefully we can experiment with it and mypy.

A few things I'm not sure about:

1) The str / NativeStr distinction

This could be useful during migration, but I'm not sure if users will find it easy to understand. An alternative would be to propose that users define a similar type alias by themselves instead of including it in typing.

2) Strict str checking

The implications of a mode that requires casts between str and other string types for Python 2 are still unclear to me. Stubs would potentially require things like Union[str, Text] in places for things to work seamlessly, without needing seemingly redundant casts from str to Text/bytes when interacting with library modules. I'm not sure how much of a problem this would be. Also, we'd need a separate stub for the str class in this mode.

3) AnyStr

Would AnyStr range over str, bytes and Text? Functions that use AnyStr would likely now be a little tricky to write in some cases. Consider this function:

def f(x: AnyStr) -> AnyStr:
    return x + 'a'

This would be fine in Python 2 mode but it wouldn't work in Python 3 or strict str checking mode. Here's a straightforward straddling implementation that actually wouldn't work, since given a str argument, the return type would be bytes, not str in Python 2 mode:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    else:
        return x + b'a'

This may have to be written like this, which seems a bit excessive but perhaps still reasonable:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    elif isinstance(x, str):
        return x + 'a'
    else:
        return x + b'a'

It seems that the final example would also work in the strict str checking mode.
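For reference, here is the three-branch variant as a self-contained module with its imports. This only shows the runtime behavior; how a checker would treat each branch under the proposed rules is still the open question discussed above. On Python 3, Text is str, so the middle branch is never taken.

```python
from typing import AnyStr, Text


def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    elif isinstance(x, str):
        # Reachable only on PY2, where str is distinct from Text (unicode).
        return x + 'a'
    else:
        return x + b'a'


text_result = f(u'x')    # takes the Text branch
bytes_result = f(b'x')   # takes the final (bytes) branch on PY3
```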

vlasovskikh commented 8 years ago

@JukkaL Replying to your points:

1) The str / NativeStr distinction

I also thought about not introducing typing.NativeStr and suggesting that users create their own aliases.

An advantage of having typing.NativeStr is that if we tell people to get rid of str (since this type doesn't provide a clear distinction between text and binary data) we had better suggest some easy-to-use alternatives. "Add your own alias to str and use it instead if you want to suppress type checker warnings" sounds less persuasive than "Get rid of all str and use typing.NativeStr if you really need native strings".

On the other hand, when a person enables the strict str checking, it doesn't matter if they had typing.NativeStr or their custom MyStr or they still have str. It will be a type checking error in this mode anyway.

So I'm not sure about this one. It just looked to me as a convenient step in the porting process between having no distinction between bytes <-> str <-> Text and enforcing strict str checking.

2) Strict str checking

The point is that by default str checking is not strict, i.e. gradual byting is the default. So any questions about problems with strict str checking are really about this specific mode when a person is trying to make their code more PY2+3 compatible.

I believe that the standard library (or in fact any code) should not have type hints that mix str with any other types (e.g. Union[str, Text] shouldn't be used). Could you please give any examples from the standard library where such type hints could be useful / necessary?

The strict str checking mode helps to find and fix those places where developers haven't decided what data they really handle in their code: Text or bytes or both (conditionally) or (on rare occasions) native strings. Therefore the strict str checking mode shouldn't let these places go unnoticed. We have the default gradual byting mode for that.

Yes, I imply that Text, bytes, and str all have separate stubs as in the original gradual byting proposal.

3) AnyStr

I'm not sure about this one.

In the default gradual byting mode it doesn't really matter if AnyStr ranges over str / Text / bytes or just over Text / bytes.

As for the strict str checking mode it looks like AnyStr could still range just over Text and bytes, but maybe there should be a special type checking rule that str is still compatible with AnyStr as AnyStr covers all the variants.

JukkaL commented 8 years ago

I believe that the standard library (or in fact any code) should not have type hints that mix str with any other types (e.g. Union[str, Text] shouldn't be used). Could you please give any examples from the standard library where such type hints could be useful / necessary?

I'm thinking about cases where the code is using NativeStr. I'm not sure what would be the most common use cases for NativeStr, though. getattr might be one, but the latest plan is to give the second argument the type Text, I think, so perhaps using Text and u'x' literals with getattr and friends is okay. However, idiomatic code would use str literals, so here support for NativeStr seems reasonable.

Yes, I imply that Text, bytes, and str all have separate stubs as in the original gradual byting proposal.

But also the stub for str would likely be different when using strict str checking, so that u'foo' + 'bar' would be disallowed in the latter mode but not by default. When using normal checking str.__add__ could accept Text arguments, but when using strict str checking, it would only accept str arguments.

As for the strict str checking mode it looks like AnyStr could still range just over Text and bytes, but maybe there should be a special type checking rule that str is still compatible with AnyStr as AnyStr covers all the variants.

Would this mean that AnyStr would range over Text and bytes when type checking generic functions where AnyStr is bound, but when calling such a generic function, AnyStr could be substituted with str as well? If yes, that sounds reasonable.

Example:

def f(x: AnyStr) -> AnyStr:  # OK, here AnyStr ranges over bytes and Text
    if isinstance(x, Text):
        return x + u'a'
    else:
        return x + b'a'

y = f('x')  # OK, here str is also valid. The type of y would be str.

gvanrossum commented 8 years ago

@JukkaL and I chatted "offline" about this and I'm writing up a few thoughts.

I don't like NativeStr much. Maybe it could be named String? The idea would be that PY3.bytes always maps to PY2.bytes, but PY3.str maps to PY2.String or PY2.Text depending on whether PY2 usage is focused on 8-bit characters or Unicode.
mypy has a few options already for rejecting "default Any" in definitions and in calls; while there are some corner cases where we could be even more strict, I don't think mypy needs more flags to encourage users to add annotations. (PyCharm may need more flags though? That's not the topic of this issue though.)
I don't like another flag to control how strictly to check strings; flags are inflexible, I'd much rather see code select the level of checking through the types they use in annotations.
Perhaps this would work? bytes, String and Text are checked "strictly", i.e. they don't mix. Even bytes+String should be an error. AnyStr becomes a type var with three value restrictions (bytes, String, Text). But (in PY2) str is "the Any of strings" and is compatible with all of these. In PY3, str==String==Text.
If a union is required, it should be Union[bytes, String, Text] or Union[String, Text], never Union[str, Text]. Note that in PY3, Union[String, Text] becomes just Text (IOW just str).
Before we can do a decent experiment with mypy, we'll have to fix the typed-ast ast27 lexer to distinguish between b'x' and 'x' -- b'x' would be bytes, but 'x' would be String.
Not clear what to do for the stubs; I'm not sure what pytype does with Text but they could do the same thing with String.

vlasovskikh commented 8 years ago

Thanks for your comments. I will reply the next week as soon as I'm back from PyCon JP.

vlasovskikh commented 8 years ago

@gvanrossum Yes, your idea of having a separate type String for "native" strings which doesn't mix with others is very similar to my proposal. And in PY2 AnyStr may range over it as well.

I'm a bit worried about the name String though. It may be unclear to those who are new to this problem why we keep adding new things that are synonyms for str: AnyStr, Text, and now String. NativeString says at least a bit more about the purpose of this type. Also this terminology of "native" strings is already used, for example, in PEP 414: Explicit Unicode Literal for Python 3.3. This PEP criticizes some parts of PEP 3333: WSGI 1.0.1 related to using Latin-1 in "native" strings, but suggests the useful distinction of PY2+3 compatible text strings / "native" strings / binary data.

Regarding the mentioned flags / options, they could be PyCharm-specific. I don't propose to make them a part of the specification.

To sum up the updated proposal:

Make these types incompatible with each other in PY2 and PY3
- Text strings: Text
- "Native" strings: String (or NativeString, or something else?)
- Binary data: bytes
Make str in PY2 compatible with text strings and native strings and binary data
- Add a separate stub for class str in typeshed to distinguish it from the others
Make AnyStr range over text strings / native strings / binary data

Those who want to port their code to PY2+3 should follow these steps:

Add type hints for all declarations
Remove all usages of unicode and str as not PY2+3 compatible

What do you think about it? If this idea is concrete enough for experimenting maybe I should write a separate PEP about it or update PEP 484 in a pull request?

gvanrossum commented 8 years ago

OK, I found more references to "native string", e.g. http://python-future.org/what_else.html#native-string-type so let's use NativeString. (It always means "str" there.) We also now have typed_ast 0.6.0 which tells us the difference between b'x' and 'x' in PY2 mode.

I guess the next step is an experiment in mypy and typeshed. In typeshed, for PY2 we need three separate types: bytes, NativeString, Text==unicode, each incompatible with both others. (But the behavior of bytes and NativeString should be the same.) And in mypy, for PY2 str is a magical type that somehow is compatible with bytes, NativeString and Text==unicode. I'm on board with this experiment and I think it's what you propose for PY2. I don't know exactly how to code up the "str is compatible with all three" part in mypy, but I think I can figure it out.

But I'm hesitant to make NativeString a separate type in PY3 -- I think it's simpler to just make it an alias for Text==str, and I don't see much downside. My ideal is not to mess with PY3 at all (except for adding an alias).

More details for PY2 mode: the types of literals should determine the type, i.e. b'x' -> bytes; 'x' -> NativeString or Text==unicode depending on from __future__ import unicode_literals; u'x' -> Text==unicode.
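The literal rule above can be written down as a small lookup. This is only an illustration of the mapping, not how mypy or typed_ast implement it; py2_literal_type is a hypothetical helper that takes the literal's source text and returns the proposed type name.

```python
def py2_literal_type(literal, unicode_literals=False):
    # type: (str, bool) -> str
    """Return the proposed PY2 type name for a string literal's source text."""
    prefix = ""
    for ch in literal:
        if ch in "'\"":
            break  # the prefix ends at the opening quote
        prefix += ch.lower()
    if "b" in prefix:
        # b'x' is bytes regardless of unicode_literals.
        return "bytes"
    if "u" in prefix or unicode_literals:
        return "Text"
    return "NativeString"
```

For example, py2_literal_type("'x'") gives "NativeString", while the same literal with unicode_literals=True gives "Text".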

Now, conversions. Should these always be done using cast()? That's simplest to implement, but painful to write. Maybe instead of cast(NativeString, x) we should allow NativeString(x)? I guess we'll consider that as a bonus feature if these casts are common.

Finally, str() used as a function. This is all over the place and we can't tell users to change this to NativeString() or Text() -- ideally all these string types should be only used in annotations. But this would go against your recommendation: "Remove all usages of unicode and str as not PY2+3 compatible".

I also don't want to have to modify many stubs to use NativeString instead of str -- there is a trend where more and more stubs are shared between PY2 and PY3 (the 2and3 directories in typeshed) and I don't want to have to change those. So I'm still on the fence about how well all this will work out. But I'll come up with a working branch of mypy+typeshed that can do this (possibly even a mypy flag to turn this on or off).

JukkaL commented 8 years ago

What about a str class in typeshed for Python 2? I think that it needs to be separate from the other 3.

Perhaps the type of str(x) could be str in Python 2? It would be consistent with how everything else works, and the implicit coercions would make it Just Work, hopefully (though with some loss of type safety). For the other types, users can perhaps use bytes(x), Text(x) or NativeString(x) (okay, the last one is kind of ugly).

Also, if NativeString is needed a lot, I wonder if it should be called NativeStr instead? It maps to str at runtime after all, and spelling it using the short form would make that a little more explicit.

gvanrossum commented 8 years ago

I'm fine with NativeStr. I suppose we could have a str type too, with identical definition as NativeStr, and just have mypy itself take care of the aliasing. (Also don't forget basestring! I guess all string types should inherit from it, and it should continue not to have any methods of its own.)

Though there are a lot of places where it currently hardcodes builtins.str, e.g. the type of dict(x=1) is Dict[str, int] where the str is hardcoded. Perhaps related, when you call a function using f(**d), the type of the argument d must be a subtype of dict[str, Any].

Also, I'm starting to look at the necessary typeshed changes here, and I'm getting a bit worried about the practical implications here. The existing str and unicode types, as well as bytearray, actually have lots of cross-references. Some of these are expected, e.g. unicode.encode() returns str and str.decode() returns unicode. I guess in the new system, unicode.encode() should return bytes and everybody's decode() should return unicode.

But there are lots of weird cases too, e.g. encode() and decode() take an encoding name, which itself is currently declared as unicode -- it can be str or unicode, and the implicit promotion makes unicode the most convenient type to use (not AnyStr, nor Union[str, unicode]).

But there are more weird cases! str.count() takes a unicode arg, and so does unicode.count(), and both implicitly allow str arguments too, because of the str->unicode promotion. In fact this is pretty common, also found for endswith(), find(), index(), lstrip(), partition(), rfind(), rindex(), rpartition(), rsplit(), split(), startswith(), strip().

Finally str.__lt__ and __le__, __gt__, __ge__ take a unicode argument. (But __eq__ and __ne__ take an object.)

There's also some explicit support for bytearray, where partition() and rpartition() are overloaded three ways to take str, unicode and bytearray (and the argument type reoccurs in the return type).

Oh and of course there are also some uses of AnyStr, e.g. str.__add__ takes an AnyStr and returns the same. (I'm curious why there's no __radd__?)

Sorry about the long low-level rant, I'm just concerned that if we define separate and mutually incompatible classes bytes, str, NativeStr and unicode==Text, it's hard to maintain all these and to make sure that all combinations do something reasonable. The problem is that it's not just a set of rules for "when types X and Y meet, the result is Z". There need to be separate rules for each method, or at least for each of various categories of methods. In a previous incarnation I tried to do this for just str, bytes and unicode, where str was a subclass of bytes, and even there it was complicated to know whether I was doing it right. Maybe it's a little simpler without explicit subclass relationships. But I don't know exactly how and when mypy's promotion code works. (I will study it.)

gvanrossum commented 8 years ago

FWIW here's the tip of the iceberg:

vlasovskikh commented 8 years ago

Replied in PR https://github.com/python/typeshed/pull/580.

gvanrossum commented 8 years ago

I'm really sorry for dropping this. I'm afraid I cannot bring up the energy to move this issue forward. It hasn't felt important recently, the necessary changes to typeshed and mypy didn't feel straightforward enough, and now I think the status quo is probably fine. At least we have ways to refer to bytes and unicode (Text) in both versions, which really helps in writing straddling code (though not necessarily in type-checking).

There's a team at Dropbox that's experimenting with straddling code, their approach is simply to run mypy twice, once in Python 2 mode and once in Python 3 mode. It seems to work well enough for them.
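A rough sketch of the "run mypy twice" approach as a small driver script. It uses mypy's --python-version option; the 3.6 target and the helper names are assumptions of this sketch, and the 2.7 target is accepted only by mypy releases that still include Python 2 support.

```python
import subprocess
import sys


def mypy_commands(paths):
    # type: (list) -> list
    """Build one mypy invocation per target Python version."""
    base = [sys.executable, "-m", "mypy"]
    return [
        base + ["--python-version", "2.7"] + list(paths),
        base + ["--python-version", "3.6"] + list(paths),  # assumed PY3 target
    ]


def check_both(paths):
    # type: (list) -> bool
    """Run both checks (even if the first fails); True only if both pass."""
    results = [subprocess.call(cmd) for cmd in mypy_commands(paths)]
    return all(code == 0 for code in results)
```

Running both checks unconditionally reports errors for each mode in one pass, which is usually what you want while porting.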

I don't want to close the issue -- unlike "stuck" PRs, languishing issues don't bother me quite as much. Maybe someone else wants to work on some of the tricky implementation issues, e.g. having separate bytes and str types in Python 2, using them consistently in all the other stubs, and making mypy handle the various combinations correctly. But I recommend trying to make very small steps at first -- I think I bit off too much with my various attempts at introducing a separate bytes type in typeshed. (It's amazing how subtle the reasoning is about whether something "really" takes a str or a bytes.)

JukkaL commented 8 years ago

It's actually pretty interesting that the simple-minded approach that mypy uses (making str a subtype of unicode in Python 2 mode) actually seems to have caused very little trouble for our users. Though clearly it only finds a subset of potential str/unicode bugs.

Herst commented 8 years ago

(making str a subtype of unicode in Python 2 mode) actually seems to have caused very little trouble for our users

It caused troubles (or at least it didn't help alleviate them) with mixing str and unicode in a project I'm working on, see https://github.com/python/mypy/issues/2182

markshannon commented 8 years ago

I don't like the profusion of string types with odd semantics.

What about the following? Basically we define the types reflecting the actual MROs:

if PY2:
    class basestring(object): ...
    class str(basestring): ...
    class unicode(basestring): ...
    bytes = str
else:
    class bytes(object): ...
    class str(object): ...

Then, to deal with the implicit promotion of str to unicode in Python 2, we need a bunch of tables/stub files, plus an AsciiBytes type. E.g.

if PY2:
    @overload
    def __add__(a: unicode, b: unicode) -> unicode: ...
    @overload
    def __add__(a: AsciiBytes, b: unicode) -> unicode: ...
    @overload
    def __add__(a: unicode, b: AsciiBytes) -> unicode: ...
    @overload
    def __add__(a: str, b: unicode) -> unicode: ...  # May warn if using this
    @overload
    def __add__(a: unicode, b: str) -> unicode: ...  # May warn if using this
    @overload
    def __add__(a: AsciiBytes, b: AsciiBytes) -> AsciiBytes: ...
    @overload
    def __add__(a: str, b: str) -> str: ...
else:
    @overload
    def __add__(a: str, b: str) -> str: ...
    @overload
    def __add__(a: bytes, b: bytes) -> bytes: ...

As for the "text in spirit", that is just str, is it not?
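To make the motivation for an AsciiBytes type concrete: the implicit str to unicode promotion in Python 2 decodes with the default ASCII codec, so it succeeds only for ASCII-only byte strings. The sketch below mimics that with an explicit decode so it also runs on Python 3; both function names are hypothetical.

```python
def implicit_py2_promotion(data):
    # type: (bytes) -> str
    """Mimic PY2's implicit str -> unicode coercion (default ASCII codec)."""
    return data.decode("ascii")


def is_ascii_bytes(data):
    # type: (bytes) -> bool
    """True if the promotion above would succeed for this value."""
    try:
        implicit_py2_promotion(data)
    except UnicodeDecodeError:
        return False
    return True
```

A value such as b"abc" passes, while UTF-8 encoded b"caf\xc3\xa9" fails. That runtime difference is what the "May warn" overloads above are meant to flag statically.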

gvanrossum commented 8 years ago

The daunting part is that Python's string types have many methods and they aren't all that regular. And the rules for what exactly to warn about are also subtle.

And no, "text in spirit" is not just str -- it's unicode, and some but not all uses of str. (See https://github.com/python/typing/issues/208#issuecomment-240556782 I think.)

vlasovskikh commented 8 years ago

@gvanrossum @JukkaL Yes, having to run a type checker twice for PY2 and PY3 is not that convenient, but generally it should help during porting, especially with (type checker dependent) options for disallowing untyped stuff.

If we give up on the idea of finding implicit ASCII conversion errors in PY2 code, why don't we make bytes and Text incompatible in both directions and allow str to be compatible with everything?

Having bytes and Text as separate types would help porting to PY2+3 to some extent (as shown in my figures above). For any "native" strings or in "complicated" situations people could use str then. And at some point during porting they could double-check their usages of str across their type hints to make sure that str is used correctly with respect to PY2+3.

Current mypy's behavior (promote str to unicode for PY2, but not the other way round) is rather unexpected:

from typing import Any

def f_str(x):
    # type: (str) -> Any
    pass

def f_bytes(x):
    # type: (bytes) -> Any
    pass

f_str(u'unicode')  # False error!
f_bytes(u'unicode')  # True error

Basically I'm proposing to return to this figure and stop at it:

plus

u'foo' -> Text
'foo' -> str
b'foo' -> bytes

vlasovskikh commented 7 years ago

@gvanrossum @JukkaL Could you please take a look at my latest comment?

PyCharm is moving towards adopting typeshed and I'm a bit worried about the underspecified behaviour of type checkers for text and binary data in Python 2. I have in mind an idea of the common test suite for mypy, PyCharm, and other type checkers to ensure compatibility (beyond syntactic correctness currently checked by tests/mypy_test.py and others). This str/bytes/unicode issue is one of the problems still unresolved in PEP 484.

gvanrossum commented 7 years ago

Sorry for the lack of response. Lots going on...

why don't we make bytes and Text incompatible in both directions and allow str to be compatible with everything?

Assuming that proposal is for Python 2, that's pretty much what I settled on as my last proposal for giving up. It's also why we changed typed_ast to distinguish between Bytes, Str and Unicode.

The problem with this is not just the question of how to define the str and bytes classes in typeshed (I started on this but found it tricky) nor how to treat them in mypy (that can be done, though I haven't looked into it) but most of all the decisions about where to use str or bytes everywhere else in typeshed. That seems a daunting task, especially since most Python documentation doesn't specify types. I suppose you can turn them all into str and rely on the "str is compatible with everything" rule, but if we do it that way we might as well define unicode = str in __builtins__.pyi and be done with it...

vlasovskikh commented 7 years ago

@gvanrossum Yes, I meant having incompatible bytes and Text, and str compatible with everything just for Python 2.

Since 2013 we had our repository of stubs with a PyCharm-specific legacy syntax for type hints. We solved the problem of text and binary data for Python 2 stubs (but not the ASCII conversions part) by introducing string as an alias that is compatible with both str and unicode. Our practice for annotating stuff from the standard library was that we almost always used string for the types of parameters and we used str and unicode where a function always returns a particular type, resorting to string otherwise.

We could use a similar approach for typeshed where we annotate parameters with str and try to be more specific (i.e. use Text or bytes) for return values when possible (and use str when it isn't).
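A small sketch of what that annotation style might look like, using made-up helper functions rather than real typeshed stubs: parameters stay loosely typed as str, while return values are Text or bytes wherever the function always produces one of them.

```python
import io
from typing import Text


def read_text(path, encoding="utf-8"):
    # type: (str, str) -> Text
    """Parameters are loosely typed as str; the result is always text."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def read_binary(path):
    # type: (str) -> bytes
    """The result is always binary data, so the return type is bytes."""
    with io.open(path, "rb") as handle:
        return handle.read()
```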

JukkaL commented 7 years ago

Basically the new proposal seems to drop NativeStr as a separate type, but otherwise it's like the one before it? Here's a summary of how I understand it:

Python 2 type checking:

For text, use Text or unicode (same as now) and u'foo' literals.
For binary data, use bytes and b'foo' literals.
For cases where Python 2 and 3 have incompatible text types (or legacy code that isn't quite unicode clean) str and 'foo' literals may be used.
AnyStr would range over str, bytes and unicode. (The rules would likely be subtly more complex than this, but we can probably figure it out.)
str would be compatible with bytes and unicode. Both bytes and unicode would be compatible with str. bytes and unicode aren't compatible with each other.
- Compared to current mypy rules, we'd add unicode -> str compatibility (currently it's one way only: str -> unicode).
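The compatibility rules in this summary are small enough to spell out as a table. The sketch below is a toy model of the relation using type names as strings, not mypy code; is_compatible and the pair table are hypothetical.

```python
# Undirected pairs of PY2 types treated as compatible in both directions.
_COMPATIBLE_PAIRS = {
    frozenset(["str", "bytes"]),
    frozenset(["str", "unicode"]),
}


def is_compatible(source, target):
    # type: (str, str) -> bool
    """True if a value of type `source` may be used where `target` is expected."""
    return source == target or frozenset([source, target]) in _COMPATIBLE_PAIRS
```

Note that the relation is not transitive: bytes is compatible with str and str with unicode, but is_compatible("bytes", "unicode") is False. That is the same shape as the way Any behaves in the type system as a whole.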

Python 2 stub changes:

We'd add a new bytes class to Python 2. str could be mixed in operations with bytes, similar to how they can be mixed with unicode already. bytes and unicode wouldn't mix.
Potentially every reference to str in Python 2 stubs would need to be checked, or we'd risk making the stubs less precise.
- My first idea would be to, as an initial approximation, just switch all argument and attribute types to bytes, and return types could remain as str. This would likely not break a lot of existing code that uses mypy. Interestingly, this is pretty much the opposite of the PyCharm practice.
- If a function clearly returns binary data, we should probably use a bytes return type.
Some unicode types in stubs would likely have to be changed to Union[bytes, unicode].

Type checking Python 3 would not change.

There are plenty of str references in Python 2 stubs:

$ ag '\bstr\b' stdlib/2 stdlib/2and3 third_party/2 third_party/2and3/|wc -l
    2477

unicode and Text are less frequent:

$ ag '\b(Text|unicode)\b' stdlib/2 stdlib/2and3 third_party/2 third_party/2and3/|wc -l
     448

vlasovskikh commented 7 years ago

@JukkaL Yes, my understanding of the proposal matches your description.

As for the proposed changes in typeshed, @JukkaL could you elaborate on your idea of using bytes by default for parameters and attributes in Python 2 stubs?

I would say that in Python 2 it's almost always legal to pass both text and binary data to both a function that expects text and to a function that expects binary. (Assuming we won't try to catch implicit ASCII conversions). So I would do it the other way round: use str by default for parameters and switch to bytes or Text if it is always an error to pass a value of the other type. What's your opinion on that?

JukkaL commented 7 years ago

@vlasovskikh I agree that your proposal would result in fewer false positives and is worth considering.

My idea was to generalize how mypy currently works, and to catch an interesting subset of dangerous implicit ASCII conversions. Mypy doesn't consider unicode to be valid argument to a function expecting a str, so changing the argument type to bytes would change little -- unicode would still not be accepted. Similarly, if a function currently returns str, mypy lets you pass the return value to a function expecting unicode. If we preserve the return value type as str, this wouldn't change.

Mypy currently doesn't work well with unicode_literals. If that is an important use case, your proposal is likely better. Also, mypy behaves a little arbitrarily -- it catches implicit conversions from unicode to str, but not vice versa, even though both can fail at runtime. The implicit hypothesis here is that code explicitly using unicode or u'foo' literals is likely to encounter non-ASCII values, whereas code using just plain str is more likely to be ASCII only, and thus implicit conversions are more likely to be safe from str than from unicode.

If we follow your proposal, I'd expect most existing code designed for mypy to still pass type checking, but mypy would catch fewer string conversion errors. Users could switch some str types in their code to bytes, and change some 'foo' literals to b'foo' literals. This would let mypy catch additional errors, and it would likely also make their code more Python 3 compatible.

It seems that having precise (non-str) return types in stubs would work together with imprecise argument types. We can have precise argument types OR return types, but not both, as then we'd reject too many implicit conversions.

vlasovskikh commented 7 years ago

@JukkaL So we agree on everything except for how we should change the types of arguments and return values in the typeshed stubs to reflect the proposal about separate bytes / str / Text and the compatibility rules discussed above.

Yes, I see your point about changing str to bytes being compatible with how mypy works for now. My objections mostly come from the fact that it's quite unsatisfying for the user to see false positive errors. We could proceed by changing str in arguments to bytes and keeping str in return values and then fix all the incoming issues about false positives. (I'm working on the common test data for typeshed for that, I'll share my prototype with the typeshed contributors in a few weeks). Another possibility is to start with str in the arguments and fix false negatives. From our perspective, the latter is more preferable, but I don't object to the first option assuming we can easily fix typeshed stubs when we see a false positive error. What are your thoughts?

markshannon commented 7 years ago

Please, please can we not mangle typeshed to work around current inadequacies in mypy. typeshed should be universal and its types need to be correct, even if they are inconvenient.

In case I am missing anything, are there any examples of standard library code involving text that cannot be correctly stubbed with the simple types str, unicode, bytes and AsciiBytes? [ I am assuming that sys.setdefaultencoding() is not called and that the implicit conversion AsciiBytes->unicode is OK, but that the conversion bytes->unicode is an error. ]

A few concrete examples might help this discussion.

Modifying the type checker to handle different default encodings is left as an exercise :smile:

JukkaL commented 7 years ago

@markshannon Which current inadequacies in mypy shouldn't we work around? Being correct, whatever that means in this context, is not the only goal for stubs -- tools will also need to be able to type check effectively using the stubs without confusing users. Additionally, we have users using the current typeshed in production, and we don't want to cause them a big headache by requiring a lot of rework of existing annotations, if we can avoid that. We also need a volunteer to implement the proposal, including any necessary typeshed changes.

We gave up on the AsciiBytes idea, in part because annotating everything that may preserve the 'asciiness' of arguments precisely would be a huge hassle.

vlasovskikh commented 7 years ago

@markshannon We are discussing the opposite: how to make text vs binary data in Python 2 compatible with mypy, PyCharm, and other tools so that typeshed is universal for everyone and developers have the tools to annotate their PY2+3 text and binary data in a compatible way.

As @JukkaL mentioned, we failed to come up with a good working proposal about catching implicit ASCII conversions, but we still see value in having PY2+3 text and binary data standardized.

It turns out we've come up with a proposal that @JukkaL and I find satisfying. The remaining question is how to migrate typeshed to these new text / binary types.

JukkaL commented 7 years ago

Here's one idea for migrating typeshed.

First phase:

Initially just add bytes to typeshed and update str to mix with bytes. Update AnyStr.
Leave all other existing annotations as they are -- i.e., mostly str / unicode.
Update mypy to support the new compatibility rules, and support b'foo' literals in Python 2 mode.
Experiment with various codebases to evaluate whether the idea looks feasible. Try type checking the same code in Python 2 and 3 modes. It's still possible that something goes wrong here.

Second phase:

Decide what to do with str types in typeshed. Gradually migrate typeshed annotations to use whatever scheme we come up with.

gvanrossum commented 7 years ago

I'm happy to go along if we can work something out. I do have some questions about the actual class definitions for bytes, str and Text, and their subclassing relationships.

The following is entirely in the context of PY2 -- in PY3, we have bytes != str and str == Text (and bytes != Text follows), but in PY2 the proposal is to have bytes == str, str == Text, and bytes != Text.

This violates some transitive property, but that in itself doesn't bother me, since it's similar (at a smaller scope) to Any (e.g. int == Any, Any == str, but int != str, using '==' to spell "is compatible with in both directions").

I expect that the proposal will actually make type-checking involving Text weaker, since the following example currently fails, but should be accepted under the new rules:

a = ''
b = u''
a = b  # Error here
b = a  # This is OK

I guess I am okay with that since in reality it's often true that passing unicode where str is expected will work.

Some more questions: If I have something of type List[str] it should be compatible with List[bytes] and List[Text] right? Because List[Any] is compatible with List[int].

gvanrossum commented 7 years ago

Re: Jukka's proposal: ISTM the four bullets in the first phase must all be closely coordinated -- after the first bullet is done (add separate bytes and str classes to typeshed and update AnyStr) there is code that will fail to type-check in the current version of mypy.

This is pretty much what I have attempted so far, and what prompted me to give up since I couldn't see how to easily change mypy according to the 3rd bullet. I will resurrect my branches and turn them into PRs so you can see.

JukkaL commented 7 years ago

@gvanrossum It would be logical for List[str] to be compatible with List[bytes] and List[Text], similar to how Any behaves. If that turns out to be non-trivial to implement in mypy, for the first experiments this could be left out.

For code that only uses str and unicode, the proposed rules seem to be equivalent to str being an alias for unicode, or at least pretty close. However, once we add some bytes types we will be able to do useful type checking.

gvanrossum commented 7 years ago

OK. Note that mypy already distinguishes between b'x' and 'x' in PY2 mode -- it gives them the types builtins.bytes and builtins.str, respectively. It's just that currently bytes is an alias for str.
