Closed gvanrossum closed 4 years ago
A new proposal: use the existing triad bytes-str-unicode.
Let me try to explain the new proposal with more care.
I am most interested in solving this issue for straddling code; my assumption is that most of the interest in type annotations for Python 2 has to do with that. (This is the case at Dropbox, and everyone who has enough Python 2 code to want to annotate it probably should be thinking about porting to Python 3 anyway. :-)
In the proposal, str has a position similar to the one that Any has in the type system as a whole -- i.e. assuming we have three variables b, s, t, with respective types bytes, str, (typing.)Text, then: b is compatible with s, s is compatible with b and t, t is compatible with s, but b and t are not compatible with each other. IOW the relationships between the three types are not expressible using subtyping relationships only. (It's actually a little more complicated, I'll spell out the actual rules I'm proposing below.)
Before we get to that, I'd like to discuss the use cases for the proposal. In straddling code we often have the problem that some Python 2 code is vague about whether it works on bytes or text or both. The corresponding Python 3 code may work only on bytes, or only on Text, or on both as long as they are used consistently (i.e. AnyStr), or possibly on Union[bytes, Text]. To find such cases, maybe we could just type-check the code twice, once in Python 2 mode and once in Python 3 mode. If it type-checks cleanly in both modes, it should run correctly in both Python versions too (insofar as type-checking cleanly can ever say anything about running without errors :-).
However, when we have a large code base, it is usually a struggle to make it type-check cleanly even in one mode, and typically we start with Python 2. So if we have code that runs correctly using Python 2 and type-checks cleanly in Python 2 mode, and we want to port it to Python 3, requiring it to type-check cleanly in Python 3 mode is setting the bar very high (as high as expecting it to run correctly using Python 3).
Therefore I am proposing a gradual approach. Similar to the way we start by type-checking an untyped program (which by definition should type-check cleanly, since all types are Any -- even though in practice there are some holes in that theory), I propose to start with a Python 2 program that uses str for all string types, and type-checks cleanly that way, and gradually change the program to replace each occurrence of str with either bytes or Text (or one of the rarer alternatives like AnyStr or Union[str, Text]). That way we can gradually tighten the type annotations, keeping the code type-check clean as we go.
Just like, when I define a function with def f(x: Any), I can call f(1), f('') and f([0]) and it's all the same to the type checker, and in f's body I can use x+1, x() or x[0], the idea here is that a function defined with def g(s: str) can be called as g(''), g(b'') or g(u''), and in g's body I can use s+b'xxx', s+'yyy' or s+u'zzz'.
The actual details are a bit subtle. I'm proposing (in builtins; recall that this is for Python 2 only):

- class bytes with (mostly) the methods currently present on str, with arguments of type bytes and returning bytes (as appropriate).
- class str(bytes) with overloaded methods that return str if the other argument is a str, returning bytes for bytes (more or less).
- class unicode unchanged from its current definition, keeping typing.Text as a pure alias for it.

The subclassing relationship between bytes and str makes str acceptable where bytes is required. In mypy we can add a "promotion" from bytes to str to enable compatibility in the other direction. Mypy (in Python 2 mode) has an existing promotion from str to unicode that accepts str where unicode is required. I don't actually propose to make unicode acceptable where str is required (this is a deviation from the "str is like Any" idea). Because promotions are not transitive (unlike subclassing), bytes is not acceptable where unicode is required, nor the other way around.
There is still a lot more to explain. I want to show in detail what happens in various cases, and why I think that is right. I need to explain the concept of "text in spirit" to motivate why I am okay with the difference between these rules and the actual workings of Python 2. I want to go over some examples involving container types (since that's where the "AsciiBytes" proposal went astray). And I need to give some guidelines for stub authors and changes to existing stubs. (E.g. I think that Python 2 getattr() may have to be changed to accept unicode.)
[But that will have to wait until tomorrow.]
t [Text] is compatible with s [str]
This contradicts this part of the proposal:
I don't actually propose to make unicode acceptable where str is required
Also, the example with def g(s: str) suggests that it can be called as g(u''), with a unicode argument. This should be clarified and made consistent across the proposal, as otherwise things get confusing.
Because promotions are not transitive (unlike subclassing)
Mypy actually considers the promotions int -> float and float -> complex transitive, and int can be promoted to complex. We could change the language to something like "these promotions are not transitive" or we could perhaps treat the int -> complex promotion as a separate promotion.
Other notes:
- str methods would return unicode if the other argument is unicode. Currently this is left unspecified. It could be useful to have a table of the result types of s1 + s2 for all combinations of str, bytes and unicode (9 cases).
- AnyStr would have to range over str, bytes and unicode. This means that we may want to give different meanings to IO[str] and IO[bytes], for example.
- List[Any] is compatible with List[int] and vice versa in mypy (though PEP 484/483 seems to be silent on this), but should List[str] be compatible with List[bytes], and vice versa? I'd argue that List[str] and List[bytes] should be incompatible, similar to how List[int] and List[float] are incompatible, but I don't have a strong opinion on this.

This contradicts with this part of the proposal
That's why I wrote "It's actually a little more complicated". I am having a hard time summarizing the proposal briefly and also writing it up in detail without contradictions between the two. In case of conflict the detailed version should win and the summary should be seen as a hint at most. Maybe we'll have to use more vagueness in the summary to avoid confusing experts who know the terminology.
the example with def g(s: str) suggests that it can be called as g(u'')
More imprecision in the summary. :-( It really can't, unless g() is implemented in C in a certain way, e.g. getattr(x, u"foo"). But for a Python function this is wrong. Actually, for a Python function, the other way around is also wrong. But nevertheless the promotion allows it. Just like the promotion from int to float is technically wrong in Python 2, as shown here:
def f(a):
    # type: (float) -> float
    return a/2
assert f(3) == 1.5  # Fails, it returns 1
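The failure above depends on Python 2's division semantics, so it does not reproduce on Python 3 as written. A minimal sketch that emulates it under Python 3, assuming a hypothetical helper py2_div that mimics the Python 2 `/` operator (without `from __future__ import division`):

```python
def py2_div(a, b):
    """Hypothetical helper: mimic Python 2's `/` operator."""
    # In Python 2, int / int is floor division; anything else is true division.
    if isinstance(a, int) and isinstance(b, int):
        return a // b
    return a / b


def f(a):
    # type: (float) -> float
    return py2_div(a, 2)


print(f(3))    # 1   (the int was accepted where float was declared)
print(f(3.0))  # 1.5
```

The type checker accepts f(3) because of the int -> float promotion, yet the result differs from what a real float argument would give.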
I will try to spec out the true compatibility as a bunch of tables.
str methods would return unicode if the other argument is unicode
Yes. There are already some overloads like that. The bigger difference will be that these overloads won't exist for bytes+unicode.
AnyStr would have to range over str, bytes and unicode. This means that we may want to give different meanings to IO[str] and IO[bytes], for example.
Yes. In fact IO[unicode] would only be obtainable by calling io.open().
should List[str] be compatible with List[bytes], and vice versa?
I think not (so we agree here). This will lead to some of the same issues as I ran into when trying to implement the AsciiBytes idea, but the issues will be much less common.
[In the next installment I will try to construct the tables of compatibilities. I will also talk about the concept of "text in spirit".]
(This is still pretty messy. But I promised I would explain the concept.)
I'll sometimes say that some variable in Python 2 is "text in spirit". For example, in getattr(x, name), name is "text in spirit". By this I mean two things: first, that in Python 3 the name argument to getattr() has type str, not bytes. Second, that even in Python 2, the name is an identifier, and even though you can write getattr(x, '\xff\x01'), that would be useless.
Note that text encoded as bytes is not "text in spirit". The requirement is that the corresponding Python 3 API uses str, and the Python 2 API supports bytes or unicode, though not necessarily all bytes or all unicode -- e.g. getattr() only accepts a unicode name if it contains only ASCII characters, even though it doesn't make that requirement when the argument is str.
Basically the point of "text in spirit" is to make the argument that an API should not use bytes even though it may accept non-ASCII str instances. But I have to do more exploration before I decide how important this concept is.
Here's another table that could be useful -- if we define def f(s: s1) -> None: ..., is a call with an argument of type s2 valid, when s1 and s2 range over str, bytes and unicode?
[UPDATE: made function calls primary, per Jukka's suggestion below]
Let's start by stating the compatibility between expressions of types bytes, str, Text and functions with arguments of those types. Each row corresponds to a declared argument type; each column corresponds to the type of an expression passed in for that argument.
Argument type | xb: bytes | xs: str | xt: Text |
---|---|---|---|
arg_b: bytes | Yes (same) | Yes (str <: bytes) | No |
arg_s: str | Yes (promotion) | Yes (same) | ??? |
arg_t: Text | No | Yes (promotion) | Yes (same) |
The above table also describes compatibility of expressions with variables (assuming the type checker, like mypy currently, doesn't just change the type of the variable). Note that I'm not decided yet whether to allow passing a Text value to a str argument, but I'm inclined to put "No" there, even though that breaks the illusion of "str as the Any of string types" (IOW gradual byting :-).
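The table can be derived from one subclass edge plus two one-step promotions. The following is only a toy model of those rules, not mypy's implementation; the names SUBCLASS_OF, PROMOTIONS, is_compatible and the text_to_str flag (which stands for the undecided "???" cell) are made up for illustration.

```python
# Toy model of the proposed Python 2 rules. Types are plain strings here.
SUBCLASS_OF = {"str": "bytes"}                      # str <: bytes
PROMOTIONS = {("bytes", "str"), ("str", "Text")}    # (from, to), never chained


def is_compatible(expr_type, arg_type, text_to_str=False):
    """Return True if an expression of expr_type may be passed to arg_type."""
    if expr_type == arg_type:
        return True                                 # "Yes (same)"
    if SUBCLASS_OF.get(expr_type) == arg_type:
        return True                                 # "Yes (str <: bytes)"
    if (expr_type, arg_type) in PROMOTIONS:
        return True                                 # "Yes (promotion)"
    # The "???" cell: Text passed to a str argument, off by default.
    return text_to_str and (expr_type, arg_type) == ("Text", "str")


for arg in ("bytes", "str", "Text"):
    row = [is_compatible(expr, arg) for expr in ("bytes", "str", "Text")]
    print(arg, row)
```

Because promotions are looked up exactly once and never chained, bytes -> str -> Text does not make bytes acceptable where Text is required.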
Next let's describe the return type for expressions of the form x + y where x and y can each be of type bytes, str, or Text.
x | yb: bytes | ys: str | yt: Text |
---|---|---|---|
xb: bytes | bytes | bytes | ERROR |
xs: str | bytes | str | Text |
xt: Text | ERROR | Text | Text |
Note that this table is more regular and I'm pretty confident about it.
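The regularity of this table can be stated as a small rule: mixing bytes with Text is an error, otherwise Text wins over everything and bytes wins over str. A sketch of that rule, with add_result_type as a hypothetical name and types modelled as plain strings:

```python
STRING_TYPES = ("bytes", "str", "Text")


def add_result_type(x, y):
    """Result type of x + y under the table above (illustrative model only)."""
    if x not in STRING_TYPES or y not in STRING_TYPES:
        raise ValueError("unknown string type")
    operands = {x, y}
    if operands == {"bytes", "Text"}:
        raise TypeError("cannot add bytes and Text")   # the ERROR cells
    if "Text" in operands:
        return "Text"
    if "bytes" in operands:
        return "bytes"
    return "str"                                        # only str + str


print(add_result_type("str", "bytes"))  # bytes
print(add_result_type("str", "Text"))   # Text
```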
encode() and decode()
For bizarre reasons, in Python 2 both str and unicode support both encode() and decode(). This makes no sense, e.g. u'abc'.decode('utf8') is equivalent to u'abc'.encode('ascii').decode('utf8'), and 'abc'.encode('utf8') really means 'abc'.decode('ascii').encode('utf8').
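The two implicit round trips can be spelled out explicitly. The sketch below runs on Python 3, where the conversions must be written by hand; py2_unicode_decode, py2_str_encode and fails are hypothetical helper names, and the implicit step is assumed to use ASCII (Python 2's default encoding).

```python
def py2_unicode_decode(text, encoding):
    """Model of Python 2's unicode.decode(): implicit ASCII encode first."""
    return text.encode('ascii').decode(encoding)


def py2_str_encode(data, encoding):
    """Model of Python 2's str.encode(): implicit ASCII decode first."""
    return data.decode('ascii').encode(encoding)


def fails(func, *args):
    """Return True if the call raises a UnicodeError subclass."""
    try:
        func(*args)
    except UnicodeError:
        return True
    return False


print(py2_unicode_decode(u'abc', 'utf8'))            # abc
print(py2_str_encode(b'abc', 'utf8'))                # b'abc'
print(fails(py2_unicode_decode, u'\u1234', 'utf8'))  # True
print(fails(py2_str_encode, b'\xff', 'utf8'))        # True
```

Both helpers work for pure-ASCII input and fail as soon as the hidden ASCII step meets a non-ASCII character or byte.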
I propose to rationalize this to the extent possible, as follows:
This would mean complete removal of unicode.decode() from the stubs, since it basically always means some terrible misunderstanding happened. For variables declared as bytes, it would likewise remove the encode() method, whose use would point to a similar (but opposite) misunderstanding. For str we remain generous (since using str means the code probably hasn't received enough attention from the straddling police).
__str__() and __repr__()
The return types of bytes.__str__() and bytes.__repr__() are still str, because that's how they are constrained by object. (FWIW these are examples of methods returning "text in spirit" strings.)
A reason we might prefer to use a call instead of an assignment as a basis for the table is that some type checkers like to infer a new type from assignment, and thus arbitrary assignments are considered correct -- they just redefine the type of a variable.
OK, edited the text.
If we don't do the Text -> str promotion (which seems reasonable), then the original "gradual byting" story may need tweaking, as the first step would be to annotate with str and Text only (not just str, because unicode literals wouldn't be compatible with it), and the gradual byting migration would migrate some str types to bytes (or maybe unicode). Also, the gradual migration may involve changing some '' literals to b'' literals.
Example first phase annotation where we'd need Text:
def utf8_len(x: Text) -> int:
    return len(x.encode('utf8'))
utf8_len(u'\u1234')
Here we'd need to use Text unless we include the Text -> str promotion.
@gvanrossum @JukkaL I would like to join your discussion.
Here is a summary of the current ASCII types and gradual byting proposals as I understand them.
(This piece is here to make sure we’re solving the same problem.)
The changes in text and binary data handling in Python 3 are one of the major reasons people cannot run their Python 2 code on Python 3. These changes were:

- str to unicode and vice versa conversions using the ASCII encoding

The most viable approach to porting to Python 3 is via porting to Python 2+3.
Therefore we need a way to make this transition in stages: from a Python 2 program with implicit conversions, to a Python 2 program with fewer implicit conversions, to a Python 2+3 program that runs on both versions but still contains some implicit conversions, and finally to a Python 3 program.
The idea is to introduce two new types, one new type compatibility rule and special type inference rules for binary and text literals.
The new types are:
class typing.ASCIIBytes(bytes): ...
class typing.ASCIIText(typing.Text): ...
The ASCIIBytes type is compatible with ASCIIText for Python 2.
If a text or binary literal contains only ASCII characters then type checkers should infer the corresponding ASCII types instead of regular text / binary types.
- List[ASCIIBytes] is not compatible with List[str], which causes lots of errors according to Guido’s experiments. As a workaround, we might make ASCIIBytes compatible with any TypeVar that is constrained by ASCIIText or Text despite its variance. The cost of this workaround is more false negatives for unsafe modifications of invariant collections of strings.
- str.upper().

A proposal by Guido described here.
The idea is to distinguish the str in type hints from bytes and Text (unicode in Python 2). Then there are new type compatibility rules:

- bytes is compatible with str
- str is compatible with bytes
- Text is maybe compatible with str (??? undecided yet)
- str is compatible with Text

- How the issue with List[bytes] being not a subtype of List[str] is resolved here? (see the similar issue in the ASCII types proposal)
- If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?
- What should be the type of functions like getattr() that do implicit text to bytes conversion?

What do you think of the workaround for invariant collections of ASCIIBytes?
Could you please answer my questions about the gradual byting proposal?
Writing my answer...
Agreed, although your way of describing the changes makes it sound like Python 3 is a step back from Python 2 in this respect; I believe the opposite (PY3 is better than PY2).
The key part is that some things around strings changed and we want to provide a gradual way to convert PY2 via straddling (2+3) to (eventually) PY3.
The proposal is mostly concerned with how to write string types for the straddling case (in a way that works in PY2 and is idiomatic PY3).
I'm not sure you characterized the proposal the same way as I heard it (as retold by @ddfisher), but you're pretty close.
IIRC the proposal actually made ASCIIText compatible with all bytes, and ASCIIBytes compatible with all Text, in Python 2. So e.g. getattr(obj, name) requires name to be str in Python 3, but in Python 2 it accepts all bytes (as an alias for str) and ASCIIText, but throws UnicodeEncodeError for Text instances containing non-ASCII characters. "Morally" I think it's text, not bytes, since in Python 3 it only accepts text strings, and the best annotation for straddling code is name: str.
As an example that goes in the other direction, in Python 2, s.encode('utf8') works for all Text and for ASCIIBytes, but throws UnicodeDecodeError if s is a str (bytes) object containing non-ASCII bytes. In Python 3 only text strings have this method. Again s is "morally" text, but we have to type it differently (I'd propose s: Text).
I'd like to point out that in both these examples (taken from the behavior of builtins, there are many more like it) declaring the argument or variable as ASCIIBytes or ASCIIText is inappropriate, since while ASCII characters/bytes are accepted in either persuasion (text or bytes), the full range of one or the other base types (bytes or Text) is still accepted.
Another problem I have with the ASCII proposals is that they emphasize literals too much. Yes, in examples we often like to write things like getattr(obj, 'xxx') and then it works out nicely that the argument is an ASCIIBytes. But in real code it's much more likely that you're computing a name from some other information and then passing it to getattr(obj, name). That computed name is much more likely to have the (inferred or explicitly declared) type str, or if it came from a unicode-aware computation it could have the type Text. In the latter case, inferring (i.e., proving) through such a computation that the value is in fact ASCIIText is usually too hard.
(NOTE: The nickname "gradual byting" may actually be a bad pun, as in the end the compatibility rules are more complicated than those for Any in gradual typing.)
How the issue with List[bytes] being not a subtype of List[str] is resolved here?
(You probably meant supertype, since I propose to make bytes a supertype of str in PY2.)
It's not completely resolved, but I think it's more reasonable to ask people to distinguish between bytes and str in their annotations (and thinking!) than to start introducing the ASCII types (which have no use in PY3 code).
The main reason I find this less of a problem than the corresponding problem with List[str] vs. List[ASCIIBytes] is that we actually have bytes literals in PY2. (It's a little tricky to keep track of them in the parser, but not impossible, and I plan to do it.) So if you meant a list of bytes, use [b'xx', b'yy', b'zz']. Also, making the distinction carefully is useful going forward to pure PY3 code, while distinguishing between ASCII and non-ASCII is artificial (only PY2 cares about it).
If Text is not compatible with str then str is no longer a safe starting point in the Python 3 migration process. Won’t it confuse people?
This situation is inherently confusing, because in PY2, sometimes str+unicode works, and sometimes it doesn't. As long as you don't know, you may be better off with Any.
What should be the type of functions like getattr() that do implicit text to bytes conversion?
I think we have two choices: str or Text. Both are "morally" text but in PY2, str has preference for 8-bit characters while Text prefers Unicode characters. Since this is a C function that internally works with 8-bit characters (in PY2) I think it should be typed as str and from this I concede that making Text compatible with str may actually be the right thing to do.
So in the end maybe "gradual byting" is correct. Ideally for straddling code, you should run mypy twice, once with --py2 and once without. The recommendation would then be to strive for the following, in straddling code (or code striving to become straddling):

- bytes where PY2 has strings that must be bytes in PY3
- str where PY3 has strings and it's complicated in PY2
- Text where PY2 uses Unicode

Mypy in --py2 mode would have to learn that bytes <-> str and str <-> Text but not bytes <-> Text, and it would have to assign the correct type based on the form of literal:

- b'x' -> bytes
- 'x' -> str (unless from __future__ import unicode_literals; then Text)
- u'x' -> Text (an alias for unicode)

BTW I find from __future__ import unicode_literals an anti-pattern that does more harm than good, and I now recommend against it.
The full gradual byting idea (bytes <-> str <-> Text, but not bytes <-> Text) sounds logical and easy to explain. I particularly like the following about it:

- str type hints in your Python 2 code base, you'll get no false positives
- bytes and Text and get better type checking, while keeping str in places where it's complicated

(The type checking rules for generic types like List[str / bytes / Text] are still a bit unclear. I guess the idea is to pretend that List[str] is compatible with List[bytes] and List[Text] and vice versa.)
What I don't like is that gradual byting gives up on checking for implicit ASCII conversions. Basically we'll be able to check the two distinctly marked bytes and Text subsets of the program and won't be able to tell anything about implicit ASCII conversions since str is compatible with both bytes and Text.
But it seems that it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2 programs. @gvanrossum @JukkaL Do you agree with it?
I would like to experiment with the idea of gradual byting in PyCharm for the next few days to see if there are any concerns with it. I'll report about my findings later this week.
type checking rules for generic types like List[str / bytes / Text] are still a bit unclear
My current inclination is not to do anything special about these, because List is invariant. Although if str were really analogous to Any here it would indeed work. I think experiments will have to decide whether it's needed.
it's more important to give developers tools for gradually widening the subsets of their PY2 programs that handle strictly binary or strictly textual data in order to make them more PY2+3 compatible rather than helping to catch all the UnicodeDecodeErrors and UnicodeEncodeErrors in PY2
Yes, that's the most important use case we have for mypy at Dropbox -- we want our code to become more Python 3 ready.
Mixing bytes and Text is a type error. But it's harder to argue that mixing ASCII and non-ASCII is a type error. I don't like to treat strings containing only ASCII characters as a subtype of bytes or Text, because dynamic sources of characters (other than literals in the source code) don't typically tell you whether they can ever return non-ASCII characters.
As an analogy, let's say we wanted to treat non-negative integers as a subtype of int. If we define a "type" as a set of values, this is certainly a reasonable thought, and non-negative integers are closed under addition and multiplication (just like ASCII strings are closed under concatenation and slicing). But there are few input functions that return only non-negative integers -- int() in particular can definitely return a negative int. So it's hard to enforce the non-negativity of integers being processed by a program without explicit range checks or complicated proofs that a certain algorithm preserves that property.
I feel it's similar for the ASCII-ness of bytes and Text -- a function that reads a string from a file or socket (or from e.g. os.environ) has no particular reason to believe that the file will only contain ASCII characters.
I'm looking forward to the outcome of your experiments. In the meantime I will also try to look into a more complete set of changes to typeshed and mypy.
I'm done with my experiments with the idea of gradual byting.
I've created a proof-of-concept implementation of gradual byting in PyCharm by modifying __builtin__.pyi from Typeshed and tweaking our type inference engine and type checker. Then I tried to port some real-life code from PY2 to PY2+3 using the modified IDE.
What I've learned from my experiment with the original gradual byting proposal is that the type checker doesn't help in porting PY2 code to PY2+3. I mean if you already use Text and bytes alongside str then yes, it helps you to some extent.
But overall you get no guidance on how to proceed with porting your code. The type checker doesn't tell you if there are any variables or functions with no type hints. It doesn't promote the use of Text and bytes instead of str. It doesn't catch most of the text/binary data compatibility errors.
During my further experiments I came up with the following guided process for making PY2 text/binary data handling more PY3-like. The original gradual byting is a step in this process.
- Text <-> str <-> bytes, but Text and bytes are not compatible
- u'foo' -> Text, 'foo' -> str, b'foo' -> bytes
- typing.NativeStr as an alias for str that says explicitly that it's a native string
- Any for declarations
- str in type hints (use Text / bytes / NativeStr instead)
- str checks (disables gradual byting promotions, use cast() if you're sure)
- Any for expressions
- str literals

(Note: It includes some pictures, please see the comment on the GitHub page).
Hypothesis: Most of text/binary data in PY2 programs can be converted to either Unicode data or 8-bit data in PY2+3. Native str strings (the ones that are strictly 8-bit in PY2 and Unicode in PY3, i.e. mixed in PY2+3) are the minority. Handling native strings causes many problems while porting from PY2 to PY2+3. Type checkers should make you aware of these problems and provide some help in reducing the amount of native strings.
In PY3 you have a clear separation of text and binary data: they are not compatible with each other. In PY2 things are complicated because of the implicit conversion between text and binary data using the ASCII encoding.
If you want to make your PY2 code PY2+3 compatible (straddling), you have to make it more PY3-like in respect of text/binary data separation. A good way to proceed is to start putting type hints into your code so that a type checker would be able to check your code for correctness.
The steps of the proposed approach to porting are described below. You may start with no or some type hints in your code. You may proceed module-by-module or modify the whole program at once.
Use the "Warn about implicit Any for declarations" type checker option to get notified of all the places where type hints are missing.
For text and binary data you have the following options:
- Text when data is Unicode in PY2 and PY3
- bytes when data is 8-bit string in PY2 and PY3
- AnyStr when your code works with both Unicode and 8-bit strings in PY2 and PY3 as long as all the function arguments are of the same type (or use Union[Text, bytes] if it doesn't matter)
- NativeStr when data is Unicode in PY3 and 8-bit string in PY2 (see the footnotes about NativeStr)
- str otherwise (if things are complicated)

str entries in type hints in favor of Text/bytes/NativeStr/etc.
Use the "Warn about str in type hints" type checker option to get notified of all the remaining occurrences of str in type hints.
Go through all the places you marked with str as complicated and figure out which of the text/binary types is actually appropriate here.
The purpose of this step is to make the native string subset as small as possible since a) native string operations are the hardest to port; b) your code will look more PY3-like with mostly Text and bytes subsets.
str checking
Use the "Strict str checks" type checker option to enable strict separation of Text, str, and bytes data in your code.
The remaining type checker warnings at this step show the trickiest parts of your text/binary data handling code that have to be carefully written to become PY2 and PY3 compatible. It may involve:

- Text, NativeStr, and bytes again
- if PY2 conditions
- typing.cast(<type>, <value>) if you're sure what you are doing and you need a way to make the type checker happy

These type checker options might be helpful during PY2 to PY2+3 porting if your code is not 100% type hinted:

- Any for expressions as well
- str literals

I found these extra options very useful when you have more modules to port or you use third-party libraries with no type hints / stubs.
PyCharm doesn't support the stubs from Typeshed yet, it's still a work in progress.
The idea of an option for warnings about declarations with no type hints comes from the --noImplicitAny option of the TypeScript compiler. It is used heavily in the TypeScript community for testing the TypeScript stubs of untyped JavaScript libraries.
typing.NativeStr is a new type needed to help people get rid of ambiguous str (is it really a native string or is it a marker that things are complicated?). It could be an alias to str. There should be an option to warn about any str and unicode entries in type hints.
@gvanrossum @JukkaL I'm looking forward to your feedback.
Sorry, I'm tied up at the core python sprint this week. I hope to have time next week!
@vlasovskikh Thanks for the detailed write-up! Your approach sounds mostly reasonable. If @gvanrossum agrees, hopefully we can experiment with it and mypy.
A few things I'm not sure about:
1) The str / NativeStr distinction
This could be useful during migration, but I'm not sure if users will find it easy to understand. An alternative would be to propose that users define a similar type alias by themselves instead of including it in typing.
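As a rough illustration of the user-defined alternative: a project could declare the alias itself. NativeStr here is a project-local name (typing does not provide it), and attr_name is a made-up example function.

```python
# Hypothetical project-local alias; `typing` has no NativeStr.
# It marks values that are 8-bit str on PY2 and Unicode str on PY3.
NativeStr = str


def attr_name(prefix, index):
    # type: (NativeStr, int) -> NativeStr
    """Build an attribute name, which is a native string on both versions."""
    return prefix + str(index)


print(attr_name("field", 3))  # field3
```

To a type checker such an alias is indistinguishable from str, so it documents intent without changing what is checked.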
2) Strict str checking
The implications of a mode that requires casts between str and other string types for Python 2 are still unclear to me. Stubs would potentially require things like Union[str, Text] in places for things to work seamlessly, without needing seemingly redundant casts from str to Text/bytes when interacting with library modules. I'm not sure how much of a problem this would be. Also, we'd need a separate stub for the str class in this mode.
3) AnyStr
Would AnyStr range over str, bytes and Text? Functions that use AnyStr would likely now be a little tricky to write in some cases. Consider this function:
def f(x: AnyStr) -> AnyStr:
    return x + 'a'
This would be fine in Python 2 mode but it wouldn't work in Python 3 or strict str checking mode. Here's a straightforward straddling implementation that actually wouldn't work, since given a str argument, the return type would be bytes, not str in Python 2 mode:
def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    else:
        return x + b'a'
This may have to be written like this, which seems a bit excessive but perhaps still reasonable:
def f(x: AnyStr) -> AnyStr:
    if isinstance(x, Text):
        return x + u'a'
    elif isinstance(x, str):
        return x + 'a'
    else:
        return x + b'a'
It seems that the final example would also work in the strict str checking mode.
@JukkaL Replying to your points:
1) The str
/ NativeStr
distinction
I also thought about not introducing typing.NativeStr
and suggesting the users to create their own aliases.
An advantage of having typing.NativeStr
is that if we tell people to get rid of str
(since this type doesn't provide clear distinction between text and binary data) we better suggest some easy to use alternatives. The "Add your own alias to str
and use it instead if you want to suppress type checker warnings" sounds less persuading than "Get rid of all str
and use typing.NativeStr
if you really need native strings".
On the other hand, when a person enables the strict str
checking, it doesn't matter if they had typing.NativeStr
or their custom MyStr
or they still have str
. It will be a type checking error in this mode anyway.
So I'm not sure about this one. It just looked to me as a convenient step in the porting process between having no distinction between bytes <-> str <-> Text
and enforcing strict str
checking.
2) Strict str checking
The point is that by default str checking is not strict, i.e. gradual byting is the default. So any questions about problems with strict str checking are really about this specific mode, used when a person is trying to make their code more PY2+3 compatible.
I believe that the standard library (or in fact any code) should not have type hints that mix str with any other types (e.g. Union[str, Text] shouldn't be used). Could you please give any examples from the standard library where such type hints could be useful / necessary?
The strict str checking mode helps to find and fix those places where developers haven't decided what data they really handle in their code: Text or bytes or both (conditionally) or (on rare occasions) native strings. Therefore the strict str checking mode shouldn't let these places go unnoticed. We have the default gradual byting mode for that.
Yes, I imply that Text, bytes, and str all have separate stubs as in the original gradual byting proposal.
3) AnyStr
I'm not sure about this one.
In the default gradual byting mode it doesn't really matter if it ranges over str / Text / bytes or just over Text / bytes.
As for the strict str checking mode it looks like AnyStr could still range just over Text and bytes, but maybe there should be a special type checking rule that str is still compatible with AnyStr as AnyStr covers all the variants.
I believe that the standard library (or in fact any code) should not have type hints that mix str with any other types (e.g. Union[str, Text] shouldn't be used). Could you please give any examples from the standard library where such type hints could be useful / necessary?
I'm thinking about cases where the code is using NativeStr. I'm not sure what would be the most common use cases for NativeStr, though. getattr might be one, but the latest plan is to give the second argument the type Text, I think, so perhaps using Text and u'x' literals with getattr and friends is okay. However, idiomatic code would use str literals, so here support for NativeStr seems reasonable.
Yes, I imply that Text, bytes, and str all have separate stubs as in the original gradual byting proposal.
But also the stub for str would likely be different when using strict str checking, so that u'foo' + 'bar' would be disallowed in the latter mode but not by default. When using normal checking str.__add__ could accept Text arguments, but when using strict str checking, it would only accept str arguments.
As for the strict str checking mode it looks like AnyStr could still range just over Text and bytes, but maybe there should be a special type checking rule that str is still compatible with AnyStr as AnyStr covers all the variants.
Would this mean that AnyStr would range over Text and bytes when type checking generic functions where AnyStr is bound, but when calling such a generic function, AnyStr could be substituted with str as well? If yes, that sounds reasonable.
Example:
    def f(x: AnyStr) -> AnyStr:  # OK, here AnyStr ranges over bytes and Text
        if isinstance(x, Text):
            return x + u'a'
        else:
            return x + b'a'
    y = f('x')  # OK, here str is also valid. The type of y would be str.
@JukkaL and I chatted "offline" about this and I'm writing up a few thoughts.
Thanks for your comments. I will reply next week as soon as I'm back from PyCon JP.
@gvanrossum Yes, your idea of having a separate type String for "native" strings which doesn't mix with others is very similar to my proposal. And in PY2 AnyStr may range over it as well.
I'm a bit worried about the name String though. It may be unclear to those who are new to this problem why we keep adding new things that are synonyms of str: AnyStr, Text, and now String. NativeString says at least a bit more about the purpose of this type. Also this terminology of "native" strings is already used, for example, in PEP 414: Explicit Unicode Literal for Python 3.3. This PEP criticizes some parts of PEP 3333: WSGI 1.0.1 related to using Latin-1 in "native" strings, but suggests the useful distinction of PY2+3 compatible text strings / "native" strings / binary data.
Regarding the mentioned flags / options, they could be PyCharm-specific. I don't propose to make them a part of the specification.
To sum up the updated proposal:
- Text
- String (or NativeString, or something else?)
- bytes
- str in PY2 compatible with text strings and native strings and binary data
- class str in typeshed to distinguish it from the others
- AnyStr range over text strings / native strings / binary data
Those who want to port their code to PY2+3 should follow these steps:
- unicode and str as not PY2+3 compatible
What do you think about it? If this idea is concrete enough for experimenting maybe I should write a separate PEP about it or update PEP 484 in a pull request?
OK, I found more references to "native string", e.g. http://python-future.org/what_else.html#native-string-type so let's use NativeString. (It always means "str" there.) We also now have typed_ast 0.6.0 which tells us the difference between b'x' and 'x' in PY2 mode.
I guess the next step is an experiment in mypy and typeshed. In typeshed, for PY2 we need three separate types: bytes, NativeString, Text==unicode, each incompatible with both others. (But the behavior of bytes and NativeString should be the same.) And in mypy, for PY2 str is a magical type that somehow is compatible with bytes, NativeString and Text==unicode. I'm on board with this experiment and I think it's what you propose for PY2. I don't know exactly how to code up the "str is compatible with all three" part in mypy, but I think I can figure it out.
But I'm hesitant to make NativeString a separate type in PY3 -- I think it's simpler to just make it an alias for Text==str, and I don't see much downside. My ideal is not to mess with PY3 at all (except for adding an alias).
More details for PY2 mode: the types of literals should determine the type, i.e. b'x' -> bytes; 'x' -> NativeString or Text==unicode depending on from __future__ import unicode_literals; u'x' -> Text==unicode.
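The literal rule above can be sketched as a tiny lookup. This is only an illustration of the proposed mapping, not code from mypy or typed_ast; the function name py2_literal_type and its parameters are made up for the example, and raw-string prefixes are deliberately left out.

```python
def py2_literal_type(prefix, unicode_literals=False):
    """Return the proposed PY2-mode type name for a string literal prefix.

    prefix is '', 'b' or 'u' (hypothetical helper, for illustration only).
    unicode_literals says whether the module has
    `from __future__ import unicode_literals`.
    """
    prefix = prefix.lower()
    if prefix == 'b':
        return 'bytes'
    if prefix == 'u':
        return 'Text'
    if prefix == '':
        # An unprefixed literal is text only under unicode_literals.
        return 'Text' if unicode_literals else 'NativeString'
    raise ValueError('unsupported literal prefix: %r' % (prefix,))


for p in ('b', '', 'u'):
    print(repr(p), py2_literal_type(p), py2_literal_type(p, unicode_literals=True))
```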
Now, conversions. Should these always be done using cast()? That's simplest to implement, but painful to write. Maybe instead of cast(NativeString, x) we should allow NativeString(x)? I guess we'll consider that as a bonus feature if these casts are common.
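To make the cast() option concrete, here is a small sketch. NativeString is not something typing provides; it is defined here as a plain alias for str purely for illustration (which matches the idea of making it an alias in PY3), and header_name is a made-up function.

```python
from typing import Text, cast

# Hypothetical alias, defined locally for this example only.
NativeString = str


def header_name(raw: Text) -> NativeString:
    # cast() only informs the type checker; at runtime it returns its
    # argument unchanged, so no encoding or decoding happens here.
    return cast(NativeString, raw)


name = u'Content-Type'
print(header_name(name) is name)
```

Since cast() does nothing at runtime, the "painful to write" part is purely syntactic, which is why a NativeString(x) shorthand is tempting.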
Finally, str() used as a function. This is all over the place and we can't tell users to change this to NativeString() or Text() -- ideally all these string types should be only used in annotations. But this would go against your recommendation: "Remove all usages of unicode and str as not PY2+3 compatible".
I also don't want to have to modify many stubs to use NativeString instead of str -- there is a trend where more and more stubs are shared between PY2 and PY3 (the 2and3 directories in typeshed) and I don't want to have to change those. So I'm still on the fence about how well all this will work out. But I'll come up with a working branch of mypy+typeshed that can do this (possibly even a mypy flag to turn this on or off).
What about a str class in typeshed for Python 2? I think that it needs to be separate from the other 3.
Perhaps the type of str(x) could be str in Python 2? It would be consistent with how everything else works, and the implicit coercions would make it Just Work, hopefully (though with some loss of type safety). For the other types, users can perhaps use bytes(x), Text(x) or NativeString(x) (okay, the last one is kind of ugly).
Also, if NativeString is needed a lot, I wonder if it should be called NativeStr instead? It maps to str at runtime after all, and spelling it using the short form would make that a little more explicit.
I'm fine with NativeStr. I suppose we could have a str type too, with the same definition as NativeStr, and just have mypy itself take care of the aliasing. (Also don't forget basestring! I guess all string types should inherit from it, and it should continue not to have any methods of its own.)
Though there are a lot of places where mypy currently hardcodes builtins.str, e.g. the type of dict(x=1) is Dict[str, int] where the str is hardcoded. Perhaps related, when you call a function using f(**d), the type of the argument d must be a subtype of dict[str, Any].
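The runtime behaviour behind those two hardcoded cases can be checked directly. This is a small illustration run on Python 3 (the function f is just a placeholder); on Python 2 the keys produced by dict(x=1) are likewise native str objects.

```python
def f(**kwargs):
    # Placeholder function that simply returns its keyword arguments.
    return kwargs


d = dict(x=1)
key_types = {type(k) for k in d}   # keyword names become native str keys
ok = f(**d)                        # str keys are accepted by ** unpacking

try:
    f(**{1: 'one'})                # non-string keys are rejected at call time
    rejected = False
except TypeError:
    rejected = True

print(key_types, ok, rejected)
```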
Also, I'm starting to look at the necessary typeshed changes here, and I'm getting a bit worried about the practical implications here. The existing str and unicode types, as well as bytearray, actually have lots of cross-references. Some of these are expected, e.g. unicode.encode() returns str and str.decode() returns unicode. I guess in the new system, unicode.encode() should return bytes and everybody's decode() should return unicode.
But there are lots of weird cases too, e.g. encode() and decode() take an encoding name, which itself is currently declared as unicode -- it can be str or unicode, and the implicit promotion makes unicode the most convenient type to use (not AnyStr, nor Union[str, unicode]).
But there are more weird cases! str.count() takes a unicode arg, and so does unicode.count(), and both implicitly allow str arguments too, because of the str->unicode promotion. In fact this is pretty common, also found for endswith(), find(), index(), lstrip(), partition(), rfind(), rindex(), rpartition(), rsplit(), split(), startswith(), strip().
Finally str.__lt__ and __le__, __gt__, __ge__ take a unicode argument. (But __eq__ and __ne__ take an object.)
There's also some explicit support for bytearray, where partition() and rpartition() are overloaded three ways to take str, unicode and bytearray (and the argument type reoccurs in the return type).
Oh and of course there are also some uses of AnyStr, e.g. str.__add__ takes an AnyStr and returns the same. (I'm curious why there's no __radd__?)
Sorry about the long low-level rant, I'm just concerned that if we define separate and mutually incompatible classes bytes, str, NativeStr and unicode==Text, it's hard to maintain all these and to make sure that all combinations do something reasonable. The problem is that it's not just a set of rules for "when types X and Y meet, the result is Z". There need to be separate rules for each method, or at least for each of various categories of methods. In a previous incarnation I tried to do this for just str, bytes and unicode, where str was a subclass of bytes, and even there it was complicated to know whether I was doing it right. Maybe it's a little simpler without explicit subclass relationships. But I don't know exactly how and when mypy's promotion code works. (I will study it.)
FWIW here's the tip of the iceberg:
Replied in PR https://github.com/python/typeshed/pull/580.
I'm really sorry for dropping this. I'm afraid I cannot bring up the energy to move this issue forward. It hasn't felt important recently, the necessary changes to typeshed and mypy didn't feel straightforward enough, and now I think the status quo is probably fine. At least we have ways to refer to bytes and unicode (Text) in both versions, which really helps in writing straddling code (though not necessarily in type-checking).
There's a team at Dropbox that's experimenting with straddling code, their approach is simply to run mypy twice, once in Python 2 mode and once in Python 3 mode. It seems to work well enough for them.
I don't want to close the issue -- unlike "stuck" PRs, languishing issues don't bother me quite as much. Maybe someone else wants to work on some of the tricky implementation issues, e.g. having separate bytes and str types in Python 2, using them consistently in all the other stubs, and making mypy handle the various combinations correctly. But I recommend trying to make very small steps at first -- I think I bit off too much with my various attempts at introducing a separate bytes type in typeshed. (It's amazing how subtle the reasoning is about whether something "really" takes a str or a bytes.)
It's actually pretty interesting that the simple-minded approach that mypy uses (making str a subtype of unicode in Python 2 mode) actually seems to have caused very little trouble for our users. Though clearly it only finds a subset of potential str/unicode bugs.
(making str a subtype of unicode in Python 2 mode) actually seems to have caused very little trouble for our users
It caused trouble (or at least it didn't help alleviate it) with mixing str and unicode in a project I'm working on, see https://github.com/python/mypy/issues/2182
I don't like the profusion of string types with odd semantics.
What about the following? Basically we define the types reflecting the actual MROs:
    if PY2:
        class basestring(object): ...
        class str(basestring): ...
        class unicode(basestring): ...
        bytes = str
    else:
        class bytes(object): ...
        class str(object): ...
Then, to deal with the implicit promotion of str to unicode in Python 2, we need a bunch of tables/stub files, plus an AsciiBytes type.
E.g.
    if PY2:
        @overload
        def __add__(a: unicode, b: unicode) -> unicode: ...
        @overload
        def __add__(a: AsciiBytes, b: unicode) -> unicode: ...
        @overload
        def __add__(a: unicode, b: AsciiBytes) -> unicode: ...
        @overload
        def __add__(a: str, b: unicode) -> unicode: ...  # May warn if using this
        @overload
        def __add__(a: unicode, b: str) -> unicode: ...  # May warn if using this
        @overload
        def __add__(a: AsciiBytes, b: AsciiBytes) -> AsciiBytes: ...
        @overload
        def __add__(a: str, b: str) -> str: ...
    else:
        @overload
        def __add__(a: str, b: str) -> str: ...
        @overload
        def __add__(a: bytes, b: bytes) -> bytes: ...
As for the "text in spirit", that is just str, is it not?
The daunting part is that Python's string types have many methods and they aren't all that regular. And the rules for what exactly to warn about are also subtle.
And no, "text in spirit" is not just str -- it's unicode, and some but not all uses of str. (See https://github.com/python/typing/issues/208#issuecomment-240556782 I think.)
@gvanrossum @JukkaL Yes, having to run a type checker twice for PY2 and PY3 is not that convenient, but generally it should help during porting, especially with (type checker dependent) options for disallowing untyped stuff.
If we give up on the idea of finding implicit ASCII conversion errors in PY2 code, why don't we make bytes and Text incompatible in both directions and allow str to be compatible with everything?
Having bytes and Text as separate types would help porting to PY2+3 to some extent (as shown on my figures above). For any "native" strings or in "complicated" situations people could use str then. And at some point during porting they could double-check their usages of str across their type hints to make sure that str is used correctly with respect to PY2+3.
Mypy's current behavior (promote str to unicode for PY2, but not the other way round) is rather unexpected:
    from typing import Any
    def f_str(x):
        # type: (str) -> Any
        pass
    def f_bytes(x):
        # type: (bytes) -> Any
        pass
    f_str(u'unicode')  # False error!
    f_bytes(u'unicode')  # True error
Basically I'm proposing to return to this figure and stop at it:
plus
u'foo' -> Text
'foo' -> str
b'foo' -> bytes
@gvanrossum @JukkaL Could you please take a look at my latest comment?
PyCharm is moving towards adopting typeshed and I'm a bit worried about the underspecified behaviour of type checkers for text and binary data in Python 2. I have in mind an idea of the common test suite for mypy, PyCharm, and other type checkers to ensure compatibility (beyond syntactic correctness currently checked by tests/mypy_test.py and others). This str/bytes/unicode issue is one of the problems still unresolved in PEP 484.
Sorry for the lack of response. Lots going on...
why don't we make bytes and Text incompatible in both directions and allow str to be compatible with everything?
Assuming that proposal is for Python 2, that's pretty much what I settled on as my last proposal for giving up. It's also why we changed typed_ast to distinguish between Bytes, Str and Unicode.
The problem with this is not just the question of how to define the str and bytes classes in typeshed (I started on this but found it tricky) nor how to treat them in mypy (that can be done, though I haven't looked into it) but most of all the decisions about where to use str or bytes everywhere else in typeshed. That seems a daunting task, especially since most Python documentation doesn't specify types. I suppose you can turn them all into str and rely on the "str is compatible with everything" rule, but if we do it that way we might as well define unicode = str in __builtins__.pyi and be done with it...
@gvanrossum Yes, I meant having incompatible bytes and Text, and str compatible with everything just for Python 2.
Since 2013 we have had our repository of stubs with a PyCharm-specific legacy syntax for type hints. We solved the problem of text and binary data for Python 2 stubs (but not the ASCII conversions part) by introducing string as an alias that is compatible with both str and unicode. Our practice for annotating stuff from the standard library was that we almost always used string for the types of parameters and we used str and unicode where a function always returns a particular type, resorting to string otherwise.
We could use a similar approach for typeshed where we annotate parameters with str and try to be more specific (i.e. use Text or bytes) for return values when possible (and use str when it isn't).
Basically the new proposal seems to drop NativeStr as a separate type, but otherwise it's like the one before it? Here's a summary of how I understand it:
Python 2 type checking:
- Text or unicode (same as now) and u'foo' literals.
- bytes and b'foo' literals.
- str and 'foo' literals may be used.
- AnyStr would range over str, bytes and unicode. (The rules would likely be subtly more complex than this, but we can probably figure it out.)
- str would be compatible with bytes and unicode. Both bytes and unicode would be compatible with str. bytes and unicode aren't compatible with each other.
- unicode -> str compatibility (currently it's one way only: str -> unicode).
Python 2 stub changes:
- bytes class to Python 2. str could be mixed in operations with bytes, similar to how they can be mixed with unicode already. bytes and unicode wouldn't mix.
- str in Python 2 stubs would need to be checked, or we'd risk making the stubs less precise.
- bytes, and return types could remain as str. This would likely not break a lot of existing code that uses mypy. Interestingly, this is pretty much the opposite of the PyCharm practice.
- bytes return type.
- unicode types in stubs would likely have to be changed to Union[bytes, unicode].
Type checking Python 3 would not change.
There are plenty of str references in Python 2 stubs:
    $ ag '\bstr\b' stdlib/2 stdlib/2and3 third_party/2 third_party/2and3/|wc -l
    2477
unicode and Text are less frequent:
    $ ag '\b(Text|unicode)\b' stdlib/2 stdlib/2and3 third_party/2 third_party/2and3/|wc -l
    448
@JukkaL Yes, my understanding of the proposal matches your description.
As for the proposed changes in typeshed, @JukkaL could you elaborate on your idea of using bytes by default for parameters and attributes in Python 2 stubs?
I would say that in Python 2 it's almost always legal to pass both text and binary data to both a function that expects text and to a function that expects binary. (Assuming we won't try to catch implicit ASCII conversions). So I would do it the other way round: use str by default for parameters and switch to bytes or Text if it is always an error to pass a value of the other type. What's your opinion on that?
@vlasovskikh I agree that your proposal would result in fewer false positives and is worth considering.
My idea was to generalize how mypy currently works, and to catch an interesting subset of dangerous implicit ASCII conversions. Mypy doesn't consider unicode to be a valid argument to a function expecting a str, so changing the argument type to bytes would change little -- unicode would still not be accepted. Similarly, if a function currently returns str, mypy lets you pass the return value to a function expecting unicode. If we preserve the return value type as str, this wouldn't change.
Mypy currently doesn't work well with unicode_literals. If that is an important use case, your proposal is likely better. Also, mypy behaves a little arbitrarily -- it catches implicit conversions from unicode to str, but not vice versa, even though both can fail at runtime. The implicit hypothesis here is that code explicitly using unicode or u'foo' literals is likely to encounter non-ASCII values, whereas code using just plain str is more likely to be ASCII only, and thus implicit conversions are more likely to be safe from str than from unicode.
If we follow your proposal, I'd expect most existing code designed for mypy to still pass type checking, but mypy would catch fewer string conversion errors. Users could switch some str types in their code to bytes, and change some 'foo' literals to b'foo' literals. This would let mypy catch additional errors, and it would likely also make their code more Python 3 compatible.
It seems that having precise (non-str) return types in stubs would work together with imprecise argument types. We can have precise argument types OR return types, but not both, as then we'd reject too many implicit conversions.
@JukkaL So we agree on everything except for how we should change the types of arguments and return values in the typeshed stubs to reflect the proposal about separate bytes / str / Text and the compatibility rules discussed above.
Yes, I see your point about changing str to bytes being compatible with how mypy works for now. My objections mostly come from the fact that it's quite unsatisfying for the user to see false positive errors. We could proceed by changing str in arguments to bytes and keeping str in return values and then fix all the incoming issues about false positives. (I'm working on the common test data for typeshed for that, I'll share my prototype with the typeshed contributors in a few weeks). Another possibility is to start with str in the arguments and fix false negatives. From our perspective, the latter is preferable, but I don't object to the first option assuming we can easily fix typeshed stubs when we see a false positive error. What are your thoughts?
Please, please can we not mangle typeshed to work around current inadequacies in mypy. typeshed should be universal and its types need to be correct, even if they are inconvenient.
In case I am missing anything, are there any examples of standard library code involving text that cannot be correctly stubbed with the simple types str, unicode, bytes and AsciiBytes?
[ I am assuming that sys.setdefaultencoding() is not called and that the implicit conversion AsciiBytes->unicode is OK, but that the conversion bytes->unicode is an error. ]
A few concrete examples might help this discussion.
Modifying the type checker to handle different default encodings is left as an exercise :smile:
@markshannon Which current inadequacies in mypy shouldn't we work around? Being correct, whatever that means in this context, is not the only goal for stubs -- tools will also need to be able to type check effectively using the stubs without confusing users. Additionally, we have users using the current typeshed in production, and we don't want to cause them a big headache by requiring a lot of rework of existing annotations, if we can avoid that. We also need a volunteer to implement the proposal, including any necessary typeshed changes.
We gave up on the AsciiBytes idea, in part because annotating everything that may preserve the 'asciiness' of arguments precisely would be a huge hassle.
@markshannon We are discussing the opposite: how to make text vs binary data in Python 2 compatible with mypy, PyCharm, and other tools so that typeshed is universal for everyone and developers have the tools to annotate their PY2+3 text and binary data in a compatible way.
As @JukkaL mentioned, we failed to come up with a good working proposal about catching implicit ASCII conversions, but we still see value in having PY2+3 text and binary data standardized.
It turns out we've come up with a proposal that @JukkaL and I find satisfying. The remaining question is how to migrate typeshed to these new text / binary types.
Here's one idea for migrating typeshed.
First phase:
- bytes to typeshed and update str to mix with bytes. Update AnyStr.
- str / unicode.
- b'foo' literals in Python 2 mode.
Second phase:
- str types in typeshed. Gradually migrate typeshed annotations to use whatever scheme we come up with.
I'm happy to go along if we can work something out. I do have some questions about the actual class definitions for bytes, str and Text, and their subclassing relationships.
The following is entirely in the context of PY2 -- in PY3, we have bytes != str and str == Text (and bytes != Text follows), but in PY2 the proposal is to have bytes == str, str == Text, and bytes != Text.
This violates some transitive property, but that in itself doesn't bother me, since it's similar (at a smaller scope) to Any (e.g. int == Any, Any == str, but int != str, using '==' to spell "is compatible with in both directions").
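The non-transitive relation described here can be written down as a small table. This is a toy model of the proposed PY2 rules using plain type names as strings; COMPATIBLE and is_compatible are names invented for the sketch and have nothing to do with how mypy implements compatibility.

```python
# Pairs (source, target) that are compatible under the proposed PY2 rules,
# besides the trivial case of a type being compatible with itself.
COMPATIBLE = {
    ('bytes', 'str'), ('str', 'bytes'),
    ('str', 'Text'), ('Text', 'str'),
}


def is_compatible(source, target):
    """Return True if a value of type `source` may be used where `target` is expected."""
    return source == target or (source, target) in COMPATIBLE


print(is_compatible('bytes', 'str'))   # bytes == str
print(is_compatible('str', 'Text'))    # str == Text
print(is_compatible('bytes', 'Text'))  # but bytes != Text
```

The last call is the point of the exercise: both links through str hold, yet the relation between the two ends does not.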
I expect that the proposal will actually make type-checking involving Text weaker, since the following example currently fails, but should be accepted under the new rules:
    a = ''
    b = u''
    a = b  # Error here
    b = a  # This is OK
I guess I am okay with that since in reality it's often true that passing unicode where str is expected will work.
Some more questions: If I have something of type List[str] it should be compatible with List[bytes] and List[Text], right? Because List[Any] is compatible with List[int].
Re: Jukka's proposal: ISTM the four bullets in the first phase must all be closely coordinated -- after the first bullet is done (add separate bytes and str classes to typeshed and update AnyStr) there is code that will fail to type-check in the current version of mypy.
This is pretty much what I have attempted so far, and what prompted me to give up since I couldn't see how to easily change mypy according to the 3rd bullet. I will resurrect my branches and turn them into PRs so you can see.
@gvanrossum It would be logical for List[str] to be compatible with List[bytes] and List[Text], similar to how Any behaves. If that turns out to be non-trivial to implement in mypy, for the first experiments this could be left out.
For code that only uses str and unicode, the proposed rules seem to be equivalent to str being an alias for unicode, or at least pretty close. However, once we add some bytes types we will be able to do useful type checking.
OK. Note that mypy already distinguishes between b'x' and 'x' in PY2 mode -- it gives them the types builtins.bytes and builtins.str, respectively. It's just that currently bytes is an alias for str.
There's a long discussion on this topic in the mypy tracker: https://github.com/python/mypy/issues/1141
I'm surfacing it here because I can never remember whether that discussion is here, or in the typeshed repo, or in the mypy tracker.
(Adding str, bytes, unicode, Text, basestring as additional search keywords.)