Closed erikbgithub closed 9 years ago
Hello, I'm not sure allowing both unicode and bytes is a good idea. Stream was designed to be used with text, thus it forces unicode (or str in Python 3) for input type.
Can you provide an example of when this behaviour might be convenient (and safe)?
I have to tell you I didn't investigate in depth what I was actually feeding to my Stream object before making that change. It just worked with that change. Now that I look into it, I'm actually feeding Stream with a str object but still get the exception.
I can't tell you why, but the following code called with pyte==0.4.8 (Python 2.7) raises an exception in streams.py at line 165:
import pyte
s=pyte.Stream()
s.feed(str('hi')) #same without str() just want to show I really just got a str here
Exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/test/venv/local/lib/python2.7/site-packages/pyte/streams.py", line 165, in feed
raise TypeError("%s requires str input" % self.__class__.__name__)
TypeError: Stream requires str input
If I just check by myself something like isinstance('', str) it's always true. I have no idea why there is a difference when using feed(). Looking into the code there is nothing that would change chars.
With the suggested patch it works, though. From a little googling I think that Python3's str is equivalent to Python2's basestring when checking for types?
False -- python3's str and python2's basestring are not equivalent. I've read that SO answer and it's wrong on this point, and it doesn't seem applicable here -- in py3, str means only unicode, never bytes, whereas in py2, basestring could be either unicode or bytes.
In pexpect, we take the approach of allowing input as bytes or unicode, and output as bytes only for the backwards-compatible "spawn" class, and only input and output as unicode for the newly added unicode-only "spawnu" class.
You can see similar code in the class variable definitions wrapped by if PY3 of SpawnBase and SpawnBaseUnicode:
https://github.com/pexpect/pexpect/blob/master/pexpect/spawnbase.py#L447-454 https://github.com/pexpect/pexpect/blob/master/pexpect/spawnbase.py#L16-32
Python 2.7.8
>>> isinstance(b'xyz', basestring)
True
>>> isinstance(b'xyz', str)
True
>>> isinstance(u'xyz', basestring)
True
Python 3.4.1
>>> isinstance(b'xyz', str)
False
>>> isinstance(b'xyz', bytes)
True
>>> isinstance(u'xyz', str)
True
@jquast Okay, but can you understand why Stream.feed() raises on receiving a string although it just checks for that? Do you get the same error?
The message in the raised exception is slightly misleading. pyte aliases str to unicode for Python2. So the Stream#feed method really checks its input against unicode, not __builtins__.str.
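A minimal sketch of the kind of aliasing described above. The names `text_type` and `feed` are hypothetical stand-ins chosen for illustration, not pyte's actual source; the point is only that a check against the aliased type rejects a Python 2 str, because that str is bytes.

```python
import sys

# Hypothetical compatibility alias: on Python 2 the "text" type is unicode,
# on Python 3 it is the built-in str.
if sys.version_info[0] == 2:
    text_type = unicode  # noqa: F821 (only defined on Python 2)
else:
    text_type = str


def feed(chars):
    """Stand-in for a text-only feed(): accept text, reject bytes."""
    if not isinstance(chars, text_type):
        raise TypeError("feed requires %s input, got %s"
                        % (text_type.__name__, type(chars).__name__))
    return chars


feed(u"hi")          # accepted on both Python 2 and Python 3
try:
    feed(b"hi")      # bytes (a plain str on Python 2) is rejected
except TypeError as exc:
    print(exc)
```

Including the received type in the message, as this sketch does, is one way to make such an error less ambiguous on Python 2.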
Then how about checking for basestring and converting any (other) type to unicode? (see updated pull request 09187b6)
I don't want to stream.feed(unicode('hi')). That's just so unpythonic.
The ultimate solution is to explicitly decode text from bytes before calling feed. Calling unicode might not always work because it implicitly uses the codec for the default encoding (see sys.getdefaultencoding).
>>> b = u"привет".encode("utf-8")
>>> b
'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
>>> unicode(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
You have shown a negative example, but I'm still not getting the positive one. How would you handle that string to make sure it has the right format?
The problem in the above example is that unicode uses "ascii", which is the default encoding in Python 2 (see sys.getdefaultencoding), to decode b. To fix this behaviour we need to pass the encoding argument explicitly:
>>> b = u"привет".encode("utf-8")
>>> b
'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
>>> print(unicode(b, encoding="utf-8"))
привет
>>> print(b.decode("utf-8")) # equiv. to calling unicode
привет
So what you're saying is that you can't cast any basestring to unicode because you can't know the encoding of the incoming basestring? Then I have to agree, checking for the resulting type might be the only way to go.
basestring is not a concrete type; it's the abstract superclass of unicode and str (aka bytes), so in practice it acts as a union of the two. For str your statement holds: you cannot automagically decode text from an arbitrary sequence of bytes. You need to know the encoding.
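One way to apply this in calling code is a small helper that decodes bytes with an encoding the caller names explicitly. This is only a sketch: `to_text` is a hypothetical helper, not part of pyte, and "utf-8" is an assumption about where the bytes came from.

```python
def to_text(data, encoding):
    """Return text, decoding bytes with an explicitly named encoding."""
    if isinstance(data, bytes):
        # bytes is the plain str type on Python 2 and a separate type on
        # Python 3; either way it has to be decoded before it is text.
        return data.decode(encoding)
    return data  # already text (unicode on Python 2, str on Python 3)


raw = u"привет".encode("utf-8")
text = to_text(raw, "utf-8")
```

The result could then be passed on with something like stream.feed(to_text(raw, "utf-8")). The encoding is a required argument on purpose, so the caller cannot fall back on an implicit default.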
I've read a lot about unicode now and have decided to close this pull request and add the following code to my Python2 project. I suggest updating the error message for Python2, though, because saying that you require a (python2) str is actually a real bug, right?
>>> from __future__ import unicode_literals
Result:
# before
>>> import pyte
>>> s=pyte.Stream()
>>> s.feed("hi")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/test/venv/local/lib/python2.7/site-packages/pyte/streams.py", line 165, in feed
raise TypeError("%s requires str input" % self.__class__.__name__)
TypeError: Stream requires str input
#after
>>> from __future__ import unicode_literals
>>> s.feed("hi")
>>>
Yep, unicode_literals should promote all string literals to be unicode.
I suggest, to update the error message for Python2, though, because saying that you require a (python2) str is actually a real bug, right?
Sure, thank you.
Oh I just wanted to start on the error messages, but I see, you've already worked on that (139c2275874e982e41201). Great!
In Python2 it's sadly possible to have a string which is not of exactly type str, e.g., unicode strings. For the isinstance() checks it doesn't matter which type of string it is, as long as it is one kind of them. To check for all kinds of strings, it is better to check for basestring, because that works for all string types.