checking for basestring makes more flexible

erikbgithub commented 9 years ago

In Python2 it's sadly possible to have a string which is not of exactly type str, e.g., unicode strings. For the isinstance() checks it doesn't matter which type of string it is, as long as it is one kind of them. To check for all kinds of strings, it is better to check for basestring, because that works for all string types.

superbobry commented 9 years ago

Hello, I'm not sure allowing both unicode and bytes is a good idea. Stream was designed to be used with text, thus it forces unicode (or str in Python 3) for input type.

Can you provide an example of when this behaviour might be convenient (and safe)?

erikbgithub commented 9 years ago

I have to tell you I didn't investigate in depth what I was actually feeding to my Stream object before making that change. It just worked with that change. Now that I look into it, I'm actually feeding Stream with a str object but still get the exception.

I can't tell you why, but the following code called with pyte==0.4.8 (Python 2.7) raises an Exception in stream.py in line 165:

import pyte
s=pyte.Stream()
s.feed(str('hi')) #same without str() just want to show I really just got a str here

Exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/test/venv/local/lib/python2.7/site-packages/pyte/streams.py", line 165, in feed
    raise TypeError("%s requires str input" % self.__class__.__name__)
TypeError: Stream requires str input

If I just check by myself something like isinstance('',str) it's always true. I have no idea why it is a difference when using feed(). Looking into the code there is nothing that would change chars.

With the suggested patch it works, though. From a little googling I think that Python3's str is equivalent to Python2's basestring when checking for types?

jquast commented 9 years ago

False -- python2's str and python3's basestring are not equivalent. I've read that SO answer and it's wrong in these terms, it doesn't seem applicable here -- in py3 str means only unicode, never bytes. Whereas in py2, basestring could be either unicode or bytes.

In pexpect, we take the approach of allowing input as bytes or unicode, and output as bytes only for the backwards-compatible "spawn" class, and only input and output as unicode for the newly added unicode-only "spawnu" class.

You can see similar code in the class variable definitions wrapped by if PY3 of SpawnBase and SpawnBaseUnicode:

https://github.com/pexpect/pexpect/blob/master/pexpect/spawnbase.py#L447-454 https://github.com/pexpect/pexpect/blob/master/pexpect/spawnbase.py#L16-32

Python 2.7.8

>>> isinstance(b'xyz', basestring)
True
>>> isinstance(b'xyz', str)
True
>>> isinstance(u'xyz', basestring)
True

Python 3.4.1

>>> isinstance(b'xyz', str)
False
>>> isinstance(b'xyz', bytes)
True
>>> isinstance(u'xyz', str)
True

erikbgithub commented 9 years ago

@jquast Okay, but can you understand why Stream.feed() excepts on receiving a string although it just checks for that? Do you get the same error?

superbobry commented 9 years ago

The message in the raised exception is slightly misleading.

pyte aliases str to unicode for Python2. So the Stream#feed method really checks its input against unicode not __builtins__.str.

erikbgithub commented 9 years ago

Then how about checking for basestring and converting any (other) type to unicode? (see updated pull request 09187b6)

I don't want to stream.feed(unicode('hi')). That's just so unpythonic.

superbobry commented 9 years ago

The ultimate solution is to explicitly decode text from bytes before calling feed. Calling unicode might not always work because it implicitly uses the codec for the default encoding (see sys.getdefaultencoding).

>>> b = u"привет".encode("utf-8")
>>> b
'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
>>> unicode(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

erikbgithub commented 9 years ago

You have shown a negative example, but I'm still not getting the positive one. How would you handle that string to make sure it has the right format?

superbobry commented 9 years ago

The problem in the above example is that unicode uses "ascii", which is the default encoding for Linux, to decode b. To fix this behaviour we need to pass the encoding argument explicitly:

>>> b = u"привет".encode("utf-8")
>>> b
'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
>>> print(unicode(b, encoding="utf-8"))
привет
>>> print(b.decode("utf-8"))  # equiv. to calling unicode
привет

erikbgithub commented 9 years ago

So what you say is, that you can't cast any basestring to unicode because you can't know the encoding of the incoming basestring? Then I have to agree, checking for the resulting type might be the only way to go.

superbobry commented 9 years ago

basestring is not a type it's a union of unicode and str (aka bytes). For str your statements holds: you cannot automagically decode text from an arbitrary sequence of bytes. You need to know the encoding.

erikbgithub commented 9 years ago

I've read a lot about unicode now and conclude with closing this pull request and adding the following code to my Python2 project. I suggest, to update the error message for Python2, though, because saying that you require a (python2) str is actually a real bug, right?

>>> from __future__ import unicode_literals

Result:

# before
>>> import pyte
>>> s=pyte.Stream()
>>> s.feed("hi")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/test/venv/local/lib/python2.7/site-packages/pyte/streams.py", line 165, in feed
    raise TypeError("%s requires str input" % self.__class__.__name__)
TypeError: Stream requires str input

#after
>>> from __future__ import unicode_literals
>>> s.feed("hi")
>>>

superbobry commented 9 years ago

Yep, unicode_literals should promote all string literals to be unicode.

I suggest, to update the error message for Python2, though, because saying that you require a (python2) str is actually a real bug, right?

Sure, thank you.

erikbgithub commented 9 years ago

Oh I just wanted to start on the error messages, but I see, you've already worked on that (139c2275874e982e41201). Great!

selectel / pyte

checking for basestring makes more flexible #34