stitchfix / splits

A Python library for dealing with splittable files
MIT License

`.decode` in `SplitReader.read` fails for some values of `num` #23

Open andyalmandhunter opened 6 years ago

andyalmandhunter commented 6 years ago

The decode step in `SplitReader.read` will sometimes raise a `UnicodeDecodeError`. I think this is because it decodes `num` bytes from the file at a time, and a chunk of `num` bytes isn't necessarily valid UTF-8 even when the full contents of the file are: the chunk boundary can fall in the middle of a multi-byte character.
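The failure mode can be illustrated without `splits` at all. The following is a minimal sketch, assuming the problem is exactly the chunk-boundary issue described above; the sample string is made up for illustration, chosen so that its middle character starts with the same `0xe2` byte that appears in the traceback below.

```python
# Standalone illustration; it does not use splits or SFReader.
# u"\u2019" (right single quotation mark) encodes to three bytes: e2 80 99.
data = u"a\u2019b".encode("utf-8")

# Decoding the whole buffer at once works.
whole = data.decode("utf-8")

# Decoding one byte at a time fails at the first byte of the
# multi-byte character, because that byte alone is not valid UTF-8.
failed_at = None
for i in range(len(data)):
    chunk = data[i:i + 1]
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        failed_at = i
        break
```

Here `failed_at` ends up as the index of the `0xe2` byte, while `whole` decodes cleanly.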

To reproduce:

This works fine:

>>> from ripley.readers import SFReader
>>> f = SFReader('prod', 'style')
2018-04-06 14:11:31,837 [INFO] wednesday.client:18 - __init__ ServiceClient for 'staunch' in environment 'prod' with base_uri 'http://staunch.vertigo.stitchfix.com'
2018-04-06 14:11:32,289 [INFO] ripley.metadata:41 - Finding metadata for prod.style?_no_partition_=y
2018-04-06 14:11:33,279 [INFO] ripley.metadata:54 - Only hive metadata found (no s3)
>>> s = f.read()
>>>

This fails:

>>> f = SFReader('prod', 'style')
2018-04-06 14:20:02,814 [INFO] wednesday.client:18 - __init__ ServiceClient for 'staunch' in environment 'prod' with base_uri 'http://staunch.vertigo.stitchfix.com'
2018-04-06 14:20:06,387 [INFO] ripley.metadata:41 - Finding metadata for prod.style?_no_partition_=y
2018-04-06 14:20:07,349 [INFO] ripley.metadata:54 - Only hive metadata found (no s3)
>>> s = ''
>>> while True:
...     s += f.read(1)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/ahunter/.virtualenvs/aa-py2/lib/python2.7/site-packages/splits/readers.py", line 62, in read
    new_data = new_data.decode('utf-8')
  File "/Users/ahunter/.virtualenvs/aa-py2/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 0: unexpected end of data
>>>

andyalmandhunter commented 6 years ago

FWIW, I think even in Python 3, `file.read(size)` interprets `size` as a number of bytes when the file is opened in binary mode (in text mode it counts characters instead).
So as convenient as it is, I'm thinking that the decode step doesn't really belong in this method.

See https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
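
If a text-returning `read` is still wanted, one possible direction is to decode incrementally instead of decoding each chunk on its own. The sketch below is not the library's code: `IncrementalTextReader` is a hypothetical wrapper, and `io.BytesIO` stands in for whatever byte stream the real reader wraps. It relies on the standard library's `codecs.getincrementaldecoder`, which buffers an incomplete multi-byte sequence until the remaining bytes arrive.

```python
import codecs
import io


class IncrementalTextReader(object):
    """Hypothetical wrapper (not part of splits) around a byte stream."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw  # any object whose read(num) returns bytes
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, num=-1):
        # num is the number of bytes requested from the underlying
        # stream per attempt, not a number of characters.
        while True:
            data = self._raw.read(num)
            # An empty read means the stream is exhausted; final=True then
            # makes a truncated trailing sequence raise UnicodeDecodeError.
            text = self._decoder.decode(data, final=not data)
            # Keep reading while the decoder is still holding a partial
            # character, so that an empty result only ever means EOF.
            if text or not data:
                return text


raw = io.BytesIO(u"a\u2019b".encode("utf-8"))
reader = IncrementalTextReader(raw)
pieces = []
while True:
    piece = reader.read(1)
    if not piece:
        break
    pieces.append(piece)
result = u"".join(pieces)
```

With this approach the loop from the reproduction terminates and `result` equals the original string. The alternative suggested above, returning raw bytes and leaving decoding to the caller, avoids the problem entirely at the cost of changing what `read` returns.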