Open drudd opened 6 years ago
I think this is more than an issue of what downstream clients expect. The addition of `.decode` breaks `SplitReader.read()` if you pass a `num`, because a subset of bytes from a valid utf-8 encoded string is not necessarily itself a valid utf-8 encoded string.
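A minimal sketch of that failure mode, independent of splits itself. The sample string and the `try_decode` helper are made up for illustration:

```python
# -*- coding: utf-8 -*-
text = u"caf\u00e9"           # 4 characters
data = text.encode("utf-8")   # 5 bytes: the e-acute is encoded as two bytes


def try_decode(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


whole = try_decode(data)      # the full byte string decodes fine
cut = try_decode(data[:4])    # slicing at 4 bytes splits the two-byte sequence
```

Here `data[:4]` ends in the middle of a multi-byte sequence, so decoding it raises `UnicodeDecodeError` even though `data` as a whole is valid.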
I think the return is always in units of `num` characters, whatever the encoding.
https://github.com/stitchfix/splits/blob/master/splits/readers.py#L65
I'm not sure if `num` is supposed to refer to bytes...
I don't think the problem is returning a subset of bytes that are not a valid utf-8 string, but rather that `read` is now returning in units of characters and not bytes.
`read` definition: https://docs.python.org/2/library/stdtypes.html#file.read
It looks like `SplitReader.read` calls `file.read` asking for `num` bytes, and then tries to decode those bytes. It also tries to do this repeatedly until it has `num` characters to return, but on each iteration `.decode` will be passed `num - len(val)` bytes, which isn't guaranteed to work.
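One possible way around this, sketched under assumptions: if `num` is meant to count characters, an incremental decoder can buffer a trailing partial sequence between reads instead of failing on it. `read_chars` is a hypothetical helper, not part of splits, and it assumes `raw` is a binary file-like object and that every character takes at least one byte (true for UTF-8).

```python
import codecs
import io


def read_chars(raw, num, encoding="utf-8"):
    """Read up to num characters from a binary file-like object (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    chunks = []
    count = 0
    while count < num:
        # Each missing character needs at least one more byte, so this never overshoots.
        data = raw.read(num - count)
        if not data:
            # End of input: flush the decoder; raises if a character is left incomplete.
            chunks.append(decoder.decode(b"", final=True))
            break
        # An incomplete trailing sequence stays buffered inside the decoder.
        text = decoder.decode(data)
        chunks.append(text)
        count += len(text)
    return u"".join(chunks)


raw = io.BytesIO(u"caf\u00e9 au lait".encode("utf-8"))
first = read_chars(raw, 4)    # needs two underlying reads to complete the e-acute
second = read_chars(raw, 3)
```

Whether this is the right fix depends on whether `num` should mean bytes or characters, which is still the open question above.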
Maybe I should make a new issue, because as you noted there are also issues caused downstream by the return type.
See #23
The addition of `.decode` to the splits reader broke several downstream use cases (I believe all through ripley). See: https://github.com/stitchfix/ripley/issues/83
Loads to redis that use unicodecsv + SFReader https://aa-jenkins.vertigo.stitchfix.com/job/orbiter-etl--0.2.22--load/296/console https://github.com/stitchfix/orbiter/blob/master/orbiter/lib/redis_cache_loader.py
Adding the issue here as well in case there are other non-ripley issues.