downstream clients don't expect unicode

stitchfix / splits

A Python library for dealing with splittable files

MIT License

42 stars 10 forks source link

downstream clients don't expect unicode #22

Open drudd opened 6 years ago

drudd commented 6 years ago

The addition of .decode to the splits reader broke several downstream use cases (I believe all through ripley).

See: https://github.com/stitchfix/ripley/issues/83

Loads to redis that use unicodecsv + SFReader https://aa-jenkins.vertigo.stitchfix.com/job/orbiter-etl--0.2.22--load/296/console https://github.com/stitchfix/orbiter/blob/master/orbiter/lib/redis_cache_loader.py

Adding the issue here as well in case there are other non-ripley issues.

andyalmandhunter commented 6 years ago

I think this is more than an issue of what downstream clients expect. The addition of .decode breaks SplitReader.read() if you pass a num, because a subset of bytes from a valid utf-8 encoded string is not necessarily itself a valid utf-8 encoded string.

drudd commented 6 years ago

I think the return is always in units of num characters, whatever the encoding. https://github.com/stitchfix/splits/blob/master/splits/readers.py#L65

I'm not sure if num is supposed to refer to bytes...

drudd commented 6 years ago

I don't think the problem is returning a subset of bytes that are not a valid utf-8 string, but rather we read is now returning in units of characters and not bytes.

read definition: https://docs.python.org/2/library/stdtypes.html#file.read

andyalmandhunter commented 6 years ago

It looks like SplitReader.read calls file.read asking for num bytes, and then tries to decode those bytes. It also tries to do this repeatedly until it has num characters to return, but on each iteration .decode will be passed num - len(val) bytes, which isn't guaranteed to work.

Maybe I should make a new issue, because as you noted there are also issues caused downstream by the return type.

andyalmandhunter commented 6 years ago

See #23