rpm-software-management / urlgrabber

GNU Lesser General Public License v2.1

Use binary mode when reopening files #32

Closed meaksh closed 2 years ago

meaksh commented 2 years ago

This PR fixes an issue in version 4.1 that appears when interacting directly with a PyCurlFileObject:

Traceback (most recent call last):
  File "test_urlgrabber.py", line 4, in <module>
    pycurl_obj.fo.read()
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 285: invalid start byte

Apparently the mechanism that reopens the downloaded file does not use binary mode, as it should.

You can easily reproduce this by running this example:

import urlgrabber
url = "https://www.google.com/index.html"
pycurl_obj = urlgrabber.grabber.PyCurlFileObject(url, "index.html", urlgrabber.grabber.URLGrabberOptions())
pycurl_obj.fo.read()

This PR fixes the issue so that manual PyCurlFileObject interaction works correctly.
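The failure mode can also be shown without urlgrabber or network access. The following is a minimal sketch, not the library's code: the payload, the temporary file, and the explicit `encoding="utf-8"` are assumptions chosen to mirror the traceback above. It writes a byte that is not valid UTF-8, then reopens the file first in text mode and then in binary mode.

```python
import os
import tempfile

# Hypothetical payload: 0xa0 cannot start a UTF-8 sequence,
# matching the byte reported in the traceback.
payload = b"<html>\xa0</html>"

# Write in binary mode, as a download would.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Reopening in text mode forces a decode on read and fails.
    try:
        with open(path, "r", encoding="utf-8") as fo:
            fo.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Reopening in binary mode returns the raw bytes untouched.
    with open(path, "rb") as fo:
        binary_data = fo.read()
finally:
    os.remove(path)

print(type(text_mode_error).__name__)
print(binary_data == payload)
```

With text mode the read raises before any data is returned, while binary mode hands back exactly what was written.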

sbluhm commented 2 years ago

@Conan-Kudo @m-blaha, any chance either of you could review, merge, and release?

james-antill commented 2 years ago

I think this is fine ... the writes are in binary mode, which implies this read should be too. I'd assume the example given would work better as is, maybe it does in py2?

As always with Python, there's a worry that someone else's code now breaks after this change.
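To make that worry concrete: a caller that previously received `str` from `fo.read()` would receive `bytes` once the file is reopened in binary mode. A rough sketch of how such a caller might tolerate both follows; `as_text` is a hypothetical helper, not part of urlgrabber, and the default encoding and error policy are assumptions.

```python
def as_text(data, encoding="utf-8", errors="replace"):
    """Return str whether data came from a binary-mode or text-mode read.

    Hypothetical helper for illustration; not part of urlgrabber.
    """
    if isinstance(data, bytes):
        # Binary-mode read: decode explicitly, replacing invalid bytes.
        return data.decode(encoding, errors=errors)
    # Text-mode read: already a str.
    return data


print(as_text(b"caf\xc3\xa9"))
print(as_text("already text"))
```

Whether replacing undecodable bytes is acceptable depends on the caller; code that needs the exact payload should keep working with the bytes directly.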