scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
50.99k stars 10.34k forks source link

Different behavior of `get_header` between urllib.Request and WrappedRequest #6308

Closed marinelay closed 4 weeks ago

marinelay commented 4 weeks ago

Description

I believe WrappedRequest is a class for supporting methods in urllib.Request, but I found a behavior of get_header method in WrappedRequest is different from urllib.Requests. When trying to get a header which is not present without default value, it should return None value (https://docs.python.org/3/library/urllib.request.html#urllib.request.Request.get_header), but get_header in WrappedRequest raises TypeError: to_unicode must receive a bytes or str object, got NoneType.

Steps to Reproduce

from urllib.request import Request as _Request
from scrapy.http.request import Request
from scrapy.http.cookies import WrappedRequest

a = _Request(url="https://a.example")
print(a.get_header('xxxx'))

b = WrappedRequest(Request(url="https://a.example"))
print(b.get_header('xxxx'))

Expected behavior:

None
None

Actual behavior:

None
TypeError: to_unicode must receive a bytes or str object, got NoneType

Additional context

This issue is currently not a problem when interacting with the CookieJar class. Thus, it is not an urgent matter unless one is importing and using WrappedRequest. However, I think it would enhance the reliability in this project. Thank you!

siddharth07-ui commented 4 weeks ago

Hi @marinelay and @wRAR, while looking through this good first issue, I found the line in scrapy, which seems to be the root cause of providing this error for Request with WrappedRequest as shown in the line --

===================================================================== from urllib.request import Request as _Request from scrapy.http.request import Request from scrapy.http.cookies import WrappedRequest

a = _Request(url="https://a.example") print(a.get_header('xxxx'))

b = WrappedRequest(Request(url="https://a.example")) print("WrappedRequest get-header result:", b.get_header('xxxx')) -- This one

=====================================================================

None TypeError: to_unicode must receive a bytes or str object, got NoneType -- Result

=====================================================================

Where as per behavior, it should be returning None for both the requests.

This is line 173 in "http/response/cookies.py" --

return to_unicode(self.request.headers.get(name, default), errors="replace")

Because of "to_unicode" being used, which as per function definition says -- """Return the unicode representation of a bytes object text. If text is already an unicode object, return it as-is."""

Here is checks output of 'self.request.headers.get(name, default), errors="replace"', which in this case would be "str" to be a valid candidate. If this is not the case, hence the error - "TypeError: to_unicode must receive a bytes or str object, got NoneType".

Hence a viable solution to this can be - "return self.request.headers.get(name, default)", which returns the output as "None".

scrapy1

Do let me know if this is the correct solution to this, or it might interfere with some other functionality for it.

wRAR commented 4 weeks ago

Removing the to_unicode() call is obviously incorrect.

marinelay commented 4 weeks ago

Then how about exception handling code such like this?

def get_header(self, name, default=None):
    try:
        return to_unicode(self.request.headers.get(name, default), errors="replace")
    except TypeError:
        return default
siddharth07-ui commented 4 weeks ago

Hi @marinelay, I guess that gives the required output. Since it usually throws a 'TypeError', handling this part using an exception block is neat 👍.