
Integrating xtractmime into Scrapy #5204

Open akshaysharmajs opened 2 years ago

akshaysharmajs commented 2 years ago

As per the discussion with @elacuesta and @Gallaecio, this PR integrates the xtractmime library into Scrapy for MIME sniffing.

Fixes #2900, fixes #4240.

Changes

Behavior:

API:

Not implemented

To-do

codecov[bot] commented 2 years ago

Codecov Report

Merging #5204 (13bc149) into master (1c9d308) will increase coverage by 0.03%. The diff coverage is 94.36%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #5204      +/-   ##
==========================================
+ Coverage   88.55%   88.59%   +0.03%
==========================================
  Files         160      160
  Lines       11607    11689      +82
  Branches     1883     1905      +22
==========================================
+ Hits        10279    10356      +77
- Misses       1003     1007       +4
- Partials      325      326       +1
```

| Files | Coverage Δ | |
|---|---|---|
| scrapy/core/downloader/handlers/datauri.py | `100.00% <100.00%> (+5.88%)` | :arrow_up: |
| scrapy/core/downloader/handlers/file.py | `100.00% <100.00%> (ø)` | |
| scrapy/core/downloader/handlers/ftp.py | `98.38% <100.00%> (ø)` | |
| scrapy/core/downloader/handlers/http11.py | `93.97% <100.00%> (ø)` | |
| scrapy/core/downloader/webclient.py | `94.77% <100.00%> (ø)` | |
| scrapy/core/http2/stream.py | `91.90% <100.00%> (ø)` | |
| scrapy/extensions/httpcache.py | `95.47% <100.00%> (ø)` | |
| scrapy/http/request/form.py | `94.65% <100.00%> (-0.05%)` | :arrow_down: |
| scrapy/http/response/text.py | `98.50% <100.00%> (+0.06%)` | :arrow_up: |
| scrapy/linkextractors/lxmlhtml.py | `96.21% <100.00%> (+0.65%)` | :arrow_up: |
| … and 4 more | | |
akshaysharmajs commented 2 years ago

Looking at https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py, I think from_args is the main function required by other Scrapy files for MIME sniffing. I think calling xtractmime.extract_mime with different parameters, based on what arguments are passed to from_args, will be good. I am not sure whether the other functions in responsetypes.py are still required.

Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what can be added to it; application/pdf could be one.

Gallaecio commented 2 years ago

[…] I think from_args is the main function […] for MIME sniffing. I think calling xtractmime.extract_mime […] to from_args will be good.

I think so too.

I am not sure whether the other functions in responsetypes.py are still required.

xtractmime will basically replace all other methods there. We will need to keep them around for backward compatibility, but I imagine that, as part of this pull request, we should have them all log a warning except for from_args.
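For illustration, the kind of warning suggested here could look roughly like this (a sketch only; the exact warning class and message are not part of this PR yet):

```python
import warnings

from scrapy.exceptions import ScrapyDeprecationWarning


class ResponseTypes:
    def from_mimetype(self, mimetype):
        # Keep the method working for backward compatibility, but warn that
        # from_args (backed by xtractmime) is the supported entry point now.
        warnings.warn(
            "ResponseTypes.from_mimetype() is deprecated; use from_args() instead.",
            ScrapyDeprecationWarning,
            stacklevel=2,
        )
        ...  # existing lookup logic would remain here
```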

Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what can be added to it; application/pdf could be one.

We don’t need additional classes.

Response is the right class for any binary response (e.g. PDF), and it is already used for any MIME type not mapped in CLASSES, so there’s nothing you need to do about binary MIME types.

If you can think of additional MIME types that make sense for one of the existing Response subclasses (HtmlResponse, XmlResponse, TextResponse), then please do feel free to update CLASSES accordingly.

Related to that, although not achievable simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.

mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response
akshaysharmajs commented 2 years ago

Related to that, although not achievable simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.

mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response

That's a great idea, I will add this functionality to xtractmime 👍🏼
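For reference, such helpers could look roughly like this, following the MIME type groups defined in the WHATWG standard (a sketch; the names mirror the example above, and the eventual xtractmime API may differ):

```python
def is_html_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#html-mime-type
    return mime_type == b"text/html"


def is_xml_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#xml-mime-type: ends in "+xml",
    # or is exactly text/xml or application/xml.
    return mime_type.endswith(b"+xml") or mime_type in (b"text/xml", b"application/xml")


def is_json_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#json-mime-type
    return mime_type.endswith(b"+json") or mime_type in (b"application/json", b"text/json")
```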

akshaysharmajs commented 2 years ago

What should the value of the supported_types parameter of extract_mime be? Is it required here or not?

Gallaecio commented 2 years ago

A similar thing applies to nosniff. In the future we may want to expose a Scrapy setting to allow users to force sniffing regardless of X-Content-Type-Options. But since that feature is not in the current implementation, and we wouldn’t expect it to be used extensively, I think it is OK to leave that out until a user requests that feature.

When I said this, I did not mean for you to remove your related code. I meant that, at some point in the future, users may ask to be able to send a custom value for this parameter to xtractmime, overriding whatever the X-Content-Type-Options header says, but that, as far as this pull request goes, relying on X-Content-Type-Options would be OK.

However, come to think of it, X-Content-Type-Options could be exploited to prevent the use of a specialized response class. So maybe it is better not to rely on X-Content-Type-Options for now, and maybe in the future make it possible to rely on it, opt-in, through a setting.
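For clarity, “relying on X-Content-Type-Options” would mean something along these lines (a sketch only; the extract_mime keyword arguments, such as content_types and no_sniff, are assumptions based on the parameters discussed in this thread):

```python
from xtractmime import extract_mime


def sniff_mime(body: bytes, headers: dict):
    # `headers` is a plain dict of bytes keys/values here for simplicity.
    content_type = headers.get(b"Content-Type")
    no_sniff = headers.get(b"X-Content-Type-Options", b"").strip().lower() == b"nosniff"
    return extract_mime(
        body,
        content_types=(content_type,) if content_type else None,
        no_sniff=no_sniff,  # parameter name assumed
    )
```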

akshaysharmajs commented 2 years ago

I have added the pre- and post-xtractmime tests with the expected behavior as comments. There may be more failing scenarios; if I find any, I will add them later. Still, a lot of tests are failing.

akshaysharmajs commented 2 years ago

E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>

This is failing because mimetypes.MimeTypes() is returning a text/xml content type instead of application/gzip:

>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>> 
akshaysharmajs commented 2 years ago

E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>

This is failing because we are not considering the NULL byte anymore, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.

If you want, I can update the existing comments for the tests based on the updated behavior.

Gallaecio commented 2 years ago

E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>

This is failing because mimetypes.MimeTypes() is returning a text/xml content type instead of application/gzip:

>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>> 

It looks like we need more complex logic than just taking the first item in the tuple that guess_type returns.

Based on https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type, I think that if the second item of the tuple is not None, we should interpret the MIME type as application/<tuple second value>.
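For example, that interpretation could be implemented along these lines (a sketch; guess_mime_type is a hypothetical helper, not existing Scrapy code):

```python
from mimetypes import MimeTypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    # guess_type() returns (type, encoding); a non-None encoding (e.g. "gzip"
    # for data.xml.gz) means the outer container is what should be reported.
    mime_type, encoding = MimeTypes().guess_type(filename)
    if encoding is not None:
        return f"application/{encoding}"
    return mime_type


# guess_mime_type("data.xml.gz") -> "application/gzip"
# guess_mime_type("data.xml")    -> "text/xml"
```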

Gallaecio commented 2 years ago

E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>

This is failing because we are not considering the NULL byte anymore, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.

If you want, I can update the existing comments for the tests based on the updated behavior.

Actually, I believe the current NULL byte replacement is too simple: we should only replace NULL bytes if there are no other binary data bytes, but the current approach always replaces them. See https://github.com/scrapy/scrapy/pull/5204#discussion_r679468793

akshaysharmajs commented 2 years ago

I thought the integration part would be simpler, I was wrong 😅

akshaysharmajs commented 2 years ago

Actually, I believe the current NULL byte replacement is too simple: we should only replace NULL bytes if there are no other binary data bytes, but the current approach always replaces them. See #5204 (comment)

Currently, I am checking the whole body for binary bytes:

        for index in range(len(body)):
            if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
                contains_binary_bytes = True
                break

        if not contains_binary_bytes:
            body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?

akshaysharmajs commented 2 years ago

I have created a separate PR for the response class computation using mimegroups. Please review https://github.com/akshaysharmajs/scrapy/pull/2/files

Gallaecio commented 2 years ago
        for index in range(len(body)):
            if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
                contains_binary_bytes = True
                break

        if not contains_binary_bytes:
            body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?

I think so, yes. You could just set body = body[:RESOURCE_HEADER_BUFFER_LENGTH] before the code above, and work with just body in this code.
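Put together, the suggestion could look roughly like this (a sketch; contains_binary and RESOURCE_HEADER_BUFFER_LENGTH are the names already used in this PR):

```python
# Truncate first, then look for binary bytes other than NULL in the truncated
# buffer, and only strip NULL bytes if none are found.
body = body[:RESOURCE_HEADER_BUFFER_LENGTH]
contains_binary_bytes = any(
    body[index:index + 1] != b"\x00" and contains_binary(body[index:index + 1])
    for index in range(len(body))
)
if not contains_binary_bytes:
    body = body.replace(b"\x00", b"")
```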

akshaysharmajs commented 2 years ago

Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?

Gallaecio commented 2 years ago

Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?

The testenv:docs entry of tox.ini does not install the same dependencies as the rest of the environments; it installs based on setup.py. (I did not check if there are import failures in other jobs, but if so, the reasons are probably similar.)

That said, maybe the tox configuration should be changed: it makes sense for the documentation job to install all dependencies just like the other test environments (in addition to the documentation dependencies). In fact, maybe the documentation job should install the extra dependencies as well.

However, I don’t recommend spending time on changing that, since things will solve themselves once we publish xtractmime on PyPI, which we should do before merging this anyway.

Gallaecio commented 2 years ago

Are the latest test failures related to these changes?

akshaysharmajs commented 2 years ago

Are the latest test failures related to these changes?

They are related to something else.

All the tests in test_responsetypes.py are passing, but the following jobs are failing with ModuleNotFoundError: No module named 'xtractmime':

tests (3.6.12, pinned)
tests (3.6.12, asyncio-pinned)
tests (pypy3, pypy3-pinned, 3.6-v7.2.0)

Gallaecio commented 2 years ago

I’m closing and reopening the pull request to trigger new CI tests. The last run had all but 4 jobs failing, with seemingly unrelated issues.

elacuesta commented 2 years ago

@akshaysharmajs Regarding the tests failing for the "pinned" environments: that's because the dependencies section for them does not inherit the main dependencies in tox.ini. Adding git+https://github.com/scrapy/xtractmime.git@binary#egg=xtractmime there should work.

akshaysharmajs commented 2 years ago
=================================== FAILURES ===================================
_______________________ FetchTest.test_redirect_default ________________________

self = <tests.test_command_fetch.FetchTest testMethod=test_redirect_default>

    @defer.inlineCallbacks
    def test_redirect_default(self):
        _, out, _ = yield self.execute([self.url('/redirect')])
>       self.assertEqual(out.strip(), b'Redirected here')

/home/runner/work/scrapy/scrapy/tests/test_command_fetch.py:20: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'Redirected here\n<memory at 0x7fd99aae3050>\n<memory at 0x7fd99aae3050>' != b'Redirected here'
___________________ ShellTest.test_response_encoding_gb18030 ___________________

self = <tests.test_command_shell.ShellTest testMethod=test_response_encoding_gb18030>

    @defer.inlineCallbacks
    def test_response_encoding_gb18030(self):
        _, out, _ = yield self.execute([self.url('/enc-gb18030'), '-c', 'response.encoding'])
>       self.assertEqual(out.strip(), b'gb18030')

/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'<memory at 0x7f4611e4f120>\ngb18030' != b'gb18030'
____________________ ShellTest.test_response_selector_html _____________________

self = <tests.test_command_shell.ShellTest testMethod=test_response_selector_html>

    @defer.inlineCallbacks
    def test_response_selector_html(self):
        xpath = 'response.xpath("//p[@class=\'one\']/text()").get()'
        _, out, _ = yield self.execute([self.url('/html'), '-c', xpath])
>       self.assertEqual(out.strip(), b'Works')

/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'<memory at 0x7f1b7de2d120>\nWorks' != b'Works'

These 3 tests are still failing.

akshaysharmajs commented 2 years ago

Should I add tests for _guess_response_type and _guess_content_type?

Gallaecio commented 2 years ago

Should I add tests for _guess_response_type and _guess_content_type?

I don’t think it is necessary as long as all their logic gets tested indirectly through other tests. It is hard to tell from the Codecov results, though; I think they won’t refresh until all tests pass.

Gallaecio commented 2 years ago

Oh, I think I know what’s happening: the stray <memory at 0x…> lines in the test output look like a leftover debug print in xtractmime.

@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)

akshaysharmajs commented 2 years ago

Oh, I think I know what’s happening: the stray <memory at 0x…> lines in the test output look like a leftover debug print in xtractmime.

@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)

👍🏼

akshaysharmajs commented 2 years ago

Well, now they are passing!

Gallaecio commented 2 years ago

I’ve just run into something that may be worth addressing as part of these changes: https://github.com/scrapy/scrapy/blob/624a1ff3e97e693e85546a54a7abba3d94bbbebb/scrapy/downloadermiddlewares/httpcompression.py#L71-L74

akshaysharmajs commented 2 years ago

I’ve just run into something that may be worth addressing as part of these changes:

https://github.com/scrapy/scrapy/blob/624a1ff3e97e693e85546a54a7abba3d94bbbebb/scrapy/downloadermiddlewares/httpcompression.py#L71-L74

Thanks for mentioning it, I will consider it, though I am struggling to find time to make the changes 😅

akshaysharmajs commented 1 year ago

@Gallaecio I think these changes are working. Please review them and let me know if you have any concerns. (Tests in some other files are failing, I don't know why.)

akshaysharmajs commented 1 year ago

I think I have figured out what's failing the tests. We are using 'http://www.example.com' as the URL, which forces content_types to be (b'application/x-msdos-program',): without a trailing slash, the URL looks like a file with a .com extension, and that extension is mapped to application/x-msdos-program. We should add a / at the end of such URLs, like 'http://www.example.com/'. Making this change makes the tests pass.
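For reference, the difference can be reproduced with the mimetypes machinery, shown here with Scrapy's bundled mime.types loaded (as scrapy/responsetypes.py does; the .com → application/x-msdos-program mapping comes from that table):

```python
from io import StringIO
from mimetypes import MimeTypes
from pkgutil import get_data

mimetypes = MimeTypes()
mimetypes.readfp(StringIO(get_data("scrapy", "mime.types").decode("utf8")))

# Without a trailing slash, the URL looks like a file with a ".com" extension:
print(mimetypes.guess_type("http://www.example.com"))   # ('application/x-msdos-program', None)
print(mimetypes.guess_type("http://www.example.com/"))  # (None, None)
```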

Gallaecio commented 1 year ago

I have done some refactoring, I hope that is OK.

I still want to test a few things myself (e.g. how these changes affect the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.

I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge https://github.com/scrapy/xtractmime/pull/12 and release the first public version.

akshaysharmajs commented 1 year ago

I have done some refactoring, I hope that is OK.

Yeah, thank you!

I still want to test a few things myself (e.g. how these changes affect the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.

No problem, let me know if I can help with that.

I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge scrapy/xtractmime#12 and release the first public version.

That would be awesome, please keep me posted!

Gallaecio commented 1 year ago

xtractmime is now published on PyPI :tada:

akshaysharmajs commented 1 year ago

xtractmime is now published on PyPI 🎉

Thank you so much 😍

Gallaecio commented 1 year ago

Note to self: make sure we are testing that, if the Content-Encoding header reports gzip but the content is plain text (e.g. b"\r\n"), HttpCompressionMiddleware does not try to decode the response and fail. Unless that is supposed to happen.

Gallaecio commented 1 year ago

I basically reverted some of the pro-backward-compatibility changes that I had previously asked Akshay to make, so these changes now follow the standard almost to the letter, with 2 exceptions (see the issue description for details).

Because the new behavior, being close to the standard, changes a lot (especially in many cases where you would previously get a text-based response class and now get a binary response class), it would be good, as a next step, to test how much these changes affect real-world scenarios.

@kmike suggested off GitHub that we build a test that uses browser automation to visit the home pages of popular domains, records the URLs downloaded while rendering each home page, and then checks, for those URLs, where the new implementation would cause a change.

Based on the results of such a test, we can determine whether we want to keep things close to the standard, whether we want to deviate and, if so, to what extent, and whether we want to allow some behavior changes (e.g. forcing a body-based response class choice, or forcing a given response class altogether) based on user input (e.g. request meta or settings).

Gallaecio commented 4 months ago

I’ve recorded URLs using browser rendering on the home page of the top 50 domains, then downloaded them with Zyte API and checked which response class each implementation would use, and analyzed the results:

record.py

```python
from collections import deque

from playwright.async_api import Response as PlaywrightResponse
from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class RecordSpider(Spider):
    name = "record"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
    queue = deque()
    domains = [
        "google.com", "youtube.com", "facebook.com", "pornhub.com", "xvideos.com",
        "twitter.com", "wikipedia.org", "instagram.com", "reddit.com", "amazon.com",
        "duckduckgo.com", "yahoo.com", "xnxx.com", "tiktok.com", "bing.com",
        "yahoo.co.jp", "weather.com", "whatsapp.com", "yandex.ru", "xhamster.com",
        "openai.com", "live.com", "microsoft.com", "microsoftonline.com", "linkedin.com",
        "quora.com", "twitch.tv", "naver.com", "netflix.com", "office.com",
        "vk.com", "globo.com", "aliexpress.com", "cnn.com", "zoom.us",
        "imdb.com", "x.com", "newyorktimes.com", "onlyfans.com", "espn.com",
        "amazon.co.jp", "pinterest.com", "uol.com.br", "ebay.com", "marca.com",
        "canva.com", "spotify.com", "bbc.com", "paypal.com", "apple.com",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        for domain in self.domains:
            yield Request(
                url=f"https://{domain}",
                meta={
                    "playwright": True,
                    "playwright_page_event_handlers": {
                        "response": "handle_response",
                    },
                },
            )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        if (
            300 <= response.status < 400
            or response.request.method != "GET"
        ):
            return
        self.queue.append({"url": response.url})

    def parse(self, response):
        while self.queue:
            yield self.queue.pop()

    def spider_idle(self, spider):
        if self.queue:
            self.crawler.engine.download(Request("data:,", callback=self.dump_queue))
            raise DontCloseSpider

    def dump_queue(self, response):
        while self.queue:
            yield self.queue.pop()
```
test.py

```python
import jsonlines
from scrapy import Request, Spider
from scrapy.responsetypes import responsetypes
from scrapy.utils.response import get_response_class


class TestSpider(Spider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
            "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
        },
        "REQUEST_FINGERPRINTER_CLASS": "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "ZYTE_API_TRANSPARENT_MODE": True,
    }

    def start_requests(self):
        with jsonlines.open("urls.jsonl") as reader:
            for item in reader:
                yield Request(item["url"])

    def parse(self, response):
        headers_bytes_dict = {
            key: b",".join(value) for key, value in response.headers.items()
        }
        old_cls = responsetypes.from_args(
            headers=headers_bytes_dict,
            url=response.url,
            body=response.body,
        )
        new_cls = get_response_class(
            http_headers=headers_bytes_dict,
            url=response.url,
            body=response.body,
        )
        yield {
            "url": response.url,
            "old_cls": old_cls.__name__,
            "new_cls": new_cls.__name__,
            "changed": new_cls is not old_cls,
        }
```
analyze.py

```python
import jsonlines


def main():
    binary = gif_text = gif_html = json = svg = 0
    unexpected = []
    with jsonlines.open("result.jsonl") as reader:
        for item in reader:
            if not item["changed"]:
                continue
            if item["url"].startswith("https://log.go.com/") and item["old_cls"] == "TextResponse" and item["new_cls"] == "Response":
                binary += 1
                binary_example = item["url"]
                continue
            if "pagead2.googlesyndication.com" in item["url"] and item["old_cls"] == "TextResponse" and item["new_cls"] == "Response":
                gif_text += 1
                gif_text_example = item["url"]
                continue
            if item["url"] == "https://www.marca.com/ue-cdn/services/cdn_cookie_service.html" and item["old_cls"] == "HtmlResponse" and item["new_cls"] == "Response":
                gif_html += 1
                gif_html_example = item["url"]
                continue
            if "openaicom-api" in item["url"] and item["old_cls"] == "TextResponse" and item["new_cls"] == "JsonResponse":
                json += 1
                json_example = item["url"]
                continue
            if ("svg" in item["url"] or "https://static.licdn.com/aero-v1" in item["url"]) and item["old_cls"] == "TextResponse" and item["new_cls"] == "XmlResponse":
                svg += 1
                svg_example = item["url"]
                continue
            unexpected.append(item)
    print(f"Binary (Content-Type: application/octet-stream): TextResponse → Response: {binary} (e.g. {binary_example})")
    print(f"GIF: HtmlResponse → Response: {gif_html} (e.g. {gif_html_example})")
    print(f"GIF: TextResponse → Response: {gif_text} (e.g. {gif_text_example})")
    print(f"JSON: TextResponse → JsonResponse: {json} (e.g. {json_example})")
    print(f"SVG: TextResponse → XmlResponse: {svg} (e.g. {svg_example})")
    print(f"Unexpected: {len(unexpected)}")
    for item in unexpected:
        print(item)


if __name__ == "__main__":
    main()
```
stdout

```none
Binary (Content-Type: application/octet-stream): TextResponse → Response: 1 (e.g. https://log.go.com/log?appid=DTCI-ONEID-UI&client_id=ESPN-ONESITE.WEB-PROD&sdk_version=web%204.3.209&lightbox_version=4.3.209&timestamp=1703838314840&action_name=event%3Aerror&info=payload-included(true)%2Cevent-payload(Session%20not%20established)&context=direct&source=espndeportes&conversation_id=4ac397fb-583f-4a4a-a18c-10ef40d283ba&trace=0%7CJIOWBVgQQGWAtKkDyIB8BDAdgeywTwFsBLALwFMAfUCaORFdAYwAtymBrAMRwCdkAQsnA4O5LNTCRYCJMFRpe5DABMAyuQDOm4nkpqAomrXyQAfTXgkBs1bUBpNWlbsOagC4Z35IA%3D%3D%3D&swid=AEE051E5-D27D-44D6-C450-CF9B07FF5CFA&anon=true)
GIF: HtmlResponse → Response: 1 (e.g. https://www.marca.com/ue-cdn/services/cdn_cookie_service.html)
GIF: TextResponse → Response: 19 (e.g. https://pagead2.googlesyndication.com/pagead/gen_204?id=sodar&v=44&t=2&bgai=Bsu5LWoKOZZHrM4XqbfPtv7gNAAAAADgB4AQC&bg=!nZ6lntHNAAY3kmNgF5I7ADQBe5WfOKbIZawkXli-YRHlLRndTLotIg35tDrBxDRAIB3_z-1ScB1aBl2SNhvIsx0X9hnwAgAAAJ9SAAAAA2gBB5kDDKX9jqoKyo7DRnWujuCMJcOigGqPepE_4ucKRvfzZPhlY2AtDHMghw4duel26Lk-hOwFmDGQzM3knweK8YPbqQkrDbgmBnkNvzQuyOeNdm1tj2UdaWEkoHyYgXXI1zhkWqR76QYbnPSHTohx1C1yWms9dir12QvN78KMrSIqfOXT0QnAIqgnNzvqW_iN8K7FWIdKQ9B4XVVnSAID-GkhbkOLSJQKrrqj77Vl0cwS8kDPov-MkFlaDYd0VW0whLbTttIQfXm-7weVupE2IR8l9_1uxiF_8ECLlWkibPHB8ZDFxDdA2HnnGbbVFHi0r6mJ8hyu4f9BPVMhSf0QqkQdalnOlo-0g7o2oRWORPFwkXMwMfxTxLqndYiCj1stFL5mcZ7nJqjE5X-vSJ0ok6RzXqF6Fab8-JV1iiL1vE-o2Re7fWBtNrl6A42qzFKO-aLXhBtsWBm_2bDjtBE9kJzJJNEURmOl8QQ0ZT2oV5WZYmve0w19XelZDz0joQA9OomLJ6Nj5MK4XRHh6yDkhVVDkTzmwWccDzSrJedQDJ7DOMZ0Ch02WoeWerIXIGJ_OPh1t8Yg4wsPWSkS1GvFsMl2KXSmq5iTDpQHpDVrEvm74Dtb3mWL-S51nfNtMHShUreQHIkQVT-5QhYn1YeyxUeLToq7hyFhMzUyFw6cveXg2eay50ZTy_5O3yF1KDCm8GjPhHsbRPtXw9bKyolKKSR5Jv-VYWQDs8Hn3muilx2rQVN93aXB0pHjPbgJHNRBHuoyjHVKvtcNVuSY5wHbLzHXAfetYjoydgvalpXLDOlDasxzrufPGJ-g61VsgTguyqcERqNLNwfHlncDAs5CxpzCQIfK4D0r_HpytvYBznpS3lbhXXW4oy5KhDtsx9gyni7vLQQwlSw6ou10sUPe17tcSbgJ2VPiLnoHCjOAvybNxMQpcaWdVIXss1Tc8Bu8jjR6bydhJJRhLGny6hHXfdHsB8Orm-R3gJ0CWXM1a6LguTt6NLmxl3UVvKfjN0Ylgz-LT9C8jv0s1m-T5m87fg)
JSON: TextResponse → JsonResponse: 2 (e.g. https://openaicom-api-bdcpf8c6d2e9atf6.z01.azurefd.net/api/v1/blog-details?sort=-publicationDate%2C-createdAt&page%5Bsize%5D=4&include=media%2Ctopics&filter%5Bpublished%5D=true)
SVG: TextResponse → XmlResponse: 213 (e.g. https://assets.espn.com/i/espnplus/espnPlusAllBlack.svg)
Unexpected: 2
{'url': 'https://s3.glbimg.com/v1/AUTH_3ed1877db4dd4c6b9b8f505e9d4fab03/globoid-js/v1.10.0/globoid-js.min.js?loading-agent=global-webdeps', 'old_cls': 'TextResponse', 'new_cls': 'Response', 'changed': True}
{'url': 'https://fourier.aliexpress.com/ts?url=&token=BAkJZTgpiDZVk3TCmUfKjc76EzxjVv2IXXM8s6t-hfAv8ikE86YNWPckME_EsZXA&cna=aHAVHsIO3FQCAS4GBFnpYbPW&ext=1', 'old_cls': 'TextResponse', 'new_cls': 'Response', 'changed': True}
```

Overall the results seem to be for the better, but I have to look further into 2 of the URLs to understand why the new implementation treats them as binary.

Gallaecio commented 4 months ago

https://github.com/scrapy/xtractmime/pull/18 fixes the first unexpected scenario.

Gallaecio commented 4 months ago

The other unexpected scenario seems to be related to website bans: in a browser the Content-Type is application/json, but Scrapy gets image/gif, in which case Response is the right class.

Gallaecio commented 4 months ago

@kmike Should we extend the tests to more domains, to test some additional scenarios?

kmike commented 4 months ago

@Gallaecio this is a great start 👍

Yes, I think we should extend the test to more domains (1000-10000), including unpopular domains.

Gallaecio commented 4 months ago

I found https://github.com/opendns/public-domain-lists and went with the combination of the 2 lists (20 000 domains), even though many of those domains are not active anymore since the repository is 10 years old.

This resulted in 448 800 URLs being analyzed. In 18 435 (4%) cases the response class was different in the new implementation.

By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse. But I don’t think the numbers matter much here; it’s about deciding how we want to handle even the rarer scenarios.

We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json, use XmlResponse instead of HtmlResponse for XHTML documents (5), and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.

The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).

~However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.~

~I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas:~

Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.

~Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).~

In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.

And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:

kmike commented 4 months ago

By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse.

This seems like a good change.

We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json

Sounds good

use XmlResponse instead of HtmlResponse for XHTML documents (5)

It seems the largest change here is that the selector type is changed from html to xml. According to https://lxml.de/parsing.html, that is good ("HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware."), but I don't know what it means in practice. Anyway, probably a good change.

and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.

Sounds good, although I wonder if it can break some selectors.

The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).

This seems fine.

However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.

I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas: ...

We may also err on the side of returning TextResponse. Do we have a way to explicitly detect binary data now? I.e. add explicit binary detection to our code and, if the type is unknown, use TextResponse by default. But I'm not sure; it also needs some kind of experiment.
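A rough sketch of that idea, reusing the contains_binary helper seen earlier in this PR (its import path and the buffer length constant are assumptions here, not confirmed API):

```python
from scrapy.http import Response, TextResponse
from xtractmime import RESOURCE_HEADER_BUFFER_LENGTH  # assumed import path
from xtractmime.utils import contains_binary  # assumed import path


def fallback_response_class(body: bytes) -> type:
    # For unknown or custom MIME types, prefer TextResponse unless the start
    # of the body looks binary.
    if contains_binary(body[:RESOURCE_HEADER_BUFFER_LENGTH]):
        return Response
    return TextResponse
```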

Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.

👍

Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).

Makes sense, why not.

In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.

I think that's fine.

And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:

(I'll stop for now, answer later)

Gallaecio commented 3 months ago

I went with “Re-run xtractmime ignoring Content-Type when the result is Response” (https://github.com/scrapy/scrapy/pull/5204/commits/13bc1499ac2191d39bf9643e6c3b3770a2207240).