
Integrating xtractmime into Scrapy #5204

Open akshaysharmajs opened 2 years ago

akshaysharmajs commented 2 years ago

As per the discussion with @elacuesta and @Gallaecio, this PR integrates the xtractmime library into Scrapy for MIME sniffing.

Fixes #2900, fixes #4240.

Changes

Behavior:

API:

Not implemented

To-do

codecov[bot] commented 2 years ago

Codecov Report

Merging #5204 (13bc149) into master (1c9d308) will increase coverage by 0.03%. The diff coverage is 94.36%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #5204      +/-   ##
==========================================
+ Coverage   88.55%   88.59%   +0.03%
==========================================
  Files         160      160
  Lines       11607    11689      +82
  Branches     1883     1905      +22
==========================================
+ Hits        10279    10356      +77
- Misses       1003     1007       +4
- Partials      325      326       +1
```

| Files | Coverage Δ | |
|---|---|---|
| scrapy/core/downloader/handlers/datauri.py | `100.00% <100.00%> (+5.88%)` | :arrow_up: |
| scrapy/core/downloader/handlers/file.py | `100.00% <100.00%> (ø)` | |
| scrapy/core/downloader/handlers/ftp.py | `98.38% <100.00%> (ø)` | |
| scrapy/core/downloader/handlers/http11.py | `93.97% <100.00%> (ø)` | |
| scrapy/core/downloader/webclient.py | `94.77% <100.00%> (ø)` | |
| scrapy/core/http2/stream.py | `91.90% <100.00%> (ø)` | |
| scrapy/extensions/httpcache.py | `95.47% <100.00%> (ø)` | |
| scrapy/http/request/form.py | `94.65% <100.00%> (-0.05%)` | :arrow_down: |
| scrapy/http/response/text.py | `98.50% <100.00%> (+0.06%)` | :arrow_up: |
| scrapy/linkextractors/lxmlhtml.py | `96.21% <100.00%> (+0.65%)` | :arrow_up: |
| … and 4 more | | |
akshaysharmajs commented 2 years ago

Looking at https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py, I think from_args is the main function required by other Scrapy files for MIME sniffing. I think calling xtractmime.extract_mime with different parameters, based on what arguments are passed to from_args, will be good. I am not sure whether the other functions in responsetypes.py are still required.

Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what can be added to it; application/pdf could be one.

Gallaecio commented 2 years ago

[…] I think from_args is the main function […] for MIME sniffing. I think calling xtractmime.extract_mime […] to from_args will be good.

I think so too.

I am not sure whether the other functions in responsetypes.py are still required.

xtractmime will basically replace all other methods there. We will need to keep them around for backward compatibility, but I imagine that, as part of this pull request, we should have them all log a warning except for from_args.
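For illustration, the kind of warning suggested here could look roughly like this (a sketch only; the exact warning class and message are not part of this PR yet):

```python
import warnings

from scrapy.exceptions import ScrapyDeprecationWarning


class ResponseTypes:
    def from_mimetype(self, mimetype):
        # Keep the method working for backward compatibility, but warn that
        # from_args (backed by xtractmime) is the supported entry point now.
        warnings.warn(
            "ResponseTypes.from_mimetype() is deprecated; use from_args() instead.",
            ScrapyDeprecationWarning,
            stacklevel=2,
        )
        ...  # existing lookup logic would remain here
```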

Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what can be added to it; application/pdf could be one.

We don’t need additional classes.

Response is the right class for any binary response (e.g. PDF), and it is already used for any MIME type not mapped in CLASSES, so there’s nothing you need to do about binary MIME types.

If you can think of additional MIME types that make sense for one of the existing Response subclasses (HtmlResponse, XmlResponse, TextResponse), then please do feel free to update CLASSES accordingly.

Related to that, although not achievable simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.

mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response
akshaysharmajs commented 2 years ago

Related to that, although not achievable simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.

mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response

That's a great idea, I will add this functionality to xtractmime 👍🏼
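For reference, such helpers could look roughly like this, following the MIME type groups defined in the WHATWG standard (a sketch; the names mirror the example above, and the eventual xtractmime API may differ):

```python
def is_html_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#html-mime-type
    return mime_type == b"text/html"


def is_xml_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#xml-mime-type: ends in "+xml",
    # or is exactly text/xml or application/xml.
    return mime_type.endswith(b"+xml") or mime_type in (b"text/xml", b"application/xml")


def is_json_mime_type(mime_type: bytes) -> bool:
    # https://mimesniff.spec.whatwg.org/#json-mime-type
    return mime_type.endswith(b"+json") or mime_type in (b"application/json", b"text/json")
```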

akshaysharmajs commented 2 years ago

What should the value of the supported_types parameter of extract_mime be? Is it required here or not?

Gallaecio commented 2 years ago

A similar thing applies to nosniff. In the future we may want to expose a Scrapy setting to allow users to force sniffing regardless of X-Content-Type-Options. But since that feature is not in the current implementation, and we wouldn’t expect it to be used extensively, I think it is OK to leave that out until a user requests that feature.

When I said this, I did not mean for you to remove your related code. I meant that, at some point in the future, users may ask to be able to send a custom value for this parameter to xtractmime, overriding whatever the X-Content-Type-Options header says, but that, as far as this pull request goes, relying on X-Content-Type-Options would be OK.

However, come to think of it, X-Content-Type-Options could be exploited to prevent the use of a specialized response class. So maybe it is better not to rely on X-Content-Type-Options for now, and maybe in the future make it possible to rely on it, opt-in, through a setting.
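For clarity, “relying on X-Content-Type-Options” would mean something along these lines (a sketch only; the extract_mime keyword arguments, such as content_types and no_sniff, are assumptions based on the parameters discussed in this thread):

```python
from xtractmime import extract_mime


def sniff_mime(body: bytes, headers: dict):
    # `headers` is a plain dict of bytes keys/values here for simplicity.
    content_type = headers.get(b"Content-Type")
    no_sniff = headers.get(b"X-Content-Type-Options", b"").strip().lower() == b"nosniff"
    return extract_mime(
        body,
        content_types=(content_type,) if content_type else None,
        no_sniff=no_sniff,  # parameter name assumed
    )
```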

akshaysharmajs commented 2 years ago

I have added the pre- and post-xtractmime tests with the expected behavior as comments. There may be more failing scenarios; if I find any, I will add them later. Still, a lot of tests are failing.

akshaysharmajs commented 2 years ago

E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>

This is failing because mimetypes.MimeTypes() is returning a text/xml content type instead of application/gzip:

>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>> 
akshaysharmajs commented 2 years ago

E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>

This is failing because we are not considering the NULL byte anymore, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.

If you want, I can update the existing comments for the tests based on the updated behavior.

Gallaecio commented 2 years ago

E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>

This is failing because mimetypes.MimeTypes() is returning a text/xml content type instead of application/gzip:

>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>> 

It looks like we need more complex logic than just taking the first item in the tuple that guess_type returns.

Based on https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type, I think that if the second item of the tuple is not None, we should interpret the MIME type as application/<tuple second value>.
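For example, that interpretation could be implemented along these lines (a sketch; guess_mime_type is a hypothetical helper, not existing Scrapy code):

```python
from mimetypes import MimeTypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    # guess_type() returns (type, encoding); a non-None encoding (e.g. "gzip"
    # for data.xml.gz) means the outer container is what should be reported.
    mime_type, encoding = MimeTypes().guess_type(filename)
    if encoding is not None:
        return f"application/{encoding}"
    return mime_type


# guess_mime_type("data.xml.gz") -> "application/gzip"
# guess_mime_type("data.xml")    -> "text/xml"
```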

Gallaecio commented 2 years ago

E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>

This is failing because we are not considering the NULL byte anymore, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.

If you want, I can update the existing comments for the tests based on the updated behavior.

Actually, I believe the current NULL byte replacement is too simple: we should only replace NULL bytes if there are no other binary data bytes, but the current approach always replaces them. See https://github.com/scrapy/scrapy/pull/5204#discussion_r679468793

akshaysharmajs commented 2 years ago

I thought the integration part would be simpler, I was wrong 😅

akshaysharmajs commented 2 years ago

Actually, I believe the current NULL byte replacement is too simple: we should only replace NULL bytes if there are no other binary data bytes, but the current approach always replaces them. See #5204 (comment)

Currently, I am checking the whole body for binary bytes:

        for index in range(len(body)):
            if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
                contains_binary_bytes = True
                break

        if not contains_binary_bytes:
            body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?

akshaysharmajs commented 2 years ago

I have created a separate PR for the response class computation using mimegroups. Please review https://github.com/akshaysharmajs/scrapy/pull/2/files

Gallaecio commented 2 years ago
        for index in range(len(body)):
            if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
                contains_binary_bytes = True
                break

        if not contains_binary_bytes:
            body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?

I think so, yes. You could just set body = body[:RESOURCE_HEADER_BUFFER_LENGTH] before the code above, and work with just body in this code.
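Put together, the suggestion could look roughly like this (a sketch; contains_binary and RESOURCE_HEADER_BUFFER_LENGTH are the names already used in this PR):

```python
# Truncate first, then look for binary bytes other than NULL in the truncated
# buffer, and only strip NULL bytes if none are found.
body = body[:RESOURCE_HEADER_BUFFER_LENGTH]
contains_binary_bytes = any(
    body[index:index + 1] != b"\x00" and contains_binary(body[index:index + 1])
    for index in range(len(body))
)
if not contains_binary_bytes:
    body = body.replace(b"\x00", b"")
```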

akshaysharmajs commented 2 years ago

Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?

Gallaecio commented 2 years ago

Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?

The testenv:docs entry of tox.ini does not install the same dependencies as the rest of the environments; it installs based on setup.py. (I did not check if there are import failures in other jobs, but if so, the reasons are probably similar.)

That said, maybe the tox configuration should be changed: it makes sense for the documentation job to install all dependencies just like the other test environments (in addition to the documentation dependencies). In fact, maybe the documentation job should install the extra dependencies as well.

However, I don’t recommend spending time on changing that, since things will solve themselves once we publish xtractmime on PyPI, which we should do before merging this anyway.

Gallaecio commented 2 years ago

Are the latest test failures related to these changes?

akshaysharmajs commented 2 years ago

Are the latest test failures related to these changes?

They are related to something else.

All the tests in test_responsetypes.py are passing, but the following jobs are failing with ModuleNotFoundError: No module named 'xtractmime':

tests (3.6.12, pinned)
tests (3.6.12, asyncio-pinned)
tests (pypy3, pypy3-pinned, 3.6-v7.2.0)

Gallaecio commented 2 years ago

I’m closing and reopening the pull request to trigger new CI tests. The last run had all but 4 jobs failing, with seemingly unrelated issues.

elacuesta commented 2 years ago

@akshaysharmajs Regarding the tests failing for the "pinned" environments: that's because the dependencies section for them does not inherit the main dependencies in tox.ini. Adding git+https://github.com/scrapy/xtractmime.git@binary#egg=xtractmime there should work.

akshaysharmajs commented 2 years ago
=================================== FAILURES ===================================
_______________________ FetchTest.test_redirect_default ________________________

self = <tests.test_command_fetch.FetchTest testMethod=test_redirect_default>

    @defer.inlineCallbacks
    def test_redirect_default(self):
        _, out, _ = yield self.execute([self.url('/redirect')])
>       self.assertEqual(out.strip(), b'Redirected here')

/home/runner/work/scrapy/scrapy/tests/test_command_fetch.py:20: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'Redirected here\n<memory at 0x7fd99aae3050>\n<memory at 0x7fd99aae3050>' != b'Redirected here'
___________________ ShellTest.test_response_encoding_gb18030 ___________________

self = <tests.test_command_shell.ShellTest testMethod=test_response_encoding_gb18030>

    @defer.inlineCallbacks
    def test_response_encoding_gb18030(self):
        _, out, _ = yield self.execute([self.url('/enc-gb18030'), '-c', 'response.encoding'])
>       self.assertEqual(out.strip(), b'gb18030')

/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'<memory at 0x7f4611e4f120>\ngb18030' != b'gb18030'
____________________ ShellTest.test_response_selector_html _____________________

self = <tests.test_command_shell.ShellTest testMethod=test_response_selector_html>

    @defer.inlineCallbacks
    def test_response_selector_html(self):
        xpath = 'response.xpath("//p[@class=\'one\']/text()").get()'
        _, out, _ = yield self.execute([self.url('/html'), '-c', xpath])
>       self.assertEqual(out.strip(), b'Works')

/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
    super().assertEqual(first, second, msg)
E   twisted.trial.unittest.FailTest: b'<memory at 0x7f1b7de2d120>\nWorks' != b'Works'

These 3 tests are still failing.

akshaysharmajs commented 2 years ago

Should I add tests for _guess_response_type and _guess_content_type?

Gallaecio commented 2 years ago

Should I add tests for _guess_response_type and _guess_content_type?

I don’t think it is necessary as long as all their logic gets tested indirectly through other tests. It is hard to tell from the Codecov results, though; I think they won’t refresh until all tests pass.

Gallaecio commented 2 years ago

Oh, I think I know what’s happening: the stray <memory at 0x…> lines in the test output look like a leftover debug print in xtractmime.

@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)

akshaysharmajs commented 2 years ago

Oh, I think I know what’s happening: the stray <memory at 0x…> lines in the test output look like a leftover debug print in xtractmime.

@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)

👍🏼

akshaysharmajs commented 2 years ago

Well, now they are passing!

Gallaecio commented 2 years ago

I’ve just run into something that may be worth addressing as part of these changes: https://github.com/scrapy/scrapy/blob/624a1ff3e97e693e85546a54a7abba3d94bbbebb/scrapy/downloadermiddlewares/httpcompression.py#L71-L74

akshaysharmajs commented 2 years ago

I’ve just run into something that may be worth addressing as part of these changes:

https://github.com/scrapy/scrapy/blob/624a1ff3e97e693e85546a54a7abba3d94bbbebb/scrapy/downloadermiddlewares/httpcompression.py#L71-L74

Thanks for mentioning it, I will consider it, though I am struggling to find time to make the changes 😅

akshaysharmajs commented 1 year ago

@Gallaecio I think these changes are working. Please review them and let me know if you have any concerns. (Tests in some other files are failing, I don't know why.)

akshaysharmajs commented 1 year ago

I think I have figured out what's failing the tests. We are using 'http://www.example.com' as the URL, which forces content_types to be (b'application/x-msdos-program',): without a trailing slash, the URL looks like a file with a .com extension, and that extension is mapped to application/x-msdos-program. We should add a / at the end of such URLs, like 'http://www.example.com/'. Making this change makes the tests pass.
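For reference, the difference can be reproduced with the mimetypes machinery, shown here with Scrapy's bundled mime.types loaded (as scrapy/responsetypes.py does; the .com → application/x-msdos-program mapping comes from that table):

```python
from io import StringIO
from mimetypes import MimeTypes
from pkgutil import get_data

mimetypes = MimeTypes()
mimetypes.readfp(StringIO(get_data("scrapy", "mime.types").decode("utf8")))

# Without a trailing slash, the URL looks like a file with a ".com" extension:
print(mimetypes.guess_type("http://www.example.com"))   # ('application/x-msdos-program', None)
print(mimetypes.guess_type("http://www.example.com/"))  # (None, None)
```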

Gallaecio commented 1 year ago

I have done some refactoring, I hope that is OK.

I still want to test a few things myself (e.g. how these changes affect the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.

I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge https://github.com/scrapy/xtractmime/pull/12 and release the first public version.

akshaysharmajs commented 1 year ago

I have done some refactoring, I hope that is OK.

Yeah, thank you!

I still want to test a few things myself (e.g. how these changes affect the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.

No problem, let me know if I can help with that.

I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge scrapy/xtractmime#12 and release the first public version.

That would be awesome, please keep me posted!

Gallaecio commented 1 year ago

xtractmime is now published on PyPI :tada:

akshaysharmajs commented 1 year ago

xtractmime is now published on PyPI 🎉

Thank you so much 😍

Gallaecio commented 1 year ago

Note to self: make sure we are testing that, if the Content-Encoding header reports gzip but the content is plain text (e.g. b"\r\n"), HttpCompressionMiddleware does not try to decode the response and fail. Unless that is supposed to happen.

Gallaecio commented 1 year ago

I basically reverted some of the pro-backward-compatibility changes that I had previously asked Akshay to make, so these changes now follow the standard almost to the letter, with 2 exceptions (see the issue description for details).

Because the new behavior, being close to the standard, changes a lot (especially in many cases where you would previously get a text-based response class and now get a binary response class), it would be good, as a next step, to test how much these changes affect real-world scenarios.

@kmike suggested off GitHub that we build a test that uses browser automation to visit the home pages of popular domains, records the URLs downloaded while rendering each home page, and then checks, for those URLs, where the new implementation would cause a change.

Based on the results of such a test, we can determine whether we want to keep things close to the standard, whether we want to deviate and, if so, to what extent, and whether we want to allow some behavior changes (e.g. forcing a body-based response class choice, or forcing a given response class altogether) based on user input (e.g. request meta or settings).

Gallaecio commented 4 months ago

I’ve recorded URLs using browser rendering on the home page of the top 50 domains, then downloaded them with Zyte API and checked which response class each implementation would use, and analyzed the results:

record.py

```python
from collections import deque

from playwright.async_api import Response as PlaywrightResponse
from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class RecordSpider(Spider):
    name = "record"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
    queue = deque()
    domains = [
        "google.com", "youtube.com", "facebook.com", "pornhub.com", "xvideos.com",
        "twitter.com", "wikipedia.org", "instagram.com", "reddit.com", "amazon.com",
        "duckduckgo.com", "yahoo.com", "xnxx.com", "tiktok.com", "bing.com",
        "yahoo.co.jp", "weather.com", "whatsapp.com", "yandex.ru", "xhamster.com",
        "openai.com", "live.com", "microsoft.com", "microsoftonline.com", "linkedin.com",
        "quora.com", "twitch.tv", "naver.com", "netflix.com", "office.com",
        "vk.com", "globo.com", "aliexpress.com", "cnn.com", "zoom.us",
        "imdb.com", "x.com", "newyorktimes.com", "onlyfans.com", "espn.com",
        "amazon.co.jp", "pinterest.com", "uol.com.br", "ebay.com", "marca.com",
        "canva.com", "spotify.com", "bbc.com", "paypal.com", "apple.com",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        for domain in self.domains:
            yield Request(
                url=f"https://{domain}",
                meta={
                    "playwright": True,
                    "playwright_page_event_handlers": {
                        "response": "handle_response",
                    },
                },
            )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        if (
            300 <= response.status < 400
            or response.request.method != "GET"
        ):
            return
        self.queue.append({"url": response.url})

    def parse(self, response):
        while self.queue:
            yield self.queue.pop()

    def spider_idle(self, spider):
        if self.queue:
            self.crawler.engine.download(Request("data:,", callback=self.dump_queue))
            raise DontCloseSpider

    def dump_queue(self, response):
        while self.queue:
            yield self.queue.pop()
```
test.py

```python
import jsonlines
from scrapy import Request, Spider
from scrapy.responsetypes import responsetypes
from scrapy.utils.response import get_response_class


class TestSpider(Spider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
            "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
        },
        "REQUEST_FINGERPRINTER_CLASS": "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "ZYTE_API_TRANSPARENT_MODE": True,
    }

    def start_requests(self):
        with jsonlines.open("urls.jsonl") as reader:
            for item in reader:
                yield Request(item["url"])

    def parse(self, response):
        headers_bytes_dict = {
            key: b",".join(value) for key, value in response.headers.items()
        }
        old_cls = responsetypes.from_args(
            headers=headers_bytes_dict,
            url=response.url,
            body=response.body,
        )
        new_cls = get_response_class(
            http_headers=headers_bytes_dict,
            url=response.url,
            body=response.body,
        )
        yield {
            "url": response.url,
            "old_cls": old_cls.__name__,
            "new_cls": new_cls.__name__,
            "changed": new_cls is not old_cls,
        }
```
analyze.py

```python
import jsonlines


def main():
    binary = gif_text = gif_html = json = svg = 0
    unexpected = []
    with jsonlines.open("result.jsonl") as reader:
        for item in reader:
            if not item["changed"]:
                continue
            if item["url"].startswith("https://log.go.com/") and item["old_cls"] == "TextResponse" and item["new_cls"] == "Response":
                binary += 1
                binary_example = item["url"]
                continue
            if "pagead2.googlesyndication.com" in item["url"] and item["old_cls"] == "TextResponse" and item["new_cls"] == "Response":
                gif_text += 1
                gif_text_example = item["url"]
                continue
            if item["url"] == "https://www.marca.com/ue-cdn/services/cdn_cookie_service.html" and item["old_cls"] == "HtmlResponse" and item["new_cls"] == "Response":
                gif_html += 1
                gif_html_example = item["url"]
                continue
            if "openaicom-api" in item["url"] and item["old_cls"] == "TextResponse" and item["new_cls"] == "JsonResponse":
                json += 1
                json_example = item["url"]
                continue
            if ("svg" in item["url"] or "https://static.licdn.com/aero-v1" in item["url"]) and item["old_cls"] == "TextResponse" and item["new_cls"] == "XmlResponse":
                svg += 1
                svg_example = item["url"]
                continue
            unexpected.append(item)
    print(f"Binary (Content-Type: application/octet-stream): TextResponse → Response: {binary} (e.g. {binary_example})")
    print(f"GIF: HtmlResponse → Response: {gif_html} (e.g. {gif_html_example})")
    print(f"GIF: TextResponse → Response: {gif_text} (e.g. {gif_text_example})")
    print(f"JSON: TextResponse → JsonResponse: {json} (e.g. {json_example})")
    print(f"SVG: TextResponse → XmlResponse: {svg} (e.g. {svg_example})")
    print(f"Unexpected: {len(unexpected)}")
    for item in unexpected:
        print(item)


if __name__ == "__main__":
    main()
```
stdout

```none
Binary (Content-Type: application/octet-stream): TextResponse → Response: 1 (e.g. https://log.go.com/log?appid=DTCI-ONEID-UI&client_id=ESPN-ONESITE.WEB-PROD&sdk_version=web%204.3.209&lightbox_version=4.3.209&timestamp=1703838314840&action_name=event%3Aerror&info=payload-included(true)%2Cevent-payload(Session%20not%20established)&context=direct&source=espndeportes&conversation_id=4ac397fb-583f-4a4a-a18c-10ef40d283ba&trace=0%7CJIOWBVgQQGWAtKkDyIB8BDAdgeywTwFsBLALwFMAfUCaORFdAYwAtymBrAMRwCdkAQsnA4O5LNTCRYCJMFRpe5DABMAyuQDOm4nkpqAomrXyQAfTXgkBs1bUBpNWlbsOagC4Z35IA%3D%3D%3D&swid=AEE051E5-D27D-44D6-C450-CF9B07FF5CFA&anon=true)
GIF: HtmlResponse → Response: 1 (e.g. https://www.marca.com/ue-cdn/services/cdn_cookie_service.html)
GIF: TextResponse → Response: 19 (e.g. https://pagead2.googlesyndication.com/pagead/gen_204?id=sodar&v=44&t=2&bgai=Bsu5LWoKOZZHrM4XqbfPtv7gNAAAAADgB4AQC&bg=!nZ6lntHNAAY3kmNgF5I7ADQBe5WfOKbIZawkXli-YRHlLRndTLotIg35tDrBxDRAIB3_z-1ScB1aBl2SNhvIsx0X9hnwAgAAAJ9SAAAAA2gBB5kDDKX9jqoKyo7DRnWujuCMJcOigGqPepE_4ucKRvfzZPhlY2AtDHMghw4duel26Lk-hOwFmDGQzM3knweK8YPbqQkrDbgmBnkNvzQuyOeNdm1tj2UdaWEkoHyYgXXI1zhkWqR76QYbnPSHTohx1C1yWms9dir12QvN78KMrSIqfOXT0QnAIqgnNzvqW_iN8K7FWIdKQ9B4XVVnSAID-GkhbkOLSJQKrrqj77Vl0cwS8kDPov-MkFlaDYd0VW0whLbTttIQfXm-7weVupE2IR8l9_1uxiF_8ECLlWkibPHB8ZDFxDdA2HnnGbbVFHi0r6mJ8hyu4f9BPVMhSf0QqkQdalnOlo-0g7o2oRWORPFwkXMwMfxTxLqndYiCj1stFL5mcZ7nJqjE5X-vSJ0ok6RzXqF6Fab8-JV1iiL1vE-o2Re7fWBtNrl6A42qzFKO-aLXhBtsWBm_2bDjtBE9kJzJJNEURmOl8QQ0ZT2oV5WZYmve0w19XelZDz0joQA9OomLJ6Nj5MK4XRHh6yDkhVVDkTzmwWccDzSrJedQDJ7DOMZ0Ch02WoeWerIXIGJ_OPh1t8Yg4wsPWSkS1GvFsMl2KXSmq5iTDpQHpDVrEvm74Dtb3mWL-S51nfNtMHShUreQHIkQVT-5QhYn1YeyxUeLToq7hyFhMzUyFw6cveXg2eay50ZTy_5O3yF1KDCm8GjPhHsbRPtXw9bKyolKKSR5Jv-VYWQDs8Hn3muilx2rQVN93aXB0pHjPbgJHNRBHuoyjHVKvtcNVuSY5wHbLzHXAfetYjoydgvalpXLDOlDasxzrufPGJ-g61VsgTguyqcERqNLNwfHlncDAs5CxpzCQIfK4D0r_HpytvYBznpS3lbhXXW4oy5KhDtsx9gyni7vLQQwlSw6ou10sUPe17tcSbgJ2VPiLnoHCjOAvybNxMQpcaWdVIXss1Tc8Bu8jjR6bydhJJRhLGny6hHXfdHsB8Orm-R3gJ0CWXM1a6LguTt6NLmxl3UVvKfjN0Ylgz-LT9C8jv0s1m-T5m87fg)
JSON: TextResponse → JsonResponse: 2 (e.g. https://openaicom-api-bdcpf8c6d2e9atf6.z01.azurefd.net/api/v1/blog-details?sort=-publicationDate%2C-createdAt&page%5Bsize%5D=4&include=media%2Ctopics&filter%5Bpublished%5D=true)
SVG: TextResponse → XmlResponse: 213 (e.g. https://assets.espn.com/i/espnplus/espnPlusAllBlack.svg)
Unexpected: 2
{'url': 'https://s3.glbimg.com/v1/AUTH_3ed1877db4dd4c6b9b8f505e9d4fab03/globoid-js/v1.10.0/globoid-js.min.js?loading-agent=global-webdeps', 'old_cls': 'TextResponse', 'new_cls': 'Response', 'changed': True}
{'url': 'https://fourier.aliexpress.com/ts?url=&token=BAkJZTgpiDZVk3TCmUfKjc76EzxjVv2IXXM8s6t-hfAv8ikE86YNWPckME_EsZXA&cna=aHAVHsIO3FQCAS4GBFnpYbPW&ext=1', 'old_cls': 'TextResponse', 'new_cls': 'Response', 'changed': True}
```

Overall the results seem to be for the better, but I have to look further into 2 of the URLs to understand why the new implementation treats them as binary.

Gallaecio commented 4 months ago

https://github.com/scrapy/xtractmime/pull/18 fixes the first unexpected scenario.

Gallaecio commented 4 months ago

The other unexpected scenario seems to be related to website bans: in a browser the Content-Type is application/json, but Scrapy gets image/gif, in which case Response is the right class.

Gallaecio commented 4 months ago

@kmike Should we extend the tests to more domains, to test some additional scenarios?

kmike commented 4 months ago

@Gallaecio this is a great start 👍

Yes, I think we should extend the test to more domains (1000-10000), including unpopular domains.

Gallaecio commented 4 months ago

I found https://github.com/opendns/public-domain-lists and went with the combination of the 2 lists (20 000 domains), even though many of those domains are not active anymore since the repository is 10 years old.

This resulted in 448 800 URLs being analyzed. In 18 435 (4%) cases the response class was different in the new implementation.

By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse. But I don’t think the numbers matter much here; it’s about deciding how we want to handle even the rarer scenarios.

We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json, use XmlResponse instead of HtmlResponse for XHTML documents (5), and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.

The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).

~However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.~

~I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas:~

Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.

~Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).~

In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.

And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:

kmike commented 4 months ago

By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse.

This seems like a good change.

We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json

Sounds good

use XmlResponse instead of HtmlResponse for XHTML documents (5)

It seems the largest change here is that the selector type is changed from html to xml. According to https://lxml.de/parsing.html, that is good ("HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware."), but I don't know what it means in practice. Anyway, probably a good change.

and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.

Sounds good, although I wonder if it can break some selectors.

The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).

This seems fine.

However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.

I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas: ...

We may also err on the side of returning TextResponse. Do we have a way to explicitly detect binary data now? I.e. add explicit binary detection to our code and, if the type is unknown, use TextResponse by default. But I'm not sure; it also needs some kind of experiment.
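A rough sketch of that idea, reusing the contains_binary helper seen earlier in this PR (its import path and the buffer length constant are assumptions here, not confirmed API):

```python
from scrapy.http import Response, TextResponse
from xtractmime import RESOURCE_HEADER_BUFFER_LENGTH  # assumed import path
from xtractmime.utils import contains_binary  # assumed import path


def fallback_response_class(body: bytes) -> type:
    # For unknown or custom MIME types, prefer TextResponse unless the start
    # of the body looks binary.
    if contains_binary(body[:RESOURCE_HEADER_BUFFER_LENGTH]):
        return Response
    return TextResponse
```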

Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.

👍

Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).

Makes sense, why not.

In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.

I think that's fine.

And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:

(I'll stop for now, answer later)

Gallaecio commented 3 months ago

I went with “Re-run xtractmime ignoring Content-Type when the result is Response” (https://github.com/scrapy/scrapy/pull/5204/commits/13bc1499ac2191d39bf9643e6c3b3770a2207240).