akshaysharmajs opened 2 years ago
Merging #5204 (13bc149) into master (1c9d308) will increase coverage by 0.03%. The diff coverage is 94.36%.
From what I understand by looking into https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py, from_args is the main function required by other Scrapy files for MIME sniffing. I think calling xtractmime.extract_mime with different parameters, based on what arguments are passed to from_args, would be good. I am not sure whether the other functions in responsetypes.py are still required.
Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what all can be added to it; application/pdf could be one.
[…] I think from_args is the main function […] for mime sniffing. I think calling xtractmime.extract_mime […] in from_args will be good.
I think so too.
I am not sure whether the other functions in responsetypes.py are still required.
xtractmime will basically replace all other methods there. We will need to keep them around for backward compatibility, but I imagine that, as part of this pull request, we should have them all log a warning except for from_args.
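A minimal sketch of what that could look like (the decorator and the delegated body are illustrative, not the actual implementation; Scrapy itself would likely use its own ScrapyDeprecationWarning class):

```python
import warnings


def deprecated_response_type_method(func):
    """Log a deprecation warning every time a legacy method is called."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is deprecated, use from_args instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)
    return wrapper


@deprecated_response_type_method
def from_mimetype(mimetype):
    # Stand-in body; the real method maps a MIME type to a response class.
    return mimetype
```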
Also, CLASSES needs to be updated with more MIME types and response classes, but I am not sure what all can be added to it; application/pdf could be one.
We don’t need additional classes. Response is the right class for any binary response (e.g. PDF), and it is already used for any MIME type not mapped in CLASSES, so there’s nothing you need to do about binary MIME types.
If you can think of additional MIME types that make sense for one of the existing Response subclasses (HtmlResponse, XmlResponse, TextResponse), then please do feel free to update CLASSES accordingly.
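The fallback described above can be sketched as follows (the mapping here is abbreviated and uses class names as plain strings for illustration; the real CLASSES maps MIME types to import paths of response classes):

```python
# Abbreviated stand-in for scrapy.responsetypes.CLASSES
CLASSES = {
    "text/html": "HtmlResponse",
    "application/xml": "XmlResponse",
    "text/plain": "TextResponse",
}


def response_class_for(mime_type):
    # Any MIME type not present in the mapping, such as application/pdf,
    # falls back to the base Response class.
    return CLASSES.get(mime_type, "Response")
```

So a binary type like application/pdf already resolves to Response without any change to CLASSES.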
Related to that, although not achievable by simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.:
```python
mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response
```
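For reference, a rough sketch of what such helper predicates could look like, loosely based on the MIME type groups in the spec (these helpers did not exist in xtractmime at this point, and the coverage below is a simplified subset):

```python
def is_html_mime_type(mime_type):
    return mime_type == "text/html"


def is_xml_mime_type(mime_type):
    # Per https://mimesniff.spec.whatwg.org/#xml-mime-type
    return mime_type.endswith("+xml") or mime_type in ("text/xml", "application/xml")


def is_json_mime_type(mime_type):
    # Per https://mimesniff.spec.whatwg.org/#json-mime-type
    return mime_type.endswith("+json") or mime_type in ("application/json", "text/json")


def is_javascript_mime_type(mime_type):
    # Subset of https://mimesniff.spec.whatwg.org/#javascript-mime-type
    return mime_type in (
        "application/javascript",
        "application/ecmascript",
        "text/javascript",
        "text/ecmascript",
    )
```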
Related to that, although not achievable by simply extending CLASSES: the standard taught me that any MIME type ending in +xml is to be treated as an XML file, so maybe it would make sense to modify the class so that it uses XmlResponse when that’s the case, even for unknown MIME types. In fact, maybe you could stop relying on CLASSES altogether and instead expose some methods based on https://mimesniff.spec.whatwg.org/#mime-type-groups in xtractmime and use them here, e.g.:

mime_type = extract_mime(…)
if is_html_mime_type(mime_type):
    return HtmlResponse
if is_xml_mime_type(mime_type):
    return XmlResponse
if (
    mime_type.startswith('text')
    or is_json_mime_type(mime_type)
    or is_javascript_mime_type(mime_type)
):
    return TextResponse
return Response
That's a great idea, I will add this functionality to xtractmime 👍🏼
What can be the value of the supported_types parameter for extract_mime? Is that required here or not?
A similar thing goes for nosniff. In the future we may want to expose a Scrapy setting to allow users to force sniffing regardless of X-Content-Type-Options. But since that feature is not in the current implementation, and we wouldn’t expect it to be used extensively, I think it is OK to leave that out until a time when a user requests that feature.
When I said this, I did not mean for you to remove your related code. I meant that, at some point in the future, users may ask to be able to send a custom value for this parameter to xtractmime, overriding whatever the X-Content-Type-Options header says, but that as far as this pull request goes, relying on X-Content-Type-Options would be OK.
However, come to think of it, X-Content-Type-Options could be exploited to prevent the use of a specialized response class. So maybe it is better not to rely on X-Content-Type-Options for now, and maybe in the future make it possible to rely on it, opt-in, through a setting.
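Should we ever rely on the header, determining the nosniff flag could look roughly like this (a sketch based on the Fetch standard's handling of X-Content-Type-Options; the headers dict shape is an assumption):

```python
def no_sniff_from_headers(headers):
    # Per https://fetch.spec.whatwg.org/#x-content-type-options-header,
    # only a first list element of "nosniff" (case-insensitive) enables
    # the flag; anything else, or a missing header, leaves it off.
    value = headers.get(b"X-Content-Type-Options", b"")
    return value.split(b",")[0].strip().lower() == b"nosniff"
```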
I have added the pre- and post-xtractmime tests with the expected behavior as comments. There can be more failing scenarios; if I find one I will add it later. Still, a lot of tests are failing.
E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>
This is failing because mimetypes.MimeTypes() returns a text/xml content type instead of application/gzip:
>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>>
E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>
This is failing because we are no longer considering the NULL byte, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.
If you want I can update the existing comments for the tests based on the updated behavior.
E AssertionError: {'headers': {b'Content-Disposition': [b'attachment; filename="data.xml.gz"']}, 'url': 'http://www.example.com/page/'} ==> <class 'scrapy.http.response.xml.XmlResponse'> != <class 'scrapy.http.response.Response'>
This is failing because mimetypes.MimeTypes() returns a text/xml content type instead of application/gzip:

>>> MimeTypes().guess_type("data.xml.gz")
('text/xml', 'gzip')
>>>
It looks like we need more complex logic than just taking the first item in the tuple that guess_type returns.
Based on https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type, I think that if the second item of the tuple is not None, we should interpret the MIME type as application/<tuple second value>.
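A sketch of that logic (the helper name is made up; note that guess_type's second tuple item is the encoding, e.g. 'gzip' for .gz files):

```python
import mimetypes


def guess_mime_type(filename):
    # If guess_type() reports an encoding (second tuple item), interpret
    # the MIME type as application/<encoding> rather than using the first
    # item, which describes the *compressed* content.
    mime_type, encoding = mimetypes.MimeTypes().guess_type(filename)
    if encoding is not None:
        return f"application/{encoding}"
    return mime_type
```

With this, data.xml.gz would resolve to application/gzip instead of text/xml.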
E AssertionError: {'body': b'\x00\xfe\xff', 'url': 'http://www.example.com/item/', 'headers': {b'Content-Type': [b'text/plain']}} ==> <class 'scrapy.http.response.text.TextResponse'> != <class 'scrapy.http.response.Response'>
This is failing because we are no longer considering the NULL byte, and xtractmime detects b"\xfe\xff" as text/plain instead of application/octet-stream.
If you want I can update the existing comments for the tests based on the updated behavior.
Actually, I believe the current NULL byte replacement is too simple. We should only replace if there are no other binary data bytes, and the current approach just replaces NULL bytes always. See https://github.com/scrapy/scrapy/pull/5204#discussion_r679468793
I thought the integration part would be simpler; I was wrong 😅
Actually, I believe the current NULL byte replacement is too simple. We should only replace if there are no other binary data bytes, and the current approach just replaces NULL bytes always. See #5204 (comment)
Currently, I am checking the whole body for binary bytes:

contains_binary_bytes = False
for index in range(len(body)):
    if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
        contains_binary_bytes = True
        break
if not contains_binary_bytes:
    body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?
I have created a separate PR for the response class computation using mimegroups. Please review https://github.com/akshaysharmajs/scrapy/pull/2/files
for index in range(len(body)):
    if body[index:index+1] != b"\x00" and contains_binary(body[index:index+1]):
        contains_binary_bytes = True
        break
if not contains_binary_bytes:
    body = body[:RESOURCE_HEADER_BUFFER_LENGTH].replace(b"\x00", b"")

Would it be better to just check body[:RESOURCE_HEADER_BUFFER_LENGTH]?
I think so, yes. You could just set body = body[:RESOURCE_HEADER_BUFFER_LENGTH] before the code above, and work with just body in this code.
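Putting the two suggestions together, the whole check could look something like this (a self-contained sketch: in the pull request, contains_binary and RESOURCE_HEADER_BUFFER_LENGTH come from xtractmime, so the binary byte set is reimplemented locally here only for illustration):

```python
# Binary data bytes per https://mimesniff.spec.whatwg.org/#binary-data-byte:
# 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F
BINARY_BYTES = (
    set(range(0x00, 0x09)) | {0x0B} | set(range(0x0E, 0x1B)) | set(range(0x1C, 0x20))
)

# Length of the resource header the MIME sniffing standard inspects.
RESOURCE_HEADER_BUFFER_LENGTH = 1445


def strip_null_bytes(body):
    # Work on the resource header only, as suggested above.
    body = body[:RESOURCE_HEADER_BUFFER_LENGTH]
    # Strip NULL bytes only when no *other* binary data byte is present.
    if not any(byte in BINARY_BYTES and byte != 0x00 for byte in body):
        body = body.replace(b"\x00", b"")
    return body
```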
Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?
Just out of curiosity, why are some tests giving this error: ModuleNotFoundError: No module named 'xtractmime'?
The testenv:docs entry of tox.ini does not install the same dependencies as the rest of the environments; it installs based on setup.py. (I did not check whether there are import failures in other jobs, but if so the reasons are probably similar.)
That said, maybe the configuration should be modified to change this: it makes sense for the documentation job to install all dependencies just as other test environments do (in addition to documentation dependencies). In fact, maybe the documentation job should install extra dependencies as well.
However, I don’t recommend you spend time on changing that, since things will solve themselves once we publish xtractmime on PyPI, which we should do before merging this anyway.
Are the latest test failures related to these changes?
Are the latest test failures related to these changes?
They are related to something else.
All the tests in test_responsetypes.py are passing, but the following jobs are giving this error: ModuleNotFoundError: No module named 'xtractmime'
- tests (3.6.12, pinned)
- tests (3.6.12, asyncio-pinned)
- tests (pypy3, pypy3-pinned, 3.6-v7.2.0)
I’m closing and reopening the pull request to trigger new CI tests. The last run had all jobs failing but 4, with seemingly unrelated issues.
@akshaysharmajs Regarding the tests failing for the "pinned" environments: that's because the dependencies section for them does not inherit the main dependencies in tox.ini. Adding git+https://github.com/scrapy/xtractmime.git@binary#egg=xtractmime there should work.
=================================== FAILURES ===================================
_______________________ FetchTest.test_redirect_default ________________________
self = <tests.test_command_fetch.FetchTest testMethod=test_redirect_default>
@defer.inlineCallbacks
def test_redirect_default(self):
_, out, _ = yield self.execute([self.url('/redirect')])
> self.assertEqual(out.strip(), b'Redirected here')
/home/runner/work/scrapy/scrapy/tests/test_command_fetch.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
super().assertEqual(first, second, msg)
E twisted.trial.unittest.FailTest: b'Redirected here\n<memory at 0x7fd99aae3050>\n<memory at 0x7fd99aae3050>' != b'Redirected here'
___________________ ShellTest.test_response_encoding_gb18030 ___________________
self = <tests.test_command_shell.ShellTest testMethod=test_response_encoding_gb18030>
@defer.inlineCallbacks
def test_response_encoding_gb18030(self):
_, out, _ = yield self.execute([self.url('/enc-gb18030'), '-c', 'response.encoding'])
> self.assertEqual(out.strip(), b'gb18030')
/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:45:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
super().assertEqual(first, second, msg)
E twisted.trial.unittest.FailTest: b'<memory at 0x7f4611e4f120>\ngb18030' != b'gb18030'
____________________ ShellTest.test_response_selector_html _____________________
self = <tests.test_command_shell.ShellTest testMethod=test_response_selector_html>
@defer.inlineCallbacks
def test_response_selector_html(self):
xpath = 'response.xpath("//p[@class=\'one\']/text()").get()'
_, out, _ = yield self.execute([self.url('/html'), '-c', xpath])
> self.assertEqual(out.strip(), b'Works')
/home/runner/work/scrapy/scrapy/tests/test_command_shell.py:40:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/runner/work/scrapy/scrapy/.tox/py/lib/python3.7/site-packages/twisted/trial/_synctest.py:424: in assertEqual
super().assertEqual(first, second, msg)
E twisted.trial.unittest.FailTest: b'<memory at 0x7f1b7de2d120>\nWorks' != b'Works'
These 3 tests are still failing.
Should I add tests for _guess_response_type and _guess_content_type?
Should I add tests for _guess_response_type and _guess_content_type?
I don’t think it is necessary as long as all their logic gets tested indirectly through other tests. It is hard to tell from the Codecov results, though; I think they won’t refresh until all tests pass.
Oh, I think I know what’s happening.
@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)
Oh, I think I know what’s happening.
@akshaysharmajs Maybe you can push a commit to xtractmime’s main branch removing that print? (no PR needed)
👍🏼
Well, now they are passing!
I’ve just run into something that may be worth addressing as part of these changes: https://github.com/scrapy/scrapy/blob/624a1ff3e97e693e85546a54a7abba3d94bbbebb/scrapy/downloadermiddlewares/httpcompression.py#L71-L74
I’ve just run into something that may be worth addressing as part of these changes:
Thanks for mentioning it, I will consider it. Though I have not been getting enough time to make the changes 😅
@Gallaecio I guess these changes are working. Please review them, and let me know if you have any concerns. (Some other files are failing the tests, Idk why.)
I think I have figured out what's failing the tests. Here, we are using 'http://www.example.com' as the url, which forces the content_types to be (b'application/x-msdos-program',). But we should add a / at the end of all URLs, like 'http://www.example.com/'. Making this change makes the tests pass.
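The mechanism can be reproduced with the standard library (Scrapy loads its own mime.types file, which includes a mapping for the .com extension; here an equivalent mapping is registered manually for the demo):

```python
import mimetypes

mt = mimetypes.MimeTypes()
# Scrapy's bundled mime.types maps the .com extension (MS-DOS programs)
# to this MIME type; register the same mapping for the demo.
mt.add_type("application/x-msdos-program", ".com")

# Without a trailing slash, the hostname looks like a file with a
# .com extension, so a MIME type is guessed from it:
without_slash = mt.guess_type("www.example.com")

# With a trailing slash, the path has no extension, so nothing is guessed:
with_slash = mt.guess_type("www.example.com/")
```

That is why 'http://www.example.com' ends up with content_types of (b'application/x-msdos-program',) while 'http://www.example.com/' does not.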
I have done some refactoring, I hope that is OK.
I still want to test a few things myself (e.g. how these changes affect to the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.
I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge https://github.com/scrapy/xtractmime/pull/12 and release the first public version.
I have done some refactoring, I hope that is OK.
Yeah, thank you!
I still want to test a few things myself (e.g. how these changes affect to the decompression and HTTP compression downloader middlewares), but I may not have time for that for 1 or 2 weeks.
No problem, let me know if I can help with that.
I do think this pull request should no longer block the release of xtractmime, any further change here is unlikely to affect xtractmime. So we can probably merge scrapy/xtractmime#12 and release the first public version.
That would be awesome, please keep me posted!
Thank you so much 😍
Note to self: Make sure we are testing that if the Content-Encoding header reports gzip but the content is plain text (e.g. b"\r\n"), the HTTP compression middleware does not try to decode the response and fail. Unless that is supposed to happen.
I basically reverted some of the pro-backward-compatibility changes that I had previously asked Akshay to make. So these changes follow the standard almost to the letter, except for 2 exceptions (see the issue description for details).
Because the new behavior, close to the standard, changes a lot, especially in many cases where before you would get a text-based response class and now you would get a binary response class, it would be good, as a next step, to actually test how much these changes affect real-world scenarios.
@kmike suggested off GitHub that we build a test that basically uses browser automation to target the homepage of popular domains, records the URLs downloaded when rendering each homepage, and then we see, for those URLs, where the new implementation would cause a change.
Based on the results of such a test, we can determine if we want to keep things close to the standard, whether we want to deviate and if so to what extent, and whether we want to allow some behavior changes (e.g. force body-based response class choice, or force a given response class altogether) based on user input (e.g. request meta or settings).
I’ve recorded URLs using browser rendering on the home page of the top 50 domains, then downloaded them with Zyte API and checked which response class each implementation would use, and analyzed the results:
Overall, results seem to be for the better, but I have to look further into 2 of the URLs to understand why the new implementation treats them as binary.
https://github.com/scrapy/xtractmime/pull/18 fixes the first unexpected scenario.
The other unexpected scenario seems to be related to website bans. On a browser the Content-Type is application/json, but Scrapy gets image/gif, in which case Response is the right class.
@kmike Should we extend the tests to more domains, to test some additional scenarios?
@Gallaecio this is a great start 👍
Yes, I think we should extend the test to more domains (1000-10000), including unpopular domains.
I found https://github.com/opendns/public-domain-lists and went with the combination of the 2 lists (20 000 domains), even though many of those domains are not active anymore since the repository is 10 years old.
This resulted in 448 800 URLs being analyzed. In 18 435 (4%) cases the response class was different in the new implementation.
By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse. But I don’t think numbers matter much here, it’s about deciding how we want to handle even the rarer scenarios.
We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json, use XmlResponse instead of HtmlResponse for XHTML documents (5), and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.
The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).
~However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.~
~I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas:~
- ~Re-run xtractmime ignoring Content-Type when the result is Response.~
- ~[…] Content-Type value.~

Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.
~Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).~
In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.
And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:
- […] Content-Type that does not end in +json (2) switches from TextResponse to Response.
- ~[…] Content-Type used to get JsonResponse (because of the URL path ending in .json?), and now would get TextResponse. Because the Content-Type is invalid and not just custom, xtractmime inspects the body to determine the type, and chooses text/plain as Content-Type.~ ~When Content-Type is valid (previous scenario), xtractmime returns it as is, and if Scrapy does not recognize it, it uses Response.~
- […] Content-Type set to application/x-directory, so instead of TextResponse it now uses Response.
- ~[…] Content-Type of text/plain now triggers Response instead of TextResponse.~
- ~[…] application/jsonp MIME type now causes Response to be used instead of TextResponse. JSONP is usually a JavaScript file wrapping JSON data, but they are supposed to use a JavaScript MIME type.~
- ~[…] application/text MIME type now causes Response to be used instead of TextResponse.~
- […] XmlResponse instead of Response.
- […] Content-Type and .css file extension that before used TextResponse and now would use HtmlResponse.
- […] .js file extension but a JSON MIME type before used TextResponse and now would use JsonResponse.
- […] TextResponse to Response.
- […] Content-Type and .json file extension switches from JsonResponse to TextResponse.
- […] Content-Type as text/html switched from TextResponse to HtmlResponse.
- […] <div with no Content-Type switches from TextResponse to HtmlResponse.

By count, the most significant change (16 153, 3.6%) was SVG responses now using XmlResponse instead of TextResponse.
This seems like a good change.
We also use JsonResponse instead of TextResponse for more responses (64) with a Content-Type ending in +json
Sounds good
use XmlResponse instead of HtmlResponse for XHTML documents (5)
It seems the largest change here is that the selector type is changed from html to xml. According to https://lxml.de/parsing.html, it is good ("HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware."), but I don't know what it means in practice. Anyway, probably a good change.
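What it means in practice is mostly namespace handling. A small standard-library illustration (xml.etree rather than lxml, but the principle is the same): an XML parser keeps the XHTML namespace in element names, so selectors must account for it.

```python
import xml.etree.ElementTree as ET

xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hello</p></body></html>"
)
root = ET.fromstring(xhtml)

# A namespace-aware parser qualifies every tag with the namespace URI,
# so an unqualified search does not match:
unqualified = root.find(".//p")
qualified = root.find(".//{http://www.w3.org/1999/xhtml}p")
```

Here unqualified is None while qualified finds the <p> element, which is the kind of selector breakage worth watching for.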
and use XmlResponse instead of HtmlResponse for pure XML documents mislabeled as HTML.
Sounds good, although I wonder if it can break some selectors.
The new implementation uses Response instead of TextResponse for some binary responses (125) that indicate so in their Content-Type (application/octet-stream or binary/octet-stream). Many of these are font files (e.g. TTF, WOFF).
This seems fine.
However, there are also cases where Response is now used because the response is mislabeled as binary when it is actually JavaScript (91), JSON (59), or some other type of plain-text file (27). There are also a few (6) cases where Response is used now for a plain text file because the Content-Type is custom, and does not start with text/ or end with +json or +xml.
I wonder what we should do with these scenarios where Response is used for a plain text response, where a different response class based on the actual response body content would usually be desired. In a browser, these files when accessed directly are downloaded instead of being rendered, but in web scraping we may be actually interested in parsing these. Some ideas: ...
We may also err on the side of returning TextResponse. Do we have a way to explicitly detect binary data now? I.e. add explicit binary detection to our code, and if there is something unknown, use TextResponse by default. But I'm not sure, it also needs some kind of experiment.
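A sketch of what such explicit binary detection could look like, using the binary data bytes from the MIME sniffing standard (defaulting unknown types to TextResponse is just the idea being floated here, not current behavior):

```python
# Binary data bytes per https://mimesniff.spec.whatwg.org/#binary-data-byte:
# 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F
BINARY_DATA_BYTES = (
    set(range(0x00, 0x09)) | {0x0B} | set(range(0x0E, 0x1B)) | set(range(0x1C, 0x20))
)


def looks_binary(body, length=1445):
    # Inspect only the resource header, as the standard does (1445 bytes).
    return any(byte in BINARY_DATA_BYTES for byte in body[:length])
```

With something like this, unknown MIME types could default to TextResponse unless looks_binary(body) is true.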
Many image and video files now use Response instead of HtmlResponse or TextResponse: GIF (1548), JPEG (220), WebP (43), PNG (21), AVIF (3), TIFF (3), WebM (1), Rive (1). Most of these were trackers, by the way.
👍
Some text-based media files now use Response instead of TextResponse, but we could map them to TextResponse by MIME type: M3U8 (12).
Makes sense, why not.
In 12 cases of empty responses without a Content-Type, HtmlResponse and JsonResponse are replaced by TextResponse.
I think that's fine.
And then there were the corner cases, which I find most interesting, as I wonder if/how we should deal with each of them:
(I'll stop for now, answer later)
I went with “Re-run xtractmime ignoring Content-Type when the result is Response” (https://github.com/scrapy/scrapy/pull/5204/commits/13bc1499ac2191d39bf9643e6c3b3770a2207240).
As per the discussion with @elacuesta and @Gallaecio, this PR will integrate the xtractmime library into Scrapy for MIME sniffing.
Fixes #2900, fixes #4240.
Changes
Behavior:
- […] TextResponse instead of to Response.
- […] Content-Encoding is now always Response (until uncompressed, when the header is removed and the response class mapping re-evaluated).
- Content-Disposition, URL, and body can no longer result in a different response class.
- Content-Disposition header: as long as Content-Type is not defined or has a value equivalent to not being defined (e.g. */*), Content-Disposition is taken into account, i.e. the MIME type matching the file extension is considered the supplied MIME type, as if it had been defined in the Content-Type header. This is because in Scrapy we treat those responses the same as other responses (i.e. we do not download them in file pipelines automatically, we handle them in callbacks just like regular responses), so it seems appropriate to consider the server-suggested file extension to determine the right response class.
- […] Response where the body would map to something like TextResponse. Before, the first thing in the (headers, url, filename, body) that would map to something other than Response would dictate the selected class. So, for example, if the header or URL indicated a binary response but the body was plain text, the result was TextResponse.
- […] Response is used instead of TextResponse.
- […] text/html, for which the body is inspected in case it is actually a news feed file, and if so XmlResponse is used.
- application/xhtml+xml and application/vnd.wap.xhtml+xml now map to XmlResponse instead of to HtmlResponse. See “Note that XHTML is best parsed as XML” @ https://lxml.de/parsing.html
- application/ecmascript and application/x-ecmascript now map to TextResponse instead of Response.
- […] +json (e.g. application/foo+json, application/ld+json) now always map to TextResponse.
- […] +xml (e.g. application/foo+xml) now always map to TextResponse.
- […] JsonResponse or TextResponse, as keeping them seems harmless: application/x-json, application/json-amazonui-streaming, application/x-javascript. However, we may want to consider removing support for them to be more in line with the standard, since TextResponse should get selected anyway based on the body.
- […] Response or that did not exist at all, the actual response class used will generally not have changed, because for old Response mappings body-based detection used to always run, in case the body content could suggest other than Response. The advantage now is that we skip body-based detection altogether for those new mappings.
- \0 is now considered binary.
- \x0c and \x1b are no longer considered binary.
- […] (Content-Encoding: br, gzip) the HTTP compression middleware now applies all decodings, instead of applying only the last one. This is an unrelated change that I (@Gallaecio) deemed worth applying when I realized the issue when working on relevant changes. We could move it into its own pull request, though, if desired.
API:
- scrapy.responsetypes module is deprecated in favor of the new scrapy.utils.response.get_response_class function.
- scrapy.utils.response.get_base_url function is deprecated in favor of the new TextResponse.base_url property. I do not recall the reason behind this change; it seems unrelated to this pull request, so we could move it into its own pull request if desired.
Not implemented
- X-Content-Type-Options from a web server is not taken into account. We could implement it in the future. Note that it is not part of the MIME sniffing standard itself, but the result of its handling can be fed into the MIME sniffing standard algorithm, which accepts a nosniff flag.
To-do
- […] setup.py and tox.ini here accordingly.