scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

Wrong type(response) for binary responses #4240

Open ejulio opened 4 years ago

ejulio commented 4 years ago

Description

Binary (file) responses are identified as TextResponse instead of a plain Response in spider.parse.

Steps to Reproduce

From this URL https://www.ecb.europa.eu/press/pr/date/2004/html/pr040702.en.html We can get this link https://www.ecb.europa.eu/pub/redirect/pub_5874_en.html (text is pdf 1692 kB). It redirects to https://www.ecb.europa.eu/pub/pdf/other/developmentstatisticsemu200406en.pdf

Expected behavior: [What you expect to happen]

In the spider, isinstance(response, TextResponse) should be False.

Actual behavior: [What actually happens]

In the spider, isinstance(response, TextResponse) is True, even though the Content-Type header is application/pdf.

Versions

Scrapy       : 1.7.3
lxml         : 4.4.2.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic

Extra info

Probably we should handle known binary types in Content-Type here https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py At least, application/pdf should be in the list. It would be nice to add some mechanism to allow developers to extend the mapping on the fly, as we can get new types in a project basis and it would be easier update the desired behavior.

GeorgeA92 commented 4 years ago

@ejulio, @Gallaecio ScrapyAgent._cb_bodydone method.. chooses response class: https://github.com/scrapy/scrapy/blob/e22a8c8c36e34ffaf12ef9e330624df654582605/scrapy/core/downloader/handlers/http11.py#L395-L400 responsetypes.from_args method performs several checks with following logic:

https://github.com/scrapy/scrapy/blob/e22a8c8c36e34ffaf12ef9e330624df654582605/scrapy/responsetypes.py#L107-L119

In this case: .from_headers and .from_filename methods return plain Response class. method .from_body identified content of this pdf file as text. As result <class 'scrapy.http.response.text.TextResponse'> will be returned despite results of response headers and url checks: 1 As You can see on the debugger screenshot: this result of .from_body method based on checking only first 5000 bytes of response body while the same check on the whole response body clearly indicate that response is binary.

kmike commented 4 years ago

IMHO a correct solution would be to implement https://mimesniff.spec.whatwg.org/, though it looks like a significant amount of work (something like a GSoC project?). Ideally, as an external library which Scrapy uses. responsetypes is a poor man's mime sniffing implementation.

Gallaecio commented 4 years ago

There seems to be already such a project: https://pypi.org/project/mimesniff/

OmarFarrag commented 4 years ago

@Gallaecio @kmike I have gone through the code of mimesniff and the specs. Looks like it only implements section 7.1 Identifying a resource with an unknown MIME type, while there are other steps in the specs that should be done, which can be found in section 7 Determining the computed MIME type of a resource ..
I can work on that, maybe even through GSoC (student-alert) .. what do you think?

kmike commented 4 years ago

I think that'd be a great GSoC project. mimesniff library doesn't look enough indeed.

bchao1 commented 4 years ago

Hi @kmike @Gallaecio, I am a 3rd year EE/CS student interested in this year's GSoC. I've programmed in Python for 2 years and have some open-source experience. This is my Github: https://github.com/Mckinsey666.

I'm currently going through the MIME sniffing protocols and the scrapy functions mentioned above. Are there any other updates on this idea or possible contributions I can make before the GSoC proposal?

Thanks a ton!

Best, Brian

Gallaecio commented 4 years ago

@Mckinsey666 Great to hear from you!

Make sure you read http://gsoc2020.scrapinghub.com/participate carefully. The “Pre-Application Pull Request” section covers how to make contributions to support your GSoC proposals.

nyov commented 4 years ago

I don't know about mimesniff. At first peek, that looks like a case of NIH and a try at reinventing the venerable file magic mime detection? They seem to have "standardized" a handful of known byte patterns for the most common file-types there, but it seems limited in scope. From a quick look, I only saw many how-to-write-parser-for-x pseudocode lines over there. Is there a benefit with mimesniff over the existing file "magic" library, which can identify a ton of files? (Maybe they can work with HTTP range-requests or have other web-relevant features spec'd, that a filesystem-based tool like file wouldn't?)

libmagic has a well-known plaintext description format in BNF grammar, which let's you write a new format parser in minutes and compile tons of those to a binary map for fast lookups by libmagic. Not all the existing indicators shipped with file may translate 1:1 to web usage, since they can and do detect a lot more than just mime-type (number of files in archive, etc.), but scrapy could ship it's own customized "magic" file, if linking a c library is acceptable for a Scrapy library here. Several python wrappers exist and should be easy to adapt or write from scratch.

A possible argument against file may be that it isn't really built for stream-based parsing I think (why, if it can usually just malloc the file); but then most file-types are identifiable with the first few kbytes, and we don't care to know the number of archive members in a zip here, so simply throwing the first ~16KB buffer of a response at the fingerprinting logic should usually be sufficient (of course if we have the whole response.body in memory anyway, using that can have better results for text encoding detection).

Again, I don't know much about mimesniff at this point, but working with libmagic could have pretty much immediate results: all the parser code exists. I could probably write a downloader middleware for that in a day or two.

gigatesseract commented 4 years ago

Hey @Gallaecio @kmike ! I'm Praveen, a CS student in my pre-final year. I was looking through the standard for MIME sniffing, and would like to implement a standard for the same, as part of GSOC2020. Currently, I am going through the MIME sniff standard, existing library implementations and MIME sniffing vulnerabilities and looking at the lines of what @nyov has mentioned. In the "Pre-Application Pull request", should we tackle issues under the "Contribute" tab of the projects or any issue in general pertaining to them ?
http://gsoc2020.scrapinghub.com/projects (I am referring to these projects)
Also should we build on top of libmagic or implement the standard from scratch ?

Regards, Praveen

kmike commented 4 years ago

I'm not sure we should be using libmagic here, as the goal is to be compatible with browser MIME sniffing, the way browsers implement it, not to implement a general file type detection. Maybe libmagic is compatible with the browser detection algorithm, but I haven't found mentions of libmagic/file in https://mimesniff.spec.whatwg.org/.

nyov commented 4 years ago

@gigatesseract, +1 for mentioning and looking at MIME sniffing vulnerabilities. I can't answer authoritatively on the GSoC q, but so it doesn't get overlooked: I believe the /contribute issue list is for onboarding, it says

Each issue displayed here is a "good first issue," selected for its relative approachability for first-time contributors.

As such, I believe you're free to contribute to issues with a higher level of difficulty, if they fit otherwise.

Gallaecio commented 4 years ago

Also should we build on top of libmagic or implement the standard from scratch ?

Whatever works best. If existing libraries can help with long-term maintenance or performance, they should definitely be considered. DRY and KISS always apply :slightly_smiling_face:

sakshamb2113 commented 4 years ago

@kmike @Gallaecio @nyov I am reading about the current MIME sniffing implementations and the security issues related to it. I am interested in working on this project for GSoC 2020. As @kmike said that libmagic may not necessarily implement the browser mimetype detection algorithms maybe building the MIME sniffing standard from scratch is a better idea?

Gallaecio commented 4 years ago

I believe that is @kmike’s point, yes. We are open to reusing existing libraries, but the point is to get the best possible implementation, and at the moment it looks like the best way to do that may be a custom implementation from scratch.

nyov commented 4 years ago

And I disagree on that point with @kmike , but that's my personal point of view. I'm not who you need to impress if you're a gsoc student :smile: .

To my understanding, mime sniffing has two problem domains: header sniffing of a passed MIME type (and evaluating "no-sniff" X-Content-Type-Options header etc.): these can't be done by file/magic. This is the logic to decide if content sniffing is needed at all, or if we go by headers or file-endings and stuff.

And then the other is content body sniffing. These are all the byte pattern and -mask things listed in that mimesniff spec page. These would be easily implemented using file magic grammar, from what I can see, and you'd be hard-pressed to reinvent a similarly extensible parser in a gsoc project. "mimesniff" stops after speccing mp4, webm and mp3 media type parsers, but file has pattern magic to detect some hundred of movie format variations alone, out of the box. Wrapping libmagic for scrapy could give users here a much more customizable framework solution later, than a rigid custom parser that implements the bare minimals given by "mimesniff".

Just a single example. Let's use GIF. That whatwg spec page lists it like this:

Byte Pattern Pattern Mask Leading Bytes to Be Ignored Image MIME Type Note
47 49 46 38 37 61 FF FF FF FF FF FF None. image/gif The string "GIF87a", a GIF signature.
47 49 46 38 39 61 FF FF FF FF FF FF None. image/gif The string "GIF89a", a GIF signature.

Here is how you would write a libmagic implementation fitting that description:

# GIF
0       string          GIF8            GIF image data
>4      string          7a              \b, version 8%s
!:mime  image/gif
>4      string          9a              \b, version 8%s
!:mime  image/gif

That's it, pretty much copied from the existing code. It says: start at byte 0 (don't ignore any leading bytes), then scan for the string "GIF8". If that matches (go to the next line), skip (those) 4 bytes and then look for string "7a", or "9a", and print that as the gif version number. If either of these match, return the mime type as "image/gif". If that looks too easy, you can tell magic to look for "47 49 46 38" as big endian byte pattern instead (now it looks more like the mime-sniff spec thing):

0       belong          0x47494638       GIF image data
>4      beshort         0x3961           \b, version 89a 
!:mime  image/gif

And libmagic lets you scan for all kinds of little, big or native endian bytestrings, floating point numbers; and more. Or let's see how libmagic detects webm by default, which requires 40 lines of parser steps on the mimesniff page:

0               belong          0x1a45dfa3
>4              search/4096     \x42\x82
>>&1            string          webm            WebM
!:mime  video/webm

I won't explain those 4 lines, but they do pretty much what the whatwg parser steps ask for, if not as rigidly, except for the bigger search buffer of 4096 here versus 38-4 bytes in the "spec".

Gallaecio commented 4 years ago

I think it would make sense to implement the body-based detection in libmagic syntax following the conventions from the standard, and have regular libmagic as a fallback for cases not supported by the standard.

I think the point of aiming to implement the standard is to imitate the behavior of a web browser (which we are assuming will follow the standard as well, if they don’t do it already), potentially even making the same mistakes on purpose in some scenarios, unless there’s a reason why we should not.

sakshamb2113 commented 4 years ago

@ejulio @Gallaecio I have made a draft proposal for this project. Here is the link https://docs.google.com/document/d/1Zpmm-4mKGr-s74UWGRRooMIjl89PMmefB-dRmJiWXAk/edit?usp=sharing I would be very happy to get some feedback. Comment access has been given.

sakshamb2113 commented 4 years ago

@ejulio @Gallaecio I have updated my proposal. It would be great if you could provide some feedback about it.

Gallaecio commented 4 years ago

Looks good to me. I was personally hoping to see libmagic-related work earlier out of pure curiosity, but the proposed calendar makes sense, better to follow an order close to that of the specification we are implementing :slightly_smiling_face:

sakshamb2113 commented 4 years ago

Thanks a lot @Gallaecio @ejulio for the reviews. I have tried to incorporate your feedback into the proposal and made some changes.

deepakdinesh1123 commented 3 years ago

is this issue still open for contribution?

Gallaecio commented 3 years ago

It is still open.

I’ll take the chance to mention sniffpy, although I think that, just as mimesniff, fails to follow the spec, at least not completely.

dukkee commented 3 years ago

Hello, @Gallaecio! I will be happy to write a library with full spec support as a part of my GSoC 2021 participation. Can we communicate via any other messengers (discord, FB messenger, etc.) or it can be only GitHub?

Gallaecio commented 3 years ago

That’s good to hear! Please, follow the communication guidelines described at https://gsoc2021.zyte.com/participate#communication

dukkee commented 3 years ago

@Gallaecio two additional questions: as it will be a new library, how do I need to do with the next rule: Before you apply, you must create a pull request against the repository of the project that you want to target in your application? So I must send PR to any repo of your organization or specific one in this case?

Gallaecio commented 3 years ago

So I must send PR to any repo of your organization or specific one in this case?

In this case I guess you are free to choose the target project. Just make sure you pick one that is actively maintained, i.e. one with recent commits.

Can I create a new separate issue about this new lib here to communicate with you and regarding lib details, and regarding gsoc details at all?

I think we can keep that discussion to this issue, as it’s quite relevant to the issue.

akshaysharmajs commented 3 years ago

Hello @Gallaecio, I have submitted a proposal for this year's GSoC program. Here is the link to it: https://docs.google.com/document/d/1X9g62mNxYI305nfiAAmYkrOWkQMiaToni7kIWx30TKA/edit?usp=sharing My apologies for showing it to you this late, I wasn't sure if I would be able to complete my proposal in time as I am having my final exams this week. Still, it would be great if you can review it once. Thank you for your time, looking forward to working on this project as a GSoC student.

Gallaecio commented 3 years ago

It’s too late to provide you with feedback, but we will certainly take it into consideration, thanks!