scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

"Content-Encoding" header gets stripped from response headers #1988

Open mborho opened 8 years ago

mborho commented 8 years ago

See https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcompression.py#L36

IMHO the "Content-Encoding" header should get preserved, since the spider probably wants to see all the original response headers.

kmike commented 8 years ago

It makes sense to me, +1. Likely the header is changed/deleted to signal that the response body is decoded; we should provide another way to do that (add a flag to response.flags?) if we keep the header.

The change is backwards incompatible, so I think we should make it optional: disabled by default, but enabled in the settings.py generated by scrapy startproject.
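A minimal sketch of that flags idea ('decoded' is a hypothetical flag name, not an existing Scrapy convention):

    # Hypothetical: keep the header and signal decoding via response.flags.
    response = response.replace(
        body=decoded_body,
        flags=response.flags + ['decoded'],
    )
    # Content-Encoding stays in response.headers; a spider could then check:
    # if 'decoded' in response.flags: ...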

foromer4 commented 7 years ago

I started to implement this as suggested, but I'm unsure about the settings in the httpcompression class. I can access the global settings by calling Settings(), but how can I change a value there for test purposes? If I use

    settings = Settings()
    settings.set('HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE', 'False', 'spider')

In the test code, it does not affect the value read within the httpcompression class ...
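For what it's worth, a fresh Settings() object is unrelated to the one the crawler hands to the middleware; the usual pattern in Scrapy's own tests is to build a crawler with the overrides and construct the middleware from it (a sketch, reusing the setting name from the comment above):

    from scrapy.utils.test import get_crawler
    from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware

    # get_crawler builds a Crawler whose settings contain our overrides;
    # the middleware can then read the setting when it is constructed.
    crawler = get_crawler(settings_dict={
        'HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE': False,
    })
    mw = HttpCompressionMiddleware.from_crawler(crawler)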

trim21 commented 5 years ago

"Content-Encoding" header is still removed from response.header and can't be disabled. Is there any progress about this issue up to now? Can I provide any help?

VorBoto commented 4 years ago

Hello, I am looking to help, and this seemed like the least intimidating task. I want to check that I'm getting my bearings on what needs to be done.

There is a desire to have a bool, adjustable directly in default_settings.py or in a different file, that prevents HttpCompressionMiddleware.process_response() from stripping away the Content-Encoding header value.

Could this be done by passing process_response an instance of the settings and then using settings.getbool(), as is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them? Or would a class variable like redirect.BaseRedirectMiddleware's enabled_setting be preferred?

Also, where are the downloader middlewares called or imported? I have been looking but haven't found it yet.

Gallaecio commented 4 years ago

> Could this be done by passing process_response an instance of the settings and then using settings.getbool(), as is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them? Or would a class variable like redirect.BaseRedirectMiddleware's enabled_setting be preferred?

I believe the way to go is to store crawler.settings into self.settings in __init__, and then use self.settings.getbool() from process_response.

It may be even better to call settings.getbool() in __init__ and save the value to an instance variable instead; I'm simply not sure whether the settings are fully populated by that time, as I'm not familiar enough yet with the settings life cycle to know that off the top of my head.
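A sketch of that suggestion (HTTPCOMPRESSION_HEADERS_KEEP is the setting name used elsewhere in this thread, not a merged Scrapy setting):

    from scrapy.exceptions import NotConfigured

    class HttpCompressionMiddleware:

        def __init__(self, settings):
            # Read the flag once at construction time and cache it.
            self.keep_encoding_header = settings.getbool(
                'HTTPCOMPRESSION_HEADERS_KEEP', False)

        @classmethod
        def from_crawler(cls, crawler):
            if not crawler.settings.getbool('COMPRESSION_ENABLED'):
                raise NotConfigured
            return cls(crawler.settings)

        def process_response(self, request, response, spider):
            # ... decode the body as before ...
            if not self.keep_encoding_header:
                del response.headers['Content-Encoding']
            return response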

> Also, where are the downloader middlewares called or imported? I have been looking but haven't found it yet.

There's a setting to define the downloader middlewares that you wish to be loaded. If you find where that setting is used in the code, you should be able to locate the point where they are imported, although I don't think you need to know that to get this implemented.
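The setting in question is DOWNLOADER_MIDDLEWARES_BASE (per-project overrides go in DOWNLOADER_MIDDLEWARES); abridged from scrapy/settings/default_settings.py, it looks roughly like this:

    # Abridged; the numbers are ordering priorities in the middleware chain.
    DOWNLOADER_MIDDLEWARES_BASE = {
        # ...
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
        # ...
    }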

VorBoto commented 4 years ago

Okay, that was my other thought, but I was worried I'd have to find every instance of HttpCompressionMiddleware being instantiated and add settings as an argument. I figure I can model it after AjaxCrawlMiddleware, which gets passed settings in its __init__ and then keeps an instance variable for what it's looking for: self.lookup_bytes = settings.getint('AJAXCRAWL_MAXSIZE', 32768).

Also, I need to find where AJAXCRAWL_ENABLED and AJAXCRAWL_MAXSIZE are stored, so I can place HTTPCOMPRESSION_HEADERS_KEEP in the (hopefully) correct file with them. (AJAXCRAWL_ENABLED, at least, is in default_settings.py.)
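The new flag would then get a default entry there, something like this (HTTPCOMPRESSION_HEADERS_KEEP is the name proposed here, not an existing Scrapy setting):

    # Proposed addition to scrapy/settings/default_settings.py (hypothetical):
    AJAXCRAWL_ENABLED = False             # existing entry, shown for context
    HTTPCOMPRESSION_HEADERS_KEEP = False  # keep Content-Encoding after decoding?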

VorBoto commented 4 years ago

I got it to "pass" the tox tests after adding an __init__ that stores an instance variable for whether or not to keep the headers. When I say "pass", it's only for py37, and there are still 13 xfails and 56 skips. I'm assuming the 'x' and 's' (instead of '.') in the output represent those, respectively. It looks like all but one of the xfails were in test_squeues, and the other was in test_linkextractors.

I did modify the setup in the httpcompression test file by adding from scrapy.utils.test import get_crawler and creating a crawler with both HTTPCOMPRESSION_HEADERS_KEEP and COMPRESSION_ENABLED passed in the settings dict. It failed without COMPRESSION_ENABLED because the from_crawler method used to instantiate the test's self.mw checks first thing whether compression is enabled.

On the last PR I opened, @elacuesta was helpful in guiding me and mentioned a few things. First, that the logic at the tail end might result in the wrong value being kept in Content-Encoding. I think they meant that, as the response is processed, the encoding it has before process_response differs from the one it leaves with, so the preserved Content-Encoding value would be wrong with respect to the body returned in the processed response. Could you confirm that I'm understanding that correctly? Second, the use of custom metadata in the response was discussed, but deferred to what @kmike brought up above: placing a flag in the resulting response.

To do that, it looks like I need to use Response.replace(self, *args, **kwargs), which is already used in HttpCompressionMiddleware.process_response. So I was thinking I could add a new entry corresponding to the proper Content-Encoding into the kwargs that are ultimately passed to the response.replace call.
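A sketch of what that could look like inside process_response ('decoded' is a hypothetical flag name, and keep_encoding_header the instance variable mentioned above):

    # Inside process_response, after the body has been decoded (sketch):
    kwargs = {'body': decoded_body}
    if self.keep_encoding_header:
        # keep the original header and flag the response as decoded instead
        kwargs['flags'] = response.flags + ['decoded']
    response = response.replace(**kwargs)
    if not self.keep_encoding_header:
        del response.headers['Content-Encoding']
    return response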

This should be up to date, in case seeing the code makes this easier to understand.

T0shik commented 2 years ago

Does this still need doing? Do you want me to pick it up, or, for that matter, any other "good first issue"? Just let me know; it seems like there are some open PRs, but most of them are abandoned.

Gallaecio commented 2 years ago

There is already https://github.com/scrapy/scrapy/pull/4025

Gallaecio commented 2 years ago

That said, it has been a while, and @VorBoto might have moved on. Maybe someone could look into resuming their work.

T0shik commented 2 years ago

@Gallaecio that's what I saw, I can pick up their branch and finish it up )

Gallaecio commented 2 years ago

Sounds good to me. Thanks!

delaneyscofield commented 3 weeks ago

Does this issue still need to be fixed or has it been resolved?

Gallaecio commented 3 weeks ago

Still pending, with 3 open PRs. It may not be as trivial as it sounds.