mborho opened 8 years ago
It makes sense to me, +1. Likely the header is changed/deleted to signal that the response body is decoded; we should provide another way to signal that (add a flag to `response.flags`?) if we keep the header. The change is backwards incompatible, so I think we should make it optional: disabled by default, but enabled in the `settings.py` generated by `scrapy startproject`.
I started to implement this as suggested, but I'm unsure about the settings in the HttpCompression middleware class. I can access the global settings by calling `Settings()`, but how can I change the value there for test purposes? If I use

```python
settings = Settings()
settings.set('HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE', 'False', 'spider')
```

in the test code, it does not affect the value read within the HttpCompression middleware class: the "Content-Encoding" header is still removed from `response.headers`, and the removal can't be disabled.
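A toy model (no Scrapy dependency, all names illustrative) of Scrapy's priority-based settings may clarify why the snippet above has no effect: `Settings()` creates a brand-new object, while the middleware reads from the crawler's own settings instance, so mutating the new object never reaches the middleware.

```python
# Illustrative sketch only; not Scrapy's actual Settings implementation.
SETTINGS_PRIORITIES = {"default": 0, "project": 20, "spider": 30}

class ToySettings:
    def __init__(self):
        self._store = {}  # name -> (value, priority)

    def set(self, name, value, priority="project"):
        prio = SETTINGS_PRIORITIES[priority]
        current = self._store.get(name)
        # A new value only wins if its priority is at least as high.
        if current is None or prio >= current[1]:
            self._store[name] = (value, prio)

    def getbool(self, name, default=False):
        value = self._store.get(name, (default, 0))[0]
        return str(value).lower() in ("1", "true")

crawler_settings = ToySettings()   # what the middleware actually consults
test_settings = ToySettings()      # a fresh Settings(), as in the test above
test_settings.set("REMOVE_ENCODING_HEADER", "False", "spider")

# The middleware still sees the default, because it reads a different object:
print(crawler_settings.getbool("REMOVE_ENCODING_HEADER", True))  # True
```

The fix suggested below is therefore not to mutate a fresh `Settings()`, but to hand the middleware the crawler's settings in the first place.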
Is there any progress on this issue so far? Can I provide any help?
Hello, I am looking to help, and this seemed like the least intimidating task. I want to check whether I'm getting my bearings on what needs to be done. There is a desire to have a bool, adjustable directly in default_settings.py (or in a different file), that prevents `HttpCompressionMiddleware.process_response()` from stripping away the Content-Encoding header value.
Could this be done by passing `process_response` an instance of settings and then using `settings.getbool()`, like is done in `HttpErrorMiddleware.__init__(self, settings)`? Or like some downloader middlewares (e.g. ajaxcrawl and redirect) that have an int setting passed to them? Or would a class variable like `redirect.BaseRedirectMiddleware`'s `enable_setting` be preferred?
Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.
> Could this be done by passing `process_response` an instance of settings and then using `settings.getbool()`, like is done in `HttpErrorMiddleware.__init__(self, settings)`? Or like some downloader middlewares (e.g. ajaxcrawl and redirect) that have an int setting passed to them? Or would a class variable like `redirect.BaseRedirectMiddleware`'s `enable_setting` be preferred?
I believe the way to go is to store `crawler.settings` into `self.settings` in `__init__`, and then use `self.settings.getbool()` from `process_response`.
It may be even better to use `self.settings.getbool()` in `__init__` and save the resulting value to an instance variable instead. I'm simply not sure whether the settings are fully populated by that time; I'm not familiar enough yet with the settings life cycle to know that off the top of my head.
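A hedged sketch of that pattern, with stand-in classes so it runs without Scrapy installed. The setting name `HTTPCOMPRESSION_KEEP_ENCODING_HEADER` is hypothetical, and in real Scrapy the framework (not the test) calls `from_crawler` with a live crawler:

```python
# Stand-ins for Scrapy objects; illustrative only.
class FakeSettings(dict):
    def getbool(self, name, default=False):
        return str(self.get(name, default)).lower() in ("1", "true")

class FakeCrawler:
    def __init__(self, settings):
        self.settings = FakeSettings(settings)

class FakeResponse:
    def __init__(self, headers):
        self.headers = dict(headers)

class MiddlewareSketch:
    def __init__(self, settings):
        # Read the flag once at construction time and keep it around,
        # so process_response() never has to consult settings again.
        self.keep_encoding_header = settings.getbool(
            "HTTPCOMPRESSION_KEEP_ENCODING_HEADER", False)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates middlewares through from_crawler, by which
        # point crawler.settings is populated.
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        if not self.keep_encoding_header:
            response.headers.pop("Content-Encoding", None)
        return response

mw = MiddlewareSketch.from_crawler(
    FakeCrawler({"HTTPCOMPRESSION_KEEP_ENCODING_HEADER": True}))
resp = mw.process_response(
    None, FakeResponse({"Content-Encoding": "gzip"}), None)
```

With the flag set, `resp.headers` keeps "Content-Encoding"; built without it, the header is stripped as today.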
> Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.
There’s a setting that defines which downloader middlewares you wish to be loaded. If you find where that setting is used in the code, you should be able to locate the point where they are imported, although I don’t think you need to know that to implement this.
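For reference, the setting in question is `DOWNLOADER_MIDDLEWARES` (the built-ins live in `DOWNLOADER_MIDDLEWARES_BASE` in `scrapy/settings/default_settings.py`, keyed by import path with an order number). A project enables additional middlewares in its `settings.py`; the path and order below are illustrative only:

```python
# Hypothetical project settings.py fragment; the middleware path is made up.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}
```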
Okay, that was my other thought, but I was worried I'd have to find every instance of HttpCompressionMiddleware being instantiated and add settings as an argument. I figure I can model it after AjaxCrawlMiddleware, which gets passed settings in its `__init__` and then stores an instance variable for what it's looking for: `self.lookup_bytes = settings.getint('AJAXCRAWL_MAXSIZE', 32768)`?
Also, I'll find where `AJAXCRAWL_ENABLED` and `AJAXCRAWL_MAXSIZE` are stored so I can place `HTTPCOMPRESSION_HEADERS_KEEP` in the (hopefully) correct file with them. (`AJAXCRAWL_ENABLED`, at least, is in default_settings.py.)
I got it to "pass" the tox tests after adding an `__init__` that stores an instance variable for whether or not to keep the headers. When I say pass, it's only for py37, and there are still 13 xfails and 56 skips; I'm assuming the 'x' and 's' (instead of '.') in the output represent those, respectively. All but one of the xfails were in test_squeues; the other was in test_linkextractors.
I did modify the setup in the httpcompression test file by adding `from scrapy.utils.test import get_crawler` and creating a crawler with both `HTTPCOMPRESSION_HEADERS_KEEP` and `COMPRESSION_ENABLED` passed in the settings dict. It failed without `COMPRESSION_ENABLED`, because the `from_crawler` method used to instantiate the test instance variable `self.mw` checks first thing whether compression is enabled.
On the last 'thing' I pull-requested, @elacuesta was helpful in guiding me and mentioned a few things. First, that the logic at the tail end might result in the wrong value being kept in 'Content-Encoding'. I think they meant that the response's encoding before `process_response` will differ from the encoding it leaves with, so the preserved 'Content-Encoding' value would then be wrong with respect to the encoding of the returned, processed response. Could you possibly confirm that I'm understanding this correctly?
Second, the use of custom metadata in the response was discussed, but they deferred to what @kmike brought up above: placing a flag in the subsequent response.
To do that, it looks like I need to use `Response.replace(self, *args, **kwargs)`, which is already used in `HttpCompressionMiddleware.process_response`. So I was thinking I could add a new entry corresponding to the proper 'Content-Encoding' to the method variable that is ultimately passed to the `response.replace()` call.
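To make the flag idea concrete, here is a minimal, Scrapy-free sketch. The flag name "decoded" and all the stand-in classes are assumptions for illustration, not established Scrapy API:

```python
# Stand-in for scrapy.http.Response; illustrative only.
class FakeResponse:
    def __init__(self, headers=None, flags=None):
        self.headers = dict(headers or {})
        self.flags = list(flags or [])

    def replace(self, **kwargs):
        # Mirrors the spirit of Response.replace: build a new response,
        # defaulting any attribute not overridden to the current value.
        return FakeResponse(
            headers=kwargs.get("headers", self.headers),
            flags=kwargs.get("flags", self.flags),
        )

def process_response_sketch(response):
    if "Content-Encoding" in response.headers:
        headers = {k: v for k, v in response.headers.items()
                   if k != "Content-Encoding"}
        # Signal that the body was decoded, instead of silently
        # dropping the header with no trace.
        return response.replace(headers=headers,
                                flags=response.flags + ["decoded"])
    return response

resp = process_response_sketch(FakeResponse({"Content-Encoding": "gzip"}))
```

After processing, `resp.headers` no longer contains "Content-Encoding", but `resp.flags` records that decoding happened, which is the signal @kmike suggested above.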
This should be up to date if seeing the code makes this easier to understand.
Does this still need doing? Do you want me to pick it up, or for that matter any other "good first issue"? Just let me know. It seems like there are some open PRs, but most of them are abandoned.
There is already https://github.com/scrapy/scrapy/pull/4025. That said, it has been a while, and @VorBoto might have moved on. Maybe someone could look into resuming their work.
@Gallaecio that's what I saw; I can pick up their branch and finish it up.
Sounds good to me. Thanks!
Does this issue still need to be fixed or has it been resolved?
Still pending, with 3 open PRs. It may not be as trivial as it sounds.
See https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcompression.py#L36
IMHO the "Content-Encoding" header should be preserved, since the spider probably wants to see all the original response headers.