scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

HTTP 2 support #1854

Closed povilasb closed 3 years ago

povilasb commented 8 years ago

I peeked at the docs and the issues but couldn't find any info about HTTP/2 support.

Does scrapy support it?

redapple commented 8 years ago

@povilasb , no, scrapy only supports HTTP/1.0 and HTTP/1.1 (we currently use twisted.web.client.Agent)
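
For context, a minimal sketch of the kind of twisted.web.client.Agent usage that such an HTTP/1.1 client involves; this is illustrative only, not Scrapy's actual download handler code:

```python
# Minimal twisted.web.client.Agent usage: the HTTP/1.1 client API that
# Scrapy's download handler builds on. Illustrative only, not Scrapy code.
from twisted.internet import reactor
from twisted.web.client import Agent, readBody


def show(body):
    # print the first part of the response body
    print(body[:200])


agent = Agent(reactor)
d = agent.request(b"GET", b"http://example.com/")
d.addCallback(readBody)              # read the full body into bytes
d.addCallback(show)
d.addBoth(lambda _: reactor.stop())  # stop the reactor on success or error
reactor.run()
```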

pawelmhm commented 8 years ago

Looking into Twisted, I found out there is work on adding HTTP/2 support to Twisted.web: https://twistedmatrix.com/trac/ticket/7460 Once this is merged upstream, do you think Scrapy should also follow suit and add HTTP/2 support?

mguerreiro commented 7 years ago

Up, this has been merged :)

kmike commented 7 years ago

It seems the ticket is for HTTP2 server support, not for client. There is an example of twisted http2 client in python-hyper docs: https://python-hyper.org/projects/h2/en/stable/twisted-post-example.html

pawelmhm commented 7 years ago

It seems like work on the Twisted client is not making much progress right now, but the example you linked, @kmike, doesn't look terribly complicated, so maybe we could just add h2 as a Scrapy requirement and write our own HTTP/2 Twisted-Scrapy client?
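
To make the "use h2" idea concrete, here is a rough sketch of the sans-IO side of such a client (assumptions: the h2 package and its state-machine API; the Twisted transport wiring, i.e. TLS with ALPN "h2", is omitted entirely):

```python
# Sketch of the sans-IO side of an h2-based client: the H2Connection state
# machine produces bytes to write and parses bytes read from the transport.
# Connecting it to a Twisted protocol/transport is not shown here.
import h2.config
import h2.connection

config = h2.config.H2Configuration(client_side=True)
conn = h2.connection.H2Connection(config=config)
conn.initiate_connection()

stream_id = conn.get_next_available_stream_id()
conn.send_headers(
    stream_id,
    headers=[
        (":method", "GET"),
        (":authority", "example.com"),
        (":scheme", "https"),
        (":path", "/"),
    ],
    end_stream=True,
)

outgoing = conn.data_to_send()  # bytes to write to the (TLS) transport
# Bytes received from the transport are fed back in, yielding events such as
# ResponseReceived and DataReceived:
#     events = conn.receive_data(incoming_bytes)
```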

AnshulMalik commented 7 years ago

Hi, I am a student from India, planning to participate in GSoC this year. There was a project idea for a "New HTTP/1.1 download handler"; it was needed because of some issues with the Twisted API, but those issues have recently been resolved (@redapple told me). Then I realized that Scrapy doesn't have support for HTTP/2, so why not choose this as a GSoC project?

I have also done a bit of research on Twisted's HTTP/2 work and on python-hyper, which is a nice implementation of HTTP/2; Twisted itself uses h2 for its HTTP/2 support. Currently, Twisted only supports HTTP/2 on the server side (@kmike).

As @pawelmhm mentioned, one way is to use h2 and start the work. Since we don't know when Twisted will add HTTP/2 client support, we should write our own client.

I am seeking your feedback on this idea: should I do it, what should I keep in mind, and where can I find more info about HTTP/2? And most importantly, is this idea worth it? (Twisted might add client-side HTTP/2 support in the future.)

alexeyqu commented 6 years ago

Hi, I'm a 5th-year CS student at MIPT, Russia, planning to participate in GSoC this year. I have been writing Python for over 3 years and teach it at MIPT; a more detailed CV is here.

This year, HTTP/2 support is mentioned in the ideas list. At a low level, Scrapy uses Twisted for HTTP connections. After looking at the Twisted and Hyper repos, I realized that in terms of client-side HTTP/2 support nothing has changed since @AnshulMalik mentioned it here (except for this gist with a simple HTTP/2 client, but it has been around for a long time). That, obviously, gives some points to the "use h2 and start the work" approach.

I wonder, is there a reason for the silence on this topic (both here and in Twisted)? (I don't have any experience with HTTP/2 internals yet, so maybe I'm missing something crucial.) I've just started familiarizing myself with the Scrapy codebase; could you suggest some good first bugs to fix?

Upd: since this issue is really old and hard to find, I'm going to summon the two possible mentors here: @dangra @lopuhin.

kmike commented 6 years ago

Hey @alexeyqu! That's a feature that could make Scrapy more resource-efficient in some specific cases, not a bug that affects day-to-day work; that's probably the reason it gets less attention.

Glennvd commented 6 years ago

Hey @kmike, I'm afraid I disagree. I'm seeing an increase in websites (using a specific vendor) that detect bot traffic by checking whether the HTTP version (HTTP/2 vs. HTTP/1.1) matches the expected default for the claimed user agent. While I do understand that this does not make it a priority for the Twisted team, I'm sure it has a significant impact for a lot of Scrapy users.

sakshamb2113 commented 4 years ago

@Gallaecio I am interested in this issue. Can I work on this?

adityaa30 commented 4 years ago

@Gallaecio @wRAR Hey. I am a 3rd-year CS student from NITT, India. I am planning to participate in GSoC this year.

Upd: after some googling, I feel the example provided in the hyper-h2 docs, as mentioned by @kmike, can be used as base code, and we can go with the "use h2 and start the work" approach?

Gallaecio commented 4 years ago

@adityaa30 Sounds good to me :slightly_smiling_face:

adityaa30 commented 4 years ago

@Gallaecio Thanks for the reply. I have been studying the workflow of http11.py and wanted to know which key features I should be focusing on.

Gallaecio commented 4 years ago

Well, I guess it would be best to start by getting a working implementation, something that allows using HTTP/2.

When that works, the next step would be to add support for what our HTTP/1.1 implementation currently supports: mostly, making sure that the settings that can be used to modify how HTTP/1.1 works also work similarly for HTTP/2 (where possible).

I think that would make the project complete. Extra goals may involve exposing additional features of HTTP/2 through settings, if they can be useful for web scraping. I wonder, for example, if it would make sense to allow users to limit concurrent requests per TCP connection; I’m not sure, though.
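
To give an idea of the scope, here is a rough skeleton, under the assumption that the new handler mirrors the interface of the existing HTTP/1.1 handler; all names and the HTTP/2 internals below are illustrative placeholders, not actual proposal code:

```python
# Hypothetical outline of an HTTP/2 download handler for Scrapy. Only the
# interface is sketched; the HTTP/2 machinery itself is left as comments.
from twisted.internet import defer


class H2DownloadHandler:  # illustrative name, not an existing class (yet)
    def __init__(self, settings):
        self.settings = settings
        self._connections = {}  # would hold pooled HTTP/2 connections per host

    def download_request(self, request, spider):
        """Return a Deferred that fires with a Response for the given request."""
        d = defer.Deferred()
        # negotiate HTTP/2 via ALPN ("h2"), open a stream, send headers/body,
        # collect HEADERS/DATA frames, build a scrapy Response, then
        # d.callback(response)
        return d

    def close(self):
        # tear down any pooled connections when the downloader shuts down
        return defer.succeed(None)
```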

adityaa30 commented 4 years ago

@Gallaecio Thanks. I am working on a demo implementation now.

adityaa30 commented 4 years ago

@Gallaecio @wRAR I have prepared a draft of my application on Google docs https://docs.google.com/document/d/1AUjXgK9u1QcxqdjJxQObecgHQEKZfzO3U96ivBvblTw/edit?usp=sharing

I would be happy to get feedback on the proposal before submitting it. The doc is open for public comments.

Gallaecio commented 4 years ago

The proposal looks great to me, very detailed :+1:

One topic that it does not cover, though, and one that I think may be especially important once there is HTTP/2 support in Scrapy, is how users will be able to configure which protocol is used. At the moment we have settings like DOWNLOAD_HANDLERS, DOWNLOADER_HTTPCLIENTFACTORY, DOWNLOADER_CLIENTCONTEXTFACTORY, DOWNLOADER_CLIENT_TLS_*. It would be great if you could think about how HTTP/2 support fits into the picture, and whether we need to make changes here (e.g. settings value changes, new settings, etc.) to support scenarios such as using HTTP/2 when possible but falling back to 1.1 otherwise, or the opposite.
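
For readers following along: DOWNLOAD_HANDLERS already maps URI schemes to handler class paths, so one possible shape is a per-scheme override. The HTTP/2 entry below is purely illustrative (no such class existed at this point in the thread); the HTTP/1.1 line reflects the shipped default:

```python
# settings.py sketch: per-scheme download handlers. The https entry is a
# made-up class path used only to illustrate where HTTP/2 could plug in.
DOWNLOAD_HANDLERS = {
    # default HTTP/1.1 handler shipped with Scrapy:
    "http": "scrapy.core.downloader.handlers.http.HTTPDownloadHandler",
    # hypothetical opt-in HTTP/2 handler:
    "https": "myproject.handlers.HTTP2DownloadHandler",
}
```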

adityaa30 commented 4 years ago

@Gallaecio Thanks a lot for the review.

> HTTP/2 support in Scrapy, is how users will be able to configure which protocol is used.

I will work on this now and update my proposal.

adityaa30 commented 4 years ago

@Gallaecio @wRAR I have updated my proposal. If you wouldn't mind taking another look, I would like to get some more feedback. Link to the proposal remains the same.

Gallaecio commented 4 years ago

I cannot think of more feedback at the moment. It looks really thought-out and well put. :+1:

adityaa30 commented 4 years ago

@Gallaecio Thanks a lot.

wRAR commented 4 years ago

I like the proposal, @adityaa30, thank you.

adityaa30 commented 4 years ago

@wRAR Thank you! 🙂

jseyfert commented 3 years ago

> Hey @kmike, I'm afraid I disagree. I'm seeing an increase in websites (using a specific vendor) that detect bot traffic by checking whether the HTTP version (HTTP/2 vs. HTTP/1.1) matches the expected default for the claimed user agent. While I do understand that this does not make it a priority for the Twisted team, I'm sure it has a significant impact for a lot of Scrapy users.

Hi @Glennvd, I have encountered multiple sites that blocked Scrapy due to HTTP/1.1, but I am finding they are only identifying my spider because Scrapy automatically capitalizes the header keys. I have been able to use a workaround that has worked 100% of the time so far.

For example, this is what the site expects:

    accept-encoding: gzip, deflate, br
    accept-language: en-US,en;q=0.9,hi;q=0.8,pt;q=0.7

And this is what Scrapy sends (even if you make them lowercase in the spider):

    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.9,hi;q=0.8,pt;q=0.7
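
For what it's worth, the capitalization appears to come from Scrapy's Headers class normalizing header-key case; a quick check (a small illustration, not part of the workaround itself):

```python
# Observe how Scrapy normalizes header keys (assumption: the Headers class
# is what is doing the capitalization described above).
from scrapy.http import Headers

h = Headers({"accept-encoding": "gzip, deflate, br"})
print(list(h.keys()))  # [b'Accept-Encoding']
```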

This seems to be a workaround to keep the keys lowercase:

    headers={
        # note: both entries below use the same key "", so a plain Python
        # dict literal keeps only the last one as written
        "": "accept-encoding: gzip, deflate, br",
        "": "accept-language: en-US,en;q=0.9,hi;q=0.8,pt;q=0.7",
    },

I hope this helps someone :)

GeorgeA92 commented 3 years ago

Hi, @jseyfert.

> I have encountered multiple sites that blocked Scrapy due to HTTP/1.1, but I am finding they are only identifying my spider because Scrapy automatically capitalizes the header keys.

It is related to #2711

ikumar5am commented 5 months ago

Experimental HTTP/2 support has now been added to Scrapy; check out https://docs.scrapy.org/en/latest/topics/settings.html
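
For anyone landing here: per those docs, the experimental handler is enabled with a per-scheme DOWNLOAD_HANDLERS override along these lines (HTTPS only; check the current documentation for the exact class path and caveats):

```python
# settings.py: opt in to the experimental HTTP/2 download handler for HTTPS,
# as documented at the time of writing.
DOWNLOAD_HANDLERS = {
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
```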