psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License
13.73k stars 976 forks

Add support for an async api of the package #77

Closed oldani closed 6 years ago

oldani commented 6 years ago

When scraping at large scale we care about performance, and asyncio improves it; since this library already requires Python >= 3.6, we could implement this without hacking around it.

Since the project is already used by a lot of people, the idea is that anyone can use the package in both async and sync ways. How to support both without duplicating the codebase tends to spark debate. Because the codebase in this case is not that large, what I think we could do is rewrite everything in async and then add wrappers to the API for sync support. Users in sync mode can then use the library as normal, even though behind the scenes it runs asynchronously. I have achieved this in other projects by creating a sync function that calls the async version inside a decorator that handles the event loop and any other async details.
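The decorator pattern described above could look something like this minimal sketch (the `fetch_title` function and its URL are hypothetical stand-ins, not part of requests-html):

```python
import asyncio
from functools import wraps

def sync_wrapper(async_fn):
    """Run an async function to completion on a fresh event loop,
    so sync callers never see a coroutine."""
    @wraps(async_fn)
    def wrapper(*args, **kwargs):
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(async_fn(*args, **kwargs))
        finally:
            loop.close()
    return wrapper

# Hypothetical example: the "real" implementation is async...
async def fetch_title(url):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return f"<title for {url}>"

# ...and the sync API is just a thin wrapper around it.
get_title = sync_wrapper(fetch_title)

print(get_title("https://example.org"))
```

With this, the async code is the single source of truth and the sync API stays a one-line wrapper per function.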

Since this library depends on requests, which does not support async yet, I see two options if we go the way proposed above: keep using requests and run it in a ThreadPoolExecutor (though this won't allow truly high concurrency), or use aiohttp, whose interface is fairly similar to requests.
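The ThreadPoolExecutor option could be sketched like this; `blocking_get` here is a stand-in for a real `requests.get(url)` call so the example stays self-contained:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a blocking requests.get call; a real version would
# do `return requests.get(url)` instead.
def blocking_get(url):
    time.sleep(0.1)  # simulate network latency
    return f"response from {url}"

async def fetch_all(urls, max_workers=8):
    # Hand each blocking call to a worker thread and await them all,
    # so the event loop itself never blocks on network I/O.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, blocking_get, u) for u in urls]
        return await asyncio.gather(*tasks)

urls = [f"https://example.org/{i}" for i in range(4)]
results = asyncio.run(fetch_all(urls))
print(results)
```

Concurrency here is capped by `max_workers`, which is why this approach can't match the fan-out of a natively async client like aiohttp.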

@kennethreitz let me know what you think.

kennethreitz commented 6 years ago

+1 ThreadPoolExecutor

See https://github.com/requests/requests-threads

oldani commented 6 years ago

Hey @kennethreitz

I started by adding the async session in #101. I ran some tests locally and it's a bit faster (for 100 requests, sync: 30s, async: 25s, with uvloop: 22s). When requests gets async support, the results could be even faster.
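For reference, enabling uvloop for a benchmark like the one above is a one-time policy swap; a minimal sketch, falling back to the default loop when uvloop isn't installed:

```python
import asyncio

# uvloop is an optional drop-in event loop replacement; fall back to
# the stdlib loop if it isn't available.
try:
    import uvloop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass

async def main():
    # Report which loop implementation ended up running.
    return "running on " + type(asyncio.get_running_loop()).__name__

print(asyncio.run(main()))
```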

Let's keep this issue open so I can continue with the HTMLResponse async interface. I don't know if you already want to include the async session in the docs once the PR gets merged.