urlstechie / urlchecker-python

:snake: :link: Python module and client for checking URLs
https://urlchecker-python.readthedocs.io
MIT License
20 stars 11 forks

Accelerate URL checking using concurrency (asyncio + aiohttp) #50

Open SuperKogito opened 3 years ago

SuperKogito commented 3 years ago

URLs are currently checked in a loop that waits for each request's response before issuing the next one, which becomes slow for huge websites.
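To illustrate why the sequential loop scales poorly, here is a minimal stdlib-only sketch (not the actual urlchecker code): `check_url` is a hypothetical stand-in that simulates a network round trip with `time.sleep`; the real checker would call `requests.get(url, timeout=...)` instead. Total time grows linearly with the number of URLs.

```python
import time

# Hypothetical stand-in for one HTTP round trip; the real checker
# would call requests.get(url, timeout=...) here.
def check_url(url, latency=0.05):
    time.sleep(latency)  # simulate waiting on the network
    return url, 200

def check_urls_serial(urls):
    # Each URL waits for the previous one to finish, so the total
    # runtime is roughly len(urls) * latency.
    return [check_url(u) for u in urls]

urls = [f"https://example.com/page{i}" for i in range(10)]
start = time.perf_counter()
results = check_urls_serial(urls)
elapsed = time.perf_counter() - start
print(f"{len(results)} urls in {elapsed:.2f}s")
```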


Alternatively, we can use concurrency to process requests and responses asynchronously and speed up the system.


I have already integrated this concept in my local repo using the asyncio and aiohttp libraries, and the results look promising. Various blog posts (Python and fast HTTP clients, HTTP in Python: aiohttp vs. Requests, Making 1 million requests with python-aiohttp) report a notable speed difference, and so far my tests confirm that.
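The concurrent approach can be sketched as follows. This is a stdlib-only illustration of the `asyncio.gather` pattern, not the PR's code: `check_url` simulates the network wait with `asyncio.sleep`, where a real implementation would open an `aiohttp.ClientSession` and `await session.get(url)`. Because all coroutines are in flight at once, the total time is roughly the slowest single request rather than the sum of all of them.

```python
import asyncio
import time

# Hypothetical stand-in for aiohttp; a real implementation would use
#   async with aiohttp.ClientSession() as session:
#       async with session.get(url) as resp:
#           return url, resp.status
async def check_url(url, latency=0.05):
    await asyncio.sleep(latency)  # non-blocking simulated network wait
    return url, 200

async def check_urls_concurrent(urls):
    # All checks run concurrently on the event loop.
    return await asyncio.gather(*(check_url(u) for u in urls))

urls = [f"https://example.com/page{i}" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(check_urls_concurrent(urls))
elapsed = time.perf_counter() - start
print(f"{len(results)} urls in {elapsed:.2f}s")
```

Note that `asyncio.run` was added in Python 3.7, which is relevant to the version-support question discussed below.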

The new libraries are slightly different from requests, so some behavior differs from the current implementation.

I managed to almost replicate the features we have in the current version, but I will definitely need your feedback. Anyway, these differences bring me to my main question, @vsoch: do you think we should add this feature as an option (--accelerated-run), or replace the current implementation with it?

vsoch commented 3 years ago

I think this is a great idea! If there is some reason we'd want to keep the original implementation, then instead of --accelerated-run I would do the inverse and ask the user to specify --serial if they want the slower one (so faster is the default). The other detail is reporting: if the checks run async, the results would print fairly out of order, so perhaps it still makes sense to break down sets of URLs by file (so each file still prints together). We could also do something similar to multiprocessing: make a queue, collect the results (each including some key for the filename), and control the order of printing by way of that key. We would also want to understand which versions of Python we would no longer support by using asyncio.
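The per-file grouping idea above can be sketched like this (a hypothetical illustration, not the project's API): each file's URLs are checked concurrently, but the results come back keyed by filename, so the report can still print one file at a time in a stable order.

```python
import asyncio

async def check_url(url):
    await asyncio.sleep(0.01)  # placeholder for an aiohttp request
    return 200

async def check_file(file_name, urls):
    # Check one file's URLs concurrently, but return them together
    # under the file's key so the report prints per file.
    statuses = await asyncio.gather(*(check_url(u) for u in urls))
    return file_name, list(zip(urls, statuses))

async def check_all(files):
    pairs = await asyncio.gather(
        *(check_file(name, urls) for name, urls in files.items())
    )
    return dict(pairs)  # filename -> [(url, status), ...]

files = {
    "README.md": ["https://example.com/a", "https://example.com/b"],
    "docs/index.md": ["https://example.com/c"],
}
report = asyncio.run(check_all(files))
for file_name in files:  # print in the original file order
    print(file_name)
    for url, status in report[file_name]:
        print(f"  {status} {url}")
```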

Also, let's put in this change after the work I'm going to do this weekend to update the url arguments and spelling, just so it's easier to manage the different PRs.

vsoch commented 3 years ago

Those plots are beautiful, by the way! If we keep both fast / slow we could include a page in the docs that shows this difference.

SuperKogito commented 3 years ago

Having the fast version as the default is a good idea. At the moment I keep the URLs separated by file, so one wouldn't notice the difference in output unless --retry-count > 1.

The Python versions question is important; I still need to check that. So far I have only tested my code on Python 3.7. Of course, I intend to wait for your PR :) I just wanted to discuss this change and get your perspective on it, since it will affect the core of the library. But I think this is a step in the right direction, as it will give this library an advantage over other similar tools.

vsoch commented 3 years ago

Oh yes, that's another thing to consider: if we do retries, we can't run them in parallel (they need to honor the timeout/delay).
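Retries and concurrency can actually coexist: the retries for a single URL stay sequential inside one coroutine (honoring the delay between attempts), while different URLs still overlap. A hypothetical sketch, where `check_url` simulates a request that fails its first two attempts:

```python
import asyncio

_attempts = {}  # per-URL attempt counter for the simulation

async def check_url(url, fail_times=2):
    # Simulated request: returns 500 for the first `fail_times`
    # attempts on a URL, then 200.
    n = _attempts.get(url, 0)
    _attempts[url] = n + 1
    await asyncio.sleep(0.01)
    return 500 if n < fail_times else 200

async def check_with_retry(url, retry_count=3, delay=0.05):
    # Retries for one URL run sequentially and honor the delay;
    # only *different* URLs overlap, so --retry-count semantics hold.
    for attempt in range(retry_count):
        status = await check_url(url)
        if status == 200:
            return url, status, attempt + 1
        await asyncio.sleep(delay)  # back off before the next attempt
    return url, status, retry_count

async def main():
    return await asyncio.gather(
        check_with_retry("https://example.com/x"),
        check_with_retry("https://example.com/y"),
    )

results = asyncio.run(main())
print(results)
```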

vsoch commented 3 years ago

okay @SuperKogito take it away! I won't do any more work / changes until you have had time to work on this. And no rush! I just started a new job so I'm a bit under :droplet: :laughing:

SuperKogito commented 3 years ago

I will try to do my best :stuck_out_tongue_winking_eye: Congratulations on the new job :tada: