openelections / clarify

Discover and parse results for jurisdictions that use Clarity-based election systems.
MIT License

Parallelize URL requests #2

Closed dwillis closed 9 years ago

dwillis commented 9 years ago

in Jurisdiction#_clarity_subjurisdiction_url

ghing commented 9 years ago

This gets a little tricky.

The cleanest way to do this would be to use asyncio and an HTTP request library like aiohttp. This is essentially the method described in the blog post Fast scraping in python with asyncio. The problem is that asyncio is only available in Python 3.4+. There is a package called Trollius that provides a similar API to asyncio for older Pythons.

Historically, using requests with gevent (see grequests) was the way to go, but that doesn't work with Python 3.

So, this leaves us with a couple of options:

  1. Use threads. requests-futures seems like the way to go with this. This might be a good reference implementation.
  2. Use asyncio (and fall back to Trollius for older Pythons) and requests in an Executor (see http://stackoverflow.com/questions/22190403/how-could-i-use-requests-in-asyncio)
  3. Use asyncio + aiohttp in Python 3.4+, don't do parallel requests in Python 2.7

Of these, I like option 1 best. asyncio with an asynchronous HTTP library would be the most performant approach, but I think our goal is just to be faster rather than to be as fast as possible.
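For reference, a minimal sketch of option 1 using only the stdlib's concurrent.futures (requests-futures wraps this same thread-pool pattern around a requests.Session). The `fetch` function and the Clarity URLs here are stand-ins for illustration, not the project's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real HTTP call (e.g. requests.get, or a
# requests-futures FuturesSession) -- hypothetical, for illustration.
def fetch(url):
    return "<results for %s>" % url

# Hypothetical subjurisdiction URLs.
urls = [
    "http://results.enr.clarityelections.com/GA/Fulton/summary.html",
    "http://results.enr.clarityelections.com/GA/DeKalb/summary.html",
]

# map() preserves input order and reuses a small pool of worker
# threads, which also keeps us polite to the Clarity servers.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
```

Capping `max_workers` is the main tuning knob: it bounds how many requests are in flight at once.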

dwillis commented 9 years ago

Agreed. #1 is best.

zstumgoren commented 9 years ago

Yep, option 1 is best. I suspect there are still tons of people (myself included) using Python 2.x on a daily basis, so I wouldn't want to make this a 3.x-only feature.

ghing commented 9 years ago

Another useful reference example, perhaps: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
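Adapting that docs example to this use case might look something like the sketch below: submit one future per URL, map each future back to its URL, and collect results as they complete so a single failed request doesn't sink the batch. `fetch` is again a hypothetical stand-in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for the real HTTP request; hypothetical.
def fetch(url):
    return len(url)

urls = ["http://example.com/a", "http://example.com/bb"]

results = {}
with ThreadPoolExecutor(max_workers=5) as executor:
    # Map each Future back to its URL so errors can be reported
    # per-URL, as in the ThreadPoolExecutor docs example.
    future_to_url = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            results[url] = future.result()
        except Exception as exc:
            print("%r generated an exception: %s" % (url, exc))
```

Unlike `executor.map`, `as_completed` yields futures in completion order, which is handy if we want to start parsing each subjurisdiction's results as soon as its page arrives.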