Adding built-in concurrency

mahmoud commented 10 years ago

As mentioned before, my key area of work on wapiti lately has been toward making it nonblocking and concurrent, while at the same time self-contained (not relying on gevent or Twisted), so that wider audiences can use it. Compiling gevent is a headache, and compiling it on Windows/Mac doubly so. I've grown to appreciate Twisted of late, but the Wikipedia API and Wapiti have a high enough learning curve as it is.

What I ended up doing is embarking on a journey to write a good, concurrent, framework-agnostic HTTP library. Not just a wrapper for broken ones (Started off using requests, got bitten by the feature bloat and lack of core competence, that's why ransom.py exists). Not one that's limited to the client side (though for Wapiti's sake I focused on client stuff first). It's called Hematite and a mostly recent version of it is checked into the branch I just pushed (hematite_integration).

The main philosophical difference here is that old wapiti was work agnostic. That is, it would simply execute "tasks" (i.e., get_current_task()). But eventually I realized that these tasks are pretty much all boiling down to web requests. It would be silly to try to solve data processing/CPU-bound work in this fashion. So the new wapiti embraces its web nature and instead has get_current_responses(). Now:

This returns instances of Hematite's ClientResponse, which can be joined on (aka driven concurrently)
Emphasis on the multiple, it's no longer a single task, but many web requests/responses.

Getting concurrency working in CompoundOperations got me a bit tripped up because, based on approach, there are potentially scheduling issues. The rarified resources are: sockets/connections on the local machine and API courtesy/processing power on the Wiki* side). That said, I'm probably going to take a simple approach. On a personal note, I find it funny that I learned to embrace Wapiti as a DSL relatively early on, but didn't embrace the rest of it as a runtime until just recently.

Back to the point, you can see me toying with the new process in tmp_new_process.py. Comment out some of those pdbs and you'll see it runs pretty smoothly. I haven't enabled all the Hematite features yet, I've just been doing some shakedowns. But now that there's someone technically interested, off whom I can bounce some ideas, I expect I'll make more progress in the near future!

BurntSushi commented 10 years ago

Very nice! I don't really have many comments, although I did take a dive through Hematite. It's quite a bit easier to understand than the Wapiti internals. :-)

Will you be pushing Hematite into its own package? I for one would like an asynchronous HTTP library without pulling in massive dependencies.

BurntSushi commented 10 years ago

Oh, could you maybe elaborate more a bit on CompoundOperations? Is it because they introduce dependencies between requests/responses? (I assume Hematite would not help you with that.) I think I'm confused by your mention of resource usage?

BurntSushi commented 10 years ago

Another question: is there something about existing HTTP libraries that does not make them amenable to, say, spawning a new Python thread for each request and using that to achieve concurrency?

mahmoud commented 10 years ago

Hey, glad you like it! Yeah, 90% of the time, I don't actually need a whole evented system, just need to scatter/gather a few URLs and it would be nice to do so without worrying about threading issues as much.

Generally speaking (not specific to Python yet), when it comes to strictly I/O bound stuff, many threads are fine. Probably not what you want for more than a hundred concurrent requests, but it would work. It takes time to spin up threads, not quite as much as processes, but it's significant. The real reason threads are a hassle is because they introduce a lot of complexity and global state, and the consequences can be pretty dire. Of course, getting a bit more Python-specific, there's the GIL, but that's not as big of a deal as people make it out to be, at least just for basic networking; most stuff is I/O bound and releases the GIL, and the heavier CPU-bound stuff, like SSL and gzip also release the GIL.

All that said, Hematite is meant to be compatible with many concurrency patterns. Twisted, gevent, threads, and more. And yes, it will be getting its own package very soon.

mahmoud commented 10 years ago

As far as CompoundOperations go, let's look at an example. Say, GetCategoryTemplates, a hypothetical operation that takes a category and returns a list of templates used by articles in that category. A CompoundOperation with a chain of GetCategory and GetTemplates. Now instantiate it:

gct = GetCategoryTemplates(['Physics', 'Math'], limit=100)

So the goal is to get 100 templates used by articles in the Physics and Math categories. (We're leaving out recursive category members and namespacing). Now in this pipeline, there's like a waterfall of parameters flowing through the system, and we want to optimize for what gets us to that 100 limit fastest.

Previously, without concurrency, we would have just done Physics, then Math, but now we can do both at the same time. But if the first results from Physics are sufficient to hit that 100 limit, then I just wasted time (not to mention system resources, local and remote) fetching Math articles. That's the crux of what I meant by scheduling: The scheduling approach internal to Wapiti. Also I feel that a well-behaved client wouldn't eat up thousands of sockets just because it could.

There are already some artifacts of this in operations/category.py, you can see the Tune() decorator being used. (If you look into Tune's implementation, ignore the fancy metaclass stuff and just remember that all it is trying to do is act like a passthrough).

The other thing I wanted to do with CompoundOperations was to somehow enable each parameter to result in say, TWO web requests, one to this API and another to that API, and then only register complete when both came back and are combined. This is necessary for some inconsistent APIs that return anemic results that don't quite fit in with the rest of the models (I think search falls into this group). But that's a discussion for after hematite_integration gets merged. :)

mahmoud / wapiti

Adding built-in concurrency #25