mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Rewrite user agent (URL fetching) code to Python #178

Closed. pypt closed this issue 6 years ago

pypt commented 7 years ago

Issue to track rewriting user agent (LWP::UserAgent) to Python.

It goes without saying that Media Cloud relies heavily on fetching stuff from the web. Currently we use LWP::UserAgent to do the job, but we want to move to Python's requests so that new code which also needs to fetch stuff from the web can be written directly in Python.

Feature branch is web_useragent_python.

Tasks:

pypt commented 7 years ago

Covered the existing LWP::UserAgent-based implementation with unit tests in https://github.com/berkmancenter/mediacloud/tree/web_useragent_python; time to rewrite it to Python now.

pypt commented 6 years ago

Good news: the UserAgent class rewrite to Python is done in the web_useragent_python branch, tested (both automatically and manually) and ready to go.

Summary:

Backstory for the uninitiated (@ColCarroll, partially @rahulbot):

We want to gradually rewrite our core codebase from Perl to Python for various reasons.

Our chosen approach is to do the rewrite gradually, rewriting pieces of code from the bottom up while removing some obsolete code at the same time, until we reach the dreamed-of 100% Python codebase. We rewrite the codebase class by class and function by function, making both Perl and Python code use the very same class or function once it has been rewritten to Python.

In essence, Media Cloud is a piece of software which downloads some stuff from the web, stores it, fetches it back at some point and then mingles with said stuff in various ways. Because of that, the bottom-most code for the "bottom-up" approach is:

After rewriting those three, we can finally start rewriting the already implemented "mingle with stuff" code. This also opens the way for us to write new "mingling with stuff" code in Python directly: e.g. if all three of the classes above had been in Python, the recently deployed CLIFF / NYTLabels fetchers + taggers could have been implemented in Python from the start.

While one can use requests directly to fetch some stuff here and there, I highly recommend that we continue using the less syntax-sugary, but better adapted to our "business" needs, MediaWords::Util::Web::UserAgent (Perl) / UserAgent (Python) class whenever we can, at least for fetching random stuff from the unpredictable deep web, because it (among other things):

The best reference on how UserAgent works is perhaps its unit test. It closely follows how Perl's LWP::UserAgent is implemented (see HTTP::Request, HTTP::Response and HTTP::Message too).

pypt commented 6 years ago

Will deploy tomorrow unless something else comes up.

pypt commented 6 years ago

Oh, I rewrote HashServer to Python too, so now we can test fetching stuff from the web using both Perl and Python.

The rewritten version is more advanced in that it creates forks for every request, so one can test stuff such as parallel_get(), timeouts, crashes etc.
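To illustrate the idea of a test server that can handle several requests at once, here is a minimal sketch using only the Python standard library. It is not the actual HashServer code: the names `make_hash_server`, `delayed`, `start_in_background`, `fetch` and `fetch_all` are hypothetical, and it uses one thread per request (`ThreadingHTTPServer`) instead of one fork per request to stay portable. On Unix, `socketserver.ForkingMixIn` would be the closer analogue of the forking behaviour described above.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_hash_server(pages, port=0):
    """Build a test server from `pages`, a dict mapping a path to either a
    string body or a zero-argument callable returning the body.
    (Hypothetical helper, not the real HashServer API.)"""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            page = pages.get(self.path)
            if page is None:
                self.send_error(404)
                return
            # A callable page may sleep or raise to simulate a slow or broken site.
            body = page() if callable(page) else page
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, format, *args):
            pass  # keep test output quiet

    # port=0 lets the OS pick a free port; each request gets its own thread.
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)


def delayed(body, seconds):
    """Return a page callable that waits `seconds` before answering."""
    def page():
        time.sleep(seconds)
        return body
    return page


def start_in_background(server):
    """Serve requests from a daemon thread so the test can keep running."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


def fetch(url, timeout=5.0):
    """Fetch one URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(urls, timeout=5.0):
    """Fetch several URLs concurrently; results keep the order of `urls`.
    (A stand-in for illustration, not the real parallel_get().)"""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(lambda url: fetch(url, timeout), urls))
```

A test would build the server with something like `make_hash_server({"/": "hello", "/slow": delayed("slow", 0.5)})`, call `start_in_background()` on it, read the chosen port from `server.server_address[1]`, and then check that a slow page does not block a fast one and that a short client timeout fails against the slow page.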

pypt commented 6 years ago

Deployed. There might be some small, easily fixable bugs left here and there (e.g. related to UTF-8 encoding), but in general everything seems to work (e.g. see Japanese query for "North Korea" in Dashboard).

Fixes #185.