Closed: pypt closed this issue 6 years ago
Covered the existing `LWP::UserAgent`-based implementation with unit tests in https://github.com/berkmancenter/mediacloud/tree/web_useragent_python; time to rewrite it to Python now.
Good news - the `UserAgent` class rewrite to Python is done in the `web_useragent_python` branch, tested (both automatically and manually) and ready to go.
Summary:
- Did some minor refactoring:
  - Switched to the `MediaWords::Util::Web::UserAgent` wrapper instead of plain `LWP::UserAgent`.
  - `UserAgent` now has a `get_follow_http_html_redirects(url: str) -> Response` method which will follow all HTML redirects too (`<meta />` refreshes, archive.org, archive.is, linkis.com, alarabiya hacks).
  - Moved `parallel_get()` to a method under `UserAgent()`.
- Covered pretty much 100% of `::UserAgent`'s behavior with Perl unit tests.
- Implemented the `UserAgent` class in Python, which uses `requests` behind the scenes.
- Rewrote `UserAgent`'s Perl unit tests to Python.
- Made Perl's `MediaWords::Util::Web::UserAgent` into a proxy package which uses Python's `UserAgent` class behind the scenes.
- Fixed the "Recognize Chinese chars in sentence extraction" issue because `requests` decodes GB2312 properly.
- Rewrote `encode_json()` and `decode_json()` from `MediaWords::Util::JSON::` to Python.
- Added `MediaWords::Util::URL::urls_are_equal()` (implemented in Python as `urls_are_equal(url1: str, url2: str) -> bool`) to compare URLs (e.g. `urls_are_equal('http://localhost:80///', 'HTTP://LOCALHOST') is True`).
- Added `is_urls()` / `isnt_urls()` Perl test helpers to `MediaWords::Test::URLs`, so now one can:

      use MediaWords::Test::URLs;
      sub test_urls()
      {
          is_urls( 'http://localhost', 'http://localhost/', 'URLs are expected to be equal' );
      }

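
A rough idea of what a comparison like `urls_are_equal()` has to do can be sketched in Python as below. This is only an illustrative sketch, not the actual Media Cloud implementation: the `normalize_url()` helper and the exact normalization rules (lowercasing scheme and host, dropping default ports, collapsing repeated slashes, ignoring fragments) are assumptions.

```python
import re
from urllib.parse import urlsplit

# Ports that are implied by the scheme and can be dropped before comparing
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> tuple:
    """Reduce a URL to a comparable tuple (hypothetical helper)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port == DEFAULT_PORTS.get(scheme):
        port = None
    # Collapse runs of slashes and treat an empty path as "/"
    path = re.sub(r"/+", "/", parts.path) or "/"
    # The fragment is deliberately ignored in this sketch
    return (scheme, host, port, path, parts.query)


def urls_are_equal(url1: str, url2: str) -> bool:
    """Compare two URLs after normalizing them (assumed rules, see above)."""
    return normalize_url(url1) == normalize_url(url2)
```

With these assumed rules, `urls_are_equal('http://localhost:80///', 'HTTP://LOCALHOST')` comes out as `True`, matching the example given in the summary.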
Backstory for the uninitiated (@ColCarroll, partially @rahulbot):
We want to gradually rewrite our core codebase from Perl to Python for various reasons.
Our chosen approach is to do the rewrite gradually, rewriting pieces of code from the bottom up while removing some obsolete code at the same time, until we reach the dreamed-of 100% Python codebase. We rewrite the codebase class-by-class and function-by-function, making both Perl and Python code use the very same class or function rewritten to Python.
In essence, Media Cloud is a piece of software which downloads some stuff from the web, stores it, fetches it back at some point and then mingles with said stuff in various ways. Because of that, the bottom-most code for the "bottom-up" approach is:
- `DatabaseHandler` class – "store stuff, fetch stuff" – rewritten to Python some time ago
- `UserAgent` class – "download some stuff" – just rewritten to Python
- `KeyValueStore::` classes – "store stuff, fetch stuff" – next in line to get rewritten

After rewriting those three, we can finally start rewriting the already implemented "mingle with stuff" code. This also opens the way to writing new "mingling with stuff" code in Python directly; e.g. if all three of the classes above were in Python, the recently deployed CLIFF / NYTLabels fetchers + taggers could have been implemented in Python directly.
While one can use `requests` directly to fetch some stuff here and there, I highly recommend that we continue using the `MediaWords::Util::Web::UserAgent` (Perl) / `UserAgent` (Python) class, which has less syntactic sugar but is better adapted to our "business" needs, whenever we can, at least for fetching random stuff from the unpredictable deep web, because it (among other things):
- Sets `User-Agent` and `From` HTTP request headers.
- […] `UserAgent` object which is required to fetch content from some websites.
- […] `<meta>`).
- […] `requests` does by default.

The best reference on how `UserAgent` works is perhaps its unit test. It closely follows how Perl's `LWP::UserAgent` is implemented (see `HTTP::Request`, `HTTP::Response` and `HTTP::Message` too).
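
To give a feel for the `<meta>` refresh handling that `get_follow_http_html_redirects()` is described as doing, here is a minimal sketch of pulling a refresh target out of an HTML page using only the standard library. The function name `meta_refresh_url()` and the parsing rules are hypothetical; the real class may well detect redirects differently.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _MetaRefreshParser(HTMLParser):
    """Remembers the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.refresh_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh_url is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=http://example.com/"
        match = re.search(
            r"url\s*=\s*['\"]?([^'\">]+)",
            attributes.get("content", ""),
            re.IGNORECASE,
        )
        if match:
            self.refresh_url = match.group(1).strip()


def meta_refresh_url(html: str, base_url: str) -> Optional[str]:
    """Return the absolute <meta> refresh target, or None if there is none."""
    parser = _MetaRefreshParser()
    parser.feed(html)
    if parser.refresh_url is None:
        return None
    # Resolve relative targets against the URL the page was fetched from
    return urljoin(base_url, parser.refresh_url)
```

A caller following HTML redirects would presumably fetch the page, call something like this on the response body, and fetch the returned URL if it is not `None`, with a cap on the number of hops.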
Will deploy tomorrow unless something else comes up.
Oh, I rewrote `HashServer` to Python too, so now we can test fetching stuff from the web using both Perl and Python.
The rewritten version is more advanced in that it creates forks for every request, so one can test stuff such as `parallel_get()`, timeouts, crashes etc.
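
For readers who have not seen `HashServer`, the general idea of a throwaway test web server driven by a dictionary of pages can be sketched roughly as follows. This is not the actual `HashServer` code: the names `start_test_server()` and `fetch_path()` are made up for illustration, and the sketch handles each request in a thread rather than a fork to stay portable.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def start_test_server(pages: dict) -> ThreadingHTTPServer:
    """Serve `pages` (path -> body text) on a random free local port."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = pages.get(self.path)
            status = 200 if body is not None else 404
            payload = (body if body is not None else "Not found").encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep test output quiet

    # Port 0 asks the OS for any free port
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_path(server: ThreadingHTTPServer, path: str) -> tuple:
    """Fetch `path` from the test server; return (status, decoded body)."""
    host, port = server.server_address[:2]
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        connection.close()
```

A test would start the server with a small dictionary of pages, point the user agent under test at `server.server_address`, and call `server.shutdown()` followed by `server.server_close()` when done.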
Deployed. There might be some small, easily fixable bugs left here and there (e.g. related to UTF-8 encoding), but in general everything seems to work (e.g. see Japanese query for "North Korea" in Dashboard).
Fixes #185.
Issue to track rewriting user agent (`LWP::UserAgent`) to Python.
It goes without saying that Media Cloud relies heavily on fetching stuff from the web. Currently we use `LWP::UserAgent` to do the job, but we want to move to using Python's `requests` so that new code which wants to fetch some stuff from the web could be written directly in Python.
Feature branch is `web_useragent_python`.
Tasks:
- Wrap `LWP::UserAgent` into `MediaWords::Util::Web::UserAgent` to establish the user agent's API that we're using
- Move `HTTP::HashServer` to `MediaWords::Test::HTTP::HashServer`
- Rewrite `MediaWords::Test::HTTP::HashServer` to Python to make it easier to test user agent code against a test web server
- Move `parallel_get()` into a method of the `MediaWords::Util::Web::UserAgent`
- […] `MediaWords::Util::Web::UserAgent` as all the requests would probably like to employ this hack
- Cover `MediaWords::Util::Web::UserAgent` with Perl unit tests to help retain the same expected behavior after the Python rewrite:
  - `User-Agent:` header
  - `From:` header
  - […] `::Determined` of specific HTTP status codes
  - […] `::Determined`, plus custom timing
  - […] `data/logs/`
  - `GET`
  - `POST`
  - `request()`: custom `METHOD`
  - `request()`: custom headers
  - `request()`: custom content type
  - `request()`: custom content (`POST` data) -- both hashref and string
  - `request()`: authorization
  - `get_string()`
  - `get()` and `post()` use `request()` internally; run `fix_common_url_mistakes()`, `is_http_url()`, add HTTP auth every time
- Rewrite `MediaWords::Util::Web::UserAgent` and its tests to Python
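
As an illustration of the `parallel_get()` task above, fetching several URLs concurrently could look roughly like the sketch below. The signature is an assumption made for this example (the real method lives on the user agent and may behave differently): here the fetching itself is passed in as a `fetch` callable, and a thread pool stands in for whatever concurrency mechanism the real code uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parallel_get(
    urls: List[str],
    fetch: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Fetch every URL concurrently.

    Returns a dict mapping each URL to whatever `fetch` returned for it,
    or to the exception that `fetch` raised.
    """
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the fetches actually overlap
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as ex:  # one failed URL must not sink the rest
                results[url] = ex
    return results
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; in practice `fetch` would presumably be a bound method of the user agent object.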