mediawiki-utilities / python-mwapi

Simple Python Wrapper around MediaWiki API
http://pythonhosted.org/mwapi
MIT License

Optimized Wikimedia-internal requests #45

Closed. bearloga closed this 5 months ago

bearloga commented 2 years ago

In T300977#7700803, akosiaris wrote:

In T300977#7700725, Ottomata wrote:

BTW, the proper way to access MW APIs from within our networks is to use e.g. https://api-ro.discovery.wmnet and set the HTTP Host header to the domain of the site you want to access, e.g. www.wikidata.org.

Then you don't need a proxy.

It's not just a BTW unfortunately. There's multiple benefits to not going through the web proxy to reach the MW APIs, e.g.

  • avoiding artificially polluting the organically (that's human user traffic) populated cache of the reverse proxies,
  • avoiding obfuscating logs with IPs that do not belong to the internal endpoint that actually talks to the MW API (or other endpoints),
  • avoiding a SPOF (there aren't that many web proxies, nor is it a highly available setup, because there isn't any need for one),
  • avoiding saturating a host that is offering this service alongside other services,
  • avoiding the webproxy's own cache,
  • avoiding another 4 intermediaries (1 outgoing proxy, 1 tls terminator, 2 reverse proxies) in the path to the desired content
  • avoiding adding latency to requests
  • probably others that I am missing.

Since mwapi is heavily used by WMF's researchers and data analysts on WMF's analytics cluster, it would be enormously beneficial to have a feature which optimizes requests when used internally.

Currently there is no way to customize requests used by mwapi.Session so that they go to https://api-ro.discovery.wmnet and the HTTP Host header is set to host (as provided by user when creating a Session).

bearloga commented 2 years ago

@geohci asked me

any initial ideas on whether it's possible to effectively auto-detect internal access vs. just make it a parameter that can be set?

One option is to look for the presence of "wmnet" in the host name (e.g. "stat1007.eqiad.wmnet", as obtained in Python via socket.getfqdn()). I don't think hardcoding "wmnet" (and the destination) is preferable here, although it would auto-optimize a whole lot of existing code. It would also require the user to unset the http_proxy and https_proxy environment variables, or else api-ro.discovery.wmnet would be treated as external. The logic could unset those itself, but that's A LOT of assumptions to hardcode. The parameterized approach would be better.
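For illustration, the host-name check could look roughly like the sketch below. It is not part of mwapi: the function name looks_internal and its suffix parameter are hypothetical, and it matches a ".wmnet" suffix rather than "wmnet" anywhere in the name, which is an assumption about what should count as internal.

```python
import socket


def looks_internal(fqdn=None, suffix=".wmnet"):
    """Guess whether this machine sits on the internal network.

    Hypothetical helper: compares the fully qualified host name against
    a domain suffix. Pass fqdn explicitly to check a name other than the
    local machine's.
    """
    name = socket.getfqdn() if fqdn is None else fqdn
    # Normalise case and drop a trailing root dot before comparing.
    name = name.lower().rstrip(".")
    return name.endswith(suffix)
```

On a host such as "stat1007.eqiad.wmnet" this would return True with no arguments. It only inspects the name and does nothing about proxy settings, which is part of why the parameterized approach looks safer.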

A single additional parameter should, in theory, be sufficient. So mwapi.Session(host='https://wikidata.org') would become mwapi.Session(host='https://api-ro.discovery.wmnet', http_host='www.wikidata.org') (for example), and the only extra step internally is to set self.headers['Host'] = http_host if it's not None (the default). Note that http_host is a bare domain with no scheme, since it is sent as the Host header.
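Until something like that exists in mwapi, the same behaviour could be approximated from the outside. A minimal sketch, assuming mwapi.Session exposes a mutable headers mapping; make_session, http_host and session_factory are hypothetical names, not mwapi API:

```python
def make_session(host, http_host=None, session_factory=None, **kwargs):
    """Create a session for `host`, optionally overriding the Host header.

    Hypothetical wrapper mirroring the proposed http_host parameter.
    session_factory defaults to mwapi.Session and exists mainly so the
    function can be exercised without network access.
    """
    if session_factory is None:
        # Imported lazily so this sketch loads even without mwapi installed.
        import mwapi
        session_factory = mwapi.Session
    session = session_factory(host=host, **kwargs)
    if http_host is not None:
        # Host header carries a bare domain, e.g. 'www.wikidata.org'.
        session.headers["Host"] = http_host
    return session
```

Calling make_session('https://api-ro.discovery.wmnet', http_host='www.wikidata.org') would then stand in for the proposed two-argument Session call, and leaving http_host out keeps today's behaviour.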

In either case the user would have to disable the HTTP/HTTPS proxy.
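One way to disable the proxy for a limited stretch of code is to drop the proxy environment variables temporarily. A sketch using only the standard library; without_proxy_env and PROXY_VARS are hypothetical names, and the list covers only http_proxy and https_proxy plus their upper-case spellings (an assumption; other proxy variables may be set in some environments):

```python
import os
from contextlib import contextmanager

# Variables named in this thread, plus upper-case variants (assumption).
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


@contextmanager
def without_proxy_env(names=PROXY_VARS):
    """Temporarily remove proxy variables from os.environ."""
    # Remember and remove whichever of the variables are currently set.
    saved = {name: os.environ.pop(name) for name in names if name in os.environ}
    try:
        yield
    finally:
        # Restore the original values, even if the body raised.
        os.environ.update(saved)


# Intended use (requests must be issued inside the block):
# with without_proxy_env():
#     session.get(action="query", meta="siteinfo")
```

This keeps the change scoped, so other code in the same process that really does need the proxy is unaffected once the block exits.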

lucaswerkmeister commented 2 years ago

Currently there is no way to customize requests used by mwapi.Session so that they go to https://api-ro.discovery.wmnet and the HTTP Host header is set to host (as provided by user when creating a Session).

What about

session = mwapi.Session(host='https://api-ro.discovery.wmnet')
session.headers['Host'] = 'www.wikidata.org'

? (Setting session.session.headers, i.e. on the internal requests session, also works.)

$ python 2>/dev/null <<EOF
import mwapi
session = mwapi.Session(host="https://www.wikidata.org")
session.headers["Host"] = "en.wikipedia.org"
print(session.get(action="query", meta="siteinfo")["query"]["general"]["servername"])
EOF
en.wikipedia.org

bearloga commented 5 months ago

Thanks @lucaswerkmeister!

There have been some updates on the infra side, like MW on k8s, so the complete solution ends up being:

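import os

# Make requests verify TLS against the system CA bundle instead of its
# default (certifi); set this before any request is sent.
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'

import mwapi

# Connect to the internal endpoint...
session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
# ...and name the target wiki in the Host header.
session.headers['Host'] = 'en.wikipedia.org'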

For more info: