Closed by bearloga 5 months ago
@geohci asked me:
any initial ideas on whether it's possible to effectively auto-detect internal access vs. just make it a parameter that can be set?
One option is to look for the presence of "wmnet" in the host name (e.g. "stat1007.eqiad.wmnet", which is obtained in Python via socket.getfqdn()). I don't think hardcoding "wmnet" (and the destination) is preferable here, although it would auto-optimize a whole lot of existing code. (It would also require the user to unset the http_proxy and https_proxy environment variables, or else api-ro.discovery.wmnet would be treated as external, unless the logic also unsets those, but that's A LOT of assumptions to hardcode.) The parameterized approach would be better.
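For reference, the auto-detection idea could look roughly like the sketch below. The function name looks_internal and its suffix parameter are hypothetical, and the ".wmnet" check is exactly the hardcoded assumption discussed above, so treat this as an illustration of the trade-off rather than a recommendation.

```python
import socket


def looks_internal(fqdn=None, suffix=".wmnet"):
    """Guess whether this machine is on the internal network.

    Hypothetical helper: the ".wmnet" suffix is a hardcoded assumption,
    not something mwapi knows about.
    """
    if fqdn is None:
        # e.g. "stat1007.eqiad.wmnet" on an analytics host
        fqdn = socket.getfqdn()
    # Normalize case and a possible trailing dot before comparing.
    return fqdn.lower().rstrip(".").endswith(suffix)
```

Passing the host name explicitly, as in looks_internal("stat1007.eqiad.wmnet"), makes the check easy to try out on a machine that is not on the cluster.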
A single additional parameter should, in theory, be sufficient. So mwapi.Session(host='https://wikidata.org') would become mwapi.Session(host='https://api-ro.discovery.wmnet', http_host='https://wikidata.org') (for example), and the only extra step internally is to set self.headers['Host'] = http_host if it's not None (the default).
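A minimal sketch of that parameterized approach, written as a wrapper rather than a change to mwapi itself. The names make_session and session_factory are hypothetical, and http_host is the proposed parameter, not an existing mwapi argument. The sketch passes a bare host name such as 'www.wikidata.org', since an HTTP Host header carries a host name rather than a full URL.

```python
def make_session(host, http_host=None, session_factory=None, **kwargs):
    """Create a session for `host`, optionally overriding the Host header.

    Hypothetical wrapper: `http_host` mirrors the parameter proposed in
    this thread; it is not part of mwapi.Session.
    """
    if session_factory is None:
        # Imported lazily so the wrapper can be tried without mwapi installed.
        import mwapi
        session_factory = mwapi.Session
    session = session_factory(host, **kwargs)
    if http_host is not None:
        # The only extra step: requests go to `host`, but name `http_host`.
        session.headers["Host"] = http_host
    return session
```

With this, make_session('https://api-ro.discovery.wmnet', http_host='www.wikidata.org') would behave like the two-argument call described above, and omitting http_host leaves the session untouched.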
In either case the user would have to disable HTTP/HTTPS proxy.
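One way to disable the proxy from within Python is to drop the proxy variables from the environment before any request is made. The helper name clear_proxy_env is hypothetical, and the sketch assumes the proxy is configured only through these environment variables (requests reads both the lowercase and uppercase forms).

```python
import os

# Environment variables that requests consults for HTTP/HTTPS proxies.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_env(environ=os.environ):
    """Remove proxy variables from `environ` and return what was removed."""
    removed = {}
    for name in PROXY_VARS:
        if name in environ:
            removed[name] = environ.pop(name)
    return removed
```

Returning the removed values makes it possible to restore them afterwards if other code in the same process still needs the proxy for external requests.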
Currently there is no way to customize the requests used by mwapi.Session so that they go to https://api-ro.discovery.wmnet and the HTTP Host header is set to host (as provided by the user when creating a Session).
What about
session = mwapi.Session(host='https://api-ro.discovery.wmnet')
session.headers['Host'] = 'www.wikidata.org'
? (Setting session.session.headers, i.e. on the internal requests session, also works.)
$ python 2>/dev/null <<EOF
import mwapi
session = mwapi.Session(host="https://www.wikidata.org")
session.headers["Host"] = "en.wikipedia.org"
print(session.get(action="query", meta="siteinfo")["query"]["general"]["servername"])
EOF
en.wikipedia.org
Thanks @lucaswerkmeister!
There have been some updates on the infra side, such as MW on k8s, so the complete solution ends up being:
import os
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
import mwapi
session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'
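To confirm that the Host header took effect, the siteinfo check from earlier in the thread can be wrapped in a small function. This is only a sketch: the name served_by is hypothetical, and it assumes the session object exposes get() the way mwapi.Session does in the example above.

```python
def served_by(session):
    """Return the server name the API reports for this session.

    Uses the same siteinfo query as the earlier one-liner in this thread.
    """
    response = session.get(action="query", meta="siteinfo")
    return response["query"]["general"]["servername"]
```

For the session configured above, served_by(session) should come back as 'en.wikipedia.org' if the internal endpoint honors the Host header.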
Since mwapi is heavily used by WMF's researchers and data analysts on WMF's analytics cluster, it would be enormously beneficial to have a feature which optimizes requests when used internally.