python / docsbuild-scripts

scripts for building documentation on docs.python.org
64 stars 58 forks source link

Speed up CDN purges by re-using a pool manager #184

Closed hugovk closed 2 months ago

hugovk commented 3 months ago

Helps https://github.com/python/docsbuild-scripts/issues/169.

Looking at the server logs (see en-3.12.log) for building the en/3.12 documentation, I noticed that during the publishing stage, there are 1,612 instances of purging URLs from the CDN. It does both the /3/ and /3.12/ URLs.

For example, it begins:

2024-08-11 08:33:44,746 DEBUG en/3.12: Purging https://docs.python.org/3/ from CDN
2024-08-11 08:33:44,747 DEBUG en/3.12: Starting new HTTPS connection (1): docs.python.org:443
2024-08-11 08:33:44,796 DEBUG en/3.12: https://docs.python.org:443 "PURGE /3/ HTTP/1.1" 200 53
2024-08-11 08:33:44,798 DEBUG en/3.12: Purging https://docs.python.org/3.12/ from CDN
2024-08-11 08:33:44,800 DEBUG en/3.12: Starting new HTTPS connection (1): docs.python.org:443
2024-08-11 08:33:44,841 DEBUG en/3.12: https://docs.python.org:443 "PURGE /3.12/ HTTP/1.1" 200 53
2024-08-11 08:33:44,843 DEBUG en/3.12: Purging https://docs.python.org/3/.buildinfo from CDN
2024-08-11 08:33:44,844 DEBUG en/3.12: Starting new HTTPS connection (1): docs.python.org:443
2024-08-11 08:33:44,882 DEBUG en/3.12: https://docs.python.org:443 "PURGE /3/.buildinfo HTTP/1.1" 200 52
2024-08-11 08:33:44,885 DEBUG en/3.12: Purging https://docs.python.org/3/about.html from CDN
2024-08-11 08:33:44,886 DEBUG en/3.12: Starting new HTTPS connection (1): docs.python.org:443
2024-08-11 08:33:44,926 DEBUG en/3.12: https://docs.python.org:443 "PURGE /3/about.html HTTP/1.1" 200 53

And ends:

2024-08-11 08:35:00,565 DEBUG en/3.12: Purging https://docs.python.org/3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 from CDN
2024-08-11 08:35:00,566 DEBUG en/3.12: Starting new HTTPS connection (1): docs.python.org:443
2024-08-11 08:35:00,606 DEBUG en/3.12: https://docs.python.org:443 "PURGE /3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 HTTP/1.1" 200 53

This takes 2 minutes and 44 seconds.

Each purge() command is calling requests.request(), which involves starting new HTTPS connections for each and every URL. The "Starting new HTTPS connection" comes from urllib3, the underlying HTTP client used by Requests.

We can do better by creating a Requests "session" which will open a single connection, and re-use it by calling session.request() in purge():

https://requests.readthedocs.io/en/latest/user/advanced/

Or, because Requests is essentially wrapping the urllib3 HTTP client, we might as well use that urllib3 directly instead of Requests. The API is virtually the same.

Comparison

Re-using a connection makes it about three times quicker on my machine.

Timings from my Mac using the following test scripts.

Original purge(): 33 seconds

import logging
from build_docs import purge

PATHS = [  # the 1,612 paths
    "3/",
    <snip>
    "3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2",
]

logging.basicConfig(level=logging.DEBUG)

for path in PATHS:
    purge(path)

With requests.Session(): 11 seconds

import logging
import requests  # NEW
from build_docs import purge

PATHS = [  # the 1,612 paths
    "3/",
    <snip>
    "3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2",
]

logging.basicConfig(level=logging.DEBUG)

session = requests.Session()  # NEW

for path in PATHS:
    purge(session, path)  # CHANGED

Output is like:

DEBUG:root:Purging https://docs.python.org/3.12/archives/python-3.12.0b4-docs-html.tar.bz2 from CDN
DEBUG:urllib3.connectionpool:https://docs.python.org:443 "PURGE /3.12/archives/python-3.12.0b4-docs-html.tar.bz2 HTTP/1.1" 200 54
DEBUG:root:Purging https://docs.python.org/3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 from CDN
DEBUG:urllib3.connectionpool:https://docs.python.org:443 "PURGE /3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 HTTP/1.1" 200 54

With urllib3.PoolManager(): 10 seconds

import logging
import urllib3  # CHANGED
from build_docs import purge

PATHS = [  # the 1,612 paths
    "3/",
    <snip>
    "3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2",
]

logging.basicConfig(level=logging.DEBUG)

http = urllib3.PoolManager()  # CHANGED

for path in PATHS:
    purge(http, path)  # CHANGED

Output is like:

DEBUG:root:Purging https://docs.python.org/3.12/archives/python-3.12.0b4-docs-html.tar.bz2 from CDN
DEBUG:urllib3.connectionpool:https://docs.python.org:443 "PURGE /3.12/archives/python-3.12.0b4-docs-html.tar.bz2 HTTP/1.1" 200 0
DEBUG:root:Purging https://docs.python.org/3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 from CDN
DEBUG:urllib3.connectionpool:https://docs.python.org:443 "PURGE /3.12/archives/python-3.12.3-docs-pdf-a4.tar.bz2 HTTP/1.1" 200 0
hugovk commented 2 months ago

The improvement is even larger on the docs server. For en/3.14: 57s -> 3.5s 🚀

Before: 57s

64871:2024-09-17 12:32:39,535 DEBUG en/3.14: Purging https://docs.python.org/dev/ from CDN
64872:2024-09-17 12:32:39,538 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
64873:2024-09-17 12:32:39,580 DEBUG en/3.14: https://docs.python.org:443 "PURGE /dev/ HTTP/1.1" 200 52
64874:2024-09-17 12:32:39,582 DEBUG en/3.14: Purging https://docs.python.org/3.14/ from CDN
64875:2024-09-17 12:32:39,583 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
64876:2024-09-17 12:32:39,624 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/ HTTP/1.1" 200 53
64877:2024-09-17 12:32:39,626 DEBUG en/3.14: Purging https://docs.python.org/dev/.buildinfo from CDN
64878:2024-09-17 12:32:39,627 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
64879:2024-09-17 12:32:39,668 DEBUG en/3.14: https://docs.python.org:443 "PURGE /dev/.buildinfo HTTP/1.1" 200 52
64880:2024-09-17 12:32:39,670 DEBUG en/3.14: Purging https://docs.python.org/dev/404.html from CDN

[snip 3328 lines]

68209:2024-09-17 12:33:36,345 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-pdf-a4.tar.bz2 HTTP/1.1" 200 52
68210:2024-09-17 12:33:36,347 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-html.tar.bz2 from CDN
68211:2024-09-17 12:33:36,348 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
68212:2024-09-17 12:33:36,389 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-html.tar.bz2 HTTP/1.1" 200 52
68213:2024-09-17 12:33:36,391 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-text.tar.bz2 from CDN
68214:2024-09-17 12:33:36,392 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
68215:2024-09-17 12:33:36,432 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-text.tar.bz2 HTTP/1.1" 200 50
68216:2024-09-17 12:33:36,434 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs.epub from CDN
68217:2024-09-17 12:33:36,436 DEBUG en/3.14: Starting new HTTPS connection (1): docs.python.org:443
68218:2024-09-17 12:33:36,493 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs.epub HTTP/1.1" 200 52

After: 3.5s

36729:2024-09-18 10:39:56,295 DEBUG en/3.14: Purging https://docs.python.org/dev/ from CDN
36730:2024-09-18 10:39:56,296 DEBUG en/3.14: Resetting dropped connection: docs.python.org
36731:2024-09-18 10:39:56,343 DEBUG en/3.14: https://docs.python.org:443 "PURGE /dev/ HTTP/11" 200 0
36732:2024-09-18 10:39:56,344 DEBUG en/3.14: Purging https://docs.python.org/3.14/ from CDN
36733:2024-09-18 10:39:56,347 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/ HTTP/11" 200 0
36734:2024-09-18 10:39:56,347 DEBUG en/3.14: Purging https://docs.python.org/dev/.buildinfo from CDN
36735:2024-09-18 10:39:56,350 DEBUG en/3.14: https://docs.python.org:443 "PURGE /dev/.buildinfo HTTP/11" 200 0
36736:2024-09-18 10:39:56,350 DEBUG en/3.14: Purging https://docs.python.org/dev/404.html from CDN
36737:2024-09-18 10:39:56,353 DEBUG en/3.14: https://docs.python.org:443 "PURGE /dev/404.html HTTP/11" 200 0
36738:2024-09-18 10:39:56,353 DEBUG en/3.14: Purging https://docs.python.org/dev/about.html from CDN

[snip 2221]

38960:2024-09-18 10:39:59,833 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-texinfo.zip from CDN
38961:2024-09-18 10:39:59,835 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-texinfo.zip HTTP/11" 200 0
38962:2024-09-18 10:39:59,835 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-pdf-a4.tar.bz2 from CDN
38963:2024-09-18 10:39:59,837 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-pdf-a4.tar.bz2 HTTP/11" 200 0
38964:2024-09-18 10:39:59,837 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-html.tar.bz2 from CDN
38965:2024-09-18 10:39:59,840 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-html.tar.bz2 HTTP/11" 200 0
38966:2024-09-18 10:39:59,840 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs-text.tar.bz2 from CDN
38967:2024-09-18 10:39:59,842 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs-text.tar.bz2 HTTP/11" 200 0
38968:2024-09-18 10:39:59,842 DEBUG en/3.14: Purging https://docs.python.org/3.14/archives/python-3.14.0a0-docs.epub from CDN
38969:2024-09-18 10:39:59,846 DEBUG en/3.14: https://docs.python.org:443 "PURGE /3.14/archives/python-3.14.0a0-docs.epub HTTP/11" 200 0

Adding this up, 12 languages 3 versions (57-3.5) seconds = ~32 minutes saved for a full build loop. :tada: