Closed dduportal closed 3 weeks ago
Comment by @daniel-beck about the bandwidth in a discussion we had together on this topic:
By @dduportal: I don't recall the exact amount of data transferred, but it was huge even for these tiny HTML files. We're speaking about TBs per month (globally, it's 50 TB per month).
Did you just group by file extension, or also path? Because some of the "JSON" files also have an HTML file extension. So if you count https://updates.jenkins.io/update-center.json.html as HTML, that'll skew this a lot.
=> Important point, as it means we might have to change the routing pattern.
Cloudflare Analytics shows that HTML was far behind in amount of requests but we can't tell the different HTML files apart:
Proposal: Given the context of the new Update Center, let's use absolute URL links.
updates.jenkins-ci.org to this domain.getPath() method when retrieving download URL from getDownloadUrl() during the HTML building (but NOT when retrieving data from Artifactory or building htaccess files!)
What are your thoughts on this @daniel-beck @timja @MarkEWaite ?
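To illustrate why relative links stop working once an index page ends up being served from a mirror hostname, here is a minimal Python sketch of how a browser resolves links. It is only an illustration: the version number and file name in the hrefs are hypothetical, and the mirror hostname is the one that appears in the curl output later in this thread.

```python
from urllib.parse import urljoin

# Hypothetical relative href, as it might appear in a generated index page.
RELATIVE_HREF = "1.36/htmlpublisher.hpi"
# Hypothetical absolute href pinned to the canonical hostname (the proposal).
ABSOLUTE_HREF = (
    "https://updates.jenkins.io/download/plugins/htmlpublisher/1.36/htmlpublisher.hpi"
)


def resolve(page_url: str, href: str) -> str:
    """Resolve href against the URL the page was finally served from."""
    return urljoin(page_url, href)


# The same index page, before and after a redirect to a mirror hostname.
canonical_page = "https://updates.jenkins.io/download/plugins/htmlpublisher/"
mirror_page = "https://mirrors.updates.jenkins.io/download/plugins/htmlpublisher/"

for page in (canonical_page, mirror_page):
    print(page)
    print("  relative ->", resolve(page, RELATIVE_HREF))
    print("  absolute ->", resolve(page, ABSOLUTE_HREF))
```

The relative link follows whatever hostname served the page, while the absolute link always points back to updates.jenkins.io, which can then redirect again as needed.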
Absolute URL makes sense to me.
Cloudflare Analytics shows that HTML was far behind in amount of requests
It's by far the most popular content type? How does that make any sense?
Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?
It doesn't look like we understand enough what's going on here to base any decisions on.
We understand the mirroring mechanism, which is why I opened this issue. If we start selecting which files are mirrored and which are not, the architectural complexity will be a pain, as we will need to maintain a list of conditions. It is already nightmare-ish on get.jenkins.io, tbh.
Hence the question about the pros and cons of switching to absolute URLs, which is not mutually exclusive with analysing usage to understand it better.
The costs involved here are huge compared to the optimization, but it is mandatory to have a finer-grained understanding.
Hello @daniel-beck 👋
Cloudflare Analytics shows that HTML was far behind in amount of requests
It's by far the most popular content type?
My apologies, I mistakenly used the word "behind". You are correct, I meant that HTML seems to be, by far, the most popular type of file downloaded, at least as per the Cloudflare dashboard during the 24-hour experiment.
Let me check if we see the same result on the current VM (analysing the logs from a few days ago).
How does that make any sense?
I don't know. Let's compare with current behavior. That could also be "assumed" content type (including HTTP/404) as they are served as HTML as well.
Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?
I ... don't know. We did not even know there was an HTML version of this one. Where should we look (except our access logs)?
Initial check for the 09 October 2024 (both HTTP and HTTPS, both updates.jenkins-ci.org and updates.jenkins.io vhosts):
~ 8,478,760 hits
~ 444,350 visitors
~ 5,000,000 redirections (HTTP/3XX) for around 1.2 GiB
~ 3,200,000 files served (HTTP/2XX) for around 2.1 TiB
~ 257,890 client errors (HTTP/4XX) for around 43 MiB
Report (generated with GoAccess from the "combined" access log):
@daniel-beck If we compare with Cloudflare numbers for 24 hours, which are only HTTP/2XX and HTTP/4XX (as the redirects are NOT sent to Cloudflare), it maps:
Need to check the HTML/JSON distribution on the current production, but the high rate of HTTP/4XX clearly explains the ratio change during the brownout.
It also adds more weight to using absolute URLs in the generated HTML files to decrease this amount of HTTP/4XX.
Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.
1/3 of JSON, and 2/3 of HTML
The problem with this view is that there are different kinds of HTML files on this domain.
The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.
Various update-center.json.html exist and are irrelevant for this topic. Half the tool installer files (e.g. in https://updates.jenkins.io/updates/ ) are HTML files and are irrelevant for this topic.
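The distinction between these kinds of "HTML" files can be sketched as a path-based classifier rather than a grouping by last file extension. This is a rough Python illustration only: the bucket names and matching rules are assumptions made for this example, not how Cloudflare or GoAccess categorize requests.

```python
def classify(path: str) -> str:
    """Bucket a request path by what it is used for, not by its last extension."""
    if path.endswith(".json.html"):
        # JSON payload wrapped in HTML (update-center.json.html, tool installers).
        return "json-in-html"
    if path.startswith("/download/") and (path.endswith("/") or path.endswith(".html")):
        # Human-facing directory index: the pages this issue is about.
        return "download-index"
    if path.endswith(".json"):
        return "json"
    if path.endswith(".html"):
        return "other-html"
    return "other"


for sample in (
    "/updates/hudson.tasks.Maven.MavenInstaller.json.html",
    "/update-center.json.html",
    "/download/plugins/htmlpublisher/",
    "/update-center.json",
):
    print(f"{classify(sample):15} {sample}")
```

Grouping by extension alone would put the first three sample paths in the same "HTML" bucket, which is the skew described above.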
Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.
The report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz and got 4 files (unsecured and secured, for both hostnames).
Additions:
goaccess tool on it (specifying combined logs format). The "concatenated" file weighs 1.2 GB: do you want me to send it to you (compressed) through a private channel @daniel-beck to avoid further unneeded tasks for you?
The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.
Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.
I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files.
Unless you want to check the usage for actions (blockers or optimizations) if wget --recursive is used?
What did I miss?
As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.
The most popular URL that this issue is about is accessed just 24 times across the 4 logs:
24 /download/plugins/htmlpublisher/
Compared to:
508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html
Methodology (prove me wrong):
cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted
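For readers who prefer to cross-check the numbers without intermediate files, the same counting can be sketched in Python. This is an approximation of the pipeline above, not a replacement for it: the function name is made up for this example, and the sample log lines are invented (documentation IP ranges), not taken from the real access logs.

```python
from collections import Counter
from typing import Iterable


def count_get_paths(lines: Iterable[str]) -> Counter:
    """Count GET requests per path, ignoring query strings."""
    counts: Counter = Counter()
    for line in lines:
        # Keep only lines containing "GET ", like `fgrep 'GET '`, and drop
        # everything up to the last "GET ", like the greedy first sed.
        _, sep, rest = line.rpartition("GET ")
        if not sep:
            continue
        # Cut at the first space, then drop the query string.
        target = rest.split(" ", 1)[0]
        counts[target.split("?", 1)[0]] += 1
    return counts


# Invented sample lines in "combined" log format.
SAMPLE = [
    '203.0.113.7 - - [03/Oct/2024:00:00:01 +0000] "GET /updates/hudson.tasks.Maven.MavenInstaller.json.html?id=maven HTTP/1.1" 200 512 "-" "Java/17"',
    '203.0.113.8 - - [03/Oct/2024:00:00:02 +0000] "GET /updates/hudson.tasks.Maven.MavenInstaller.json.html HTTP/1.1" 200 512 "-" "Java/17"',
    '198.51.100.4 - - [03/Oct/2024:00:00:03 +0000] "GET /download/plugins/htmlpublisher/ HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
    '198.51.100.5 - - [03/Oct/2024:00:00:04 +0000] "HEAD /update-center.json HTTP/1.1" 200 0 "-" "curl/8.0"',
]

counts = count_get_paths(SAMPLE)
for path, hits in counts.most_common():
    print(hits, path)
```

To run it on real logs, pass an open file (or several chained together) instead of SAMPLE.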
Yes, I had the same results before generating the goaccess report. I fail to understand the relationship with the current issue: the domain change when serving files from mirrors leads to wrong hyperlinks in the generated pages. What did I miss?
Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files served by a domain other than updates.jenkins.io due to redirections.
I wonder whether this is necessary. Seems like mirrors make sense for anything that's actual "content" (the stuff being downloaded), not glorified directory indexes.
I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if wget --recursive is used? What did I miss?
This came from https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2384923753 / https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2416879452
Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)
Oh I see, thanks for clarifying. We agree then on the result from the current production.
Let me compile my thoughts and analysis on the Cloudflare part:
@smerle33 proposed using a non-Cloudflare mirror as a safety net if things go south with CF. It would use a custom webserver (or two) that we manage, hosted in DigitalOcean (we have 4-5 TB of bandwidth for free and 15k credits valid until the end of the year), so we can check access logs in detail. Cost is OK for another brownout (assuming 2 to 3 TB of downloads for 24h), but we'll need to be careful if we add it permanently.
I met with @dduportal to move this topic along. Outcome:
* He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating `RedirectMatch` to `RewriteRule` in the uc2 `.htaccess` file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.
* I look into making URLs in `--download-links-directory` and `--latest-links-directory` absolute instead of relative, independent of the outcome of your task. This is implemented in https://github.com/jenkins-infra/update-center2/pull/810
Following this summary, I've opened the PR https://github.com/jenkins-infra/update-center2/pull/812 to focus on the second solution.
With the use of RewriteRule for the "fallback" rule (tested with success), we can add a rewrite condition to test the absence of a file: that would allow us to serve the /downloads/**/*html files from Apache since it's only a low volume, and would solve the HTTP/404 links without requiring absolute links.
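The decision such a file-absence condition makes can be modelled in a few lines of Python. This is not Apache configuration and not the rule from the PR; it only sketches the intended behavior. The route function, the 307 status, and the docroot layout are assumptions for this example; the mirror hostname is the one visible in the curl output in this thread.

```python
import tempfile
from pathlib import Path
from typing import Tuple

# Mirror hostname as seen in the thread; treated here as the fallback target.
MIRROR_BASE = "https://mirrors.updates.jenkins.io"


def route(docroot: Path, request_path: str) -> Tuple[int, str]:
    """Serve the file locally when it exists, else fall back to a mirror redirect.

    Illustration only: no path sanitizing, so not suitable for real serving.
    """
    candidate = docroot / request_path.lstrip("/")
    if candidate.is_file():
        return 200, str(candidate)
    # File absent locally: this is where the "fallback" rule would apply.
    return 307, MIRROR_BASE + request_path


with tempfile.TemporaryDirectory() as tmp:
    docroot = Path(tmp)
    # Hypothetical layout: a small generated index page kept on the web server.
    index = docroot / "download" / "plugins" / "htmlpublisher" / "index.html"
    index.parent.mkdir(parents=True)
    index.write_text("<html></html>")

    print(route(docroot, "/download/plugins/htmlpublisher/index.html"))
    print(route(docroot, "/download/plugins/htmlpublisher/1.36/htmlpublisher.hpi"))
```

The low-volume HTML pages stay on the canonical host, so their relative links keep resolving there, while the heavy artifacts still go to mirrors.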
Update:
https://github.com/jenkins-infra/update-center2/pull/812 has been tested and then merged with success. No more RedirectMatch on the pkg VM, and it keeps working as expected in the UC in Azure + mirrors.
This unblocks the issue here: opened https://github.com/jenkins-infra/update-center2/pull/813 to start serving the HTML files from download/*** from Apache (and the uctest.json 😉 ) instead of mirrors.
The ?uctest trick is also working as expected:
# Before the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 307
date: Tue, 22 Oct 2024 09:54:37 GMT
content-type: text/html; charset=iso-8859-1
location: https://mirrors.updates.jenkins.io/uctest.json?uctest
strict-transport-security: max-age=2592000; includeSubDomains; preload
# After the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 200
date: Tue, 22 Oct 2024 09:55:06 GMT
content-type: application/json
content-length: 3
last-modified: Tue, 22 Oct 2024 09:54:46 GMT
etag: "3-6250dc26ce6f7"
accept-ranges: bytes
strict-transport-security: max-age=2592000; includeSubDomains; preload
As described in https://github.com/jenkins-infra/helpdesk/issues/2649#issuecomment-2380569628, the HTML files generated by jenkins-infra/update-center2 are using relative links.
It used to be a good technique when dealing with both domains updates.jenkins-ci.org and updates.jenkins.io in the past, when they both served files.
But it is now an issue in the context of the new Update Center system which uses HTTP(S) mirrors to serve content to end users to:
Examples of pages: