jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

[Update Center] Fix HTTP/404 errors due to broken links in HTML listing pages and missing `?uctest` endpoint #4311

Closed by dduportal 3 weeks ago

dduportal commented 1 month ago

As described in https://github.com/jenkins-infra/helpdesk/issues/2649#issuecomment-2380569628, the HTML files generated by jenkins-infra/update_center2 are using relative links.

This was a good technique in the past, when both updates.jenkins-ci.org and updates.jenkins.io served files.

But it is now an issue in the context of the new Update Center system which uses HTTP(S) mirrors to serve content to end users to:

Examples of pages:

dduportal commented 1 month ago

Comment by @daniel-beck about the bandwidth, from a discussion we had together on this topic:

> @dduportal: I don't recall the exact amount of data transferred, but it was huge even for these tiny HTML files. We're talking about multiple Tb per month (globally, it's 50 Tb per month).

Did you just group by file extension, or also path? Because some of the "JSON" files also have an HTML file extension. So if you count https://updates.jenkins.io/update-center.json.html as HTML, that'll skew this a lot.

=> Important point as it means we could have to change the routing pattern.

Cloudflare Analytics shows that HTML was far behind in amount of requests, but we can't tell the different HTML files apart:

[Screenshot, 2024-09-28 at 10:26:45]

dduportal commented 1 month ago

Proposal: Given the context of the new Update Center, let's use absolute URL links.
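
To make the failure mode concrete, here is a minimal sketch of how a link gets resolved once a listing page has been served from a mirror. The `resolve_link` helper, the `mirror.example.org` hostname and the plugin paths are hypothetical, and the helper only covers absolute, root-relative and same-directory links (no `..` handling), so treat it as an illustration rather than a full URL resolver.

```shell
#!/bin/sh
# Hypothetical helper: resolve HREF against the URL the page was actually
# served from, roughly as a browser would for the simple cases covered here.
resolve_link() {
  base=$1
  href=$2
  case $href in
    http://*|https://*)
      # Absolute URL: the serving host does not matter.
      printf '%s\n' "$href"
      ;;
    /*)
      # Root-relative: keep only the scheme and host of the serving URL.
      scheme=${base%%://*}
      rest=${base#*://}
      host=${rest%%/*}
      printf '%s://%s%s\n' "$scheme" "$host" "$href"
      ;;
    *)
      # Relative: append to the directory of the serving URL.
      printf '%s/%s\n' "${base%/*}" "$href"
      ;;
  esac
}

# Page requested on updates.jenkins.io but served, after a redirect,
# by a mirror (assumed hostname).
served_from='https://mirror.example.org/download/plugins/example/'

# The relative link now targets the mirror host.
resolve_link "$served_from" 'latest/example.hpi'
# The absolute link still targets updates.jenkins.io.
resolve_link "$served_from" 'https://updates.jenkins.io/download/plugins/example/latest/example.hpi'
```

With a relative link, the target follows whichever host ended up serving the page, where the path may not exist; with an absolute link, the target is fixed regardless of the redirect.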

What are your thoughts on this @daniel-beck @timja @MarkEWaite ?

timja commented 1 month ago

Absolute URL makes sense to me.

daniel-beck commented 1 month ago

> Cloudflare Analytics shows that HTML was far behind in amount of requests

It's by far the most popular content type? How does that make any sense?

Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

It doesn't look like we understand enough what's going on here to base any decisions on.

dduportal commented 1 month ago

> Cloudflare Analytics shows that HTML was far behind in amount of requests
>
> It's by far the most popular content type? How does that make any sense?
>
> Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?
>
> It doesn't look like we understand enough what's going on here to base any decisions on.

We understand the mirroring mechanism, which is why I opened this issue. If we start selecting which files are mirrored and which are not, the architectural complexity will be a pain, as we will need to maintain a list of conditions. It is already nightmare-ish on get.jenkins.io, tbh.

Hence the question about the pros and cons of switching to absolute URLs, which is not mutually exclusive with analysing usage to understand it better.

The costs involved here are huge compared to the optimization, but it is mandatory to have a finer-grained understanding.

dduportal commented 1 month ago

Hello @daniel-beck 👋

> Cloudflare Analytics shows that HTML was far behind in amount of requests
>
> It's by far the most popular content type?

My apologies, I mistakenly used the word "behind". You are correct: I meant that HTML seems to be, by far, the most popular type of file downloaded, at least according to the Cloudflare dashboard during the 24-hour experiment.

Let me check if we see the same result on the current VM (analysing the logs from a few days ago).

> How does that make any sense?

I don't know. Let's compare with current behavior. That could also be "assumed" content type (including HTTP/404) as they are served as HTML as well.

> Is this just the tool installers via DownloadService or are we still downloading the update-center.json.html from Jenkins?

I ... don't know. We did not even know there was an HTML version of this one. Where should we look (except our access logs)?

dduportal commented 1 month ago

Initial check for the 09 October 2024 (both HTTP and HTTPS, both updates.jenkins-ci.org and updates.jenkins.io vhosts):

Report (generated with GoAccess from the "combined" access log):

report.html.zip

dduportal commented 1 month ago

@daniel-beck If we compare with Cloudflare numbers for 24 hours, which are only HTTP/2XX and HTTP/4XX (as the redirects are NOT sent to Cloudflare), it maps:

We still need to check the HTML/JSON split on the current production, but the high rate of HTTP/4XX clearly explains the ratio change during the brownout.

It also adds weight to using absolute URLs in the generated HTML files, to reduce the number of HTTP/4XX responses.

daniel-beck commented 1 month ago

Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

> 1/3 of JSON, and 2/3 of HTML

The problem with this view is that there are different kinds of HTML files on this domain.

The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Various update-center.json.html exist and are irrelevant for this topic. Half the tool installer files (e.g. in https://updates.jenkins.io/updates/ ) are HTML files and are irrelevant for this topic.
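
One way to keep these kinds of HTML files apart when counting requests is to classify by path as well as by extension. The sketch below assumes one request path per line on standard input; the `count_by_kind` name and the category labels are illustrative only, and the rules simply mirror the distinction described above (index pages under `/download/`, `*.json.html` files, everything else).

```shell
#!/bin/sh
# Hypothetical helper: read request paths (one per line) and count them per
# kind, so that *.json.html files are not lumped together with index pages.
count_by_kind() {
  awk '
    {
      path = $1
      sub(/\?.*/, "", path)   # drop any query string
      if (path ~ /^\/download\// && (path ~ /\/$/ || path ~ /\.html$/))
        kind = "download-index"
      else if (path ~ /\.json\.html$/)
        kind = "json-html"
      else if (path ~ /\.html$/)
        kind = "other-html"
      else
        kind = "other"
      count[kind]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k1,1nr -k2,2
}

# Example with a handful of sample paths.
printf '%s\n' \
  '/download/plugins/htmlpublisher/' \
  '/updates/hudson.tasks.Maven.MavenInstaller.json.html' \
  '/updates/hudson.tasks.Ant.AntInstaller.json.html' \
  '/update-center.json.html?id=default' \
  '/update-center.json' |
  count_by_kind
```

Grouping this way separates the directory listings this issue is about from the tool installer and update-center files that merely share the `.html` extension.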

dduportal commented 1 month ago

> Some of the data in the report makes no sense at all. Could you point me to the raw access logs? I want to check a few things.

The report was generated from the access logs on the pkg machine. I used the gzipped logs with the name pattern access_20241003_gz and got 4 files (unsecured and secured, for both hostnames).

Additions:

dduportal commented 1 month ago

> The ones that this issue is about (those in https://updates.jenkins.io/download/ ) are never used programmatically unless someone's wget --recursive goes brrrr.

Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files being served by a domain other than updates.jenkins.io due to redirections.

I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if wget --recursive is used?

What did I miss?

daniel-beck commented 1 month ago

As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.

The most popular URL that this issue is about is accessed just 24 times across the 4 logs:

  24 /download/plugins/htmlpublisher/

Compared to:

508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
334649 /updates/hudson.tools.JDKInstaller.json.html

Methodology (prove me wrong):

cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted
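
As a cross-check, the same count can be sketched as a single pipeline without intermediate files. This is only an approximation of the methodology above: it assumes the "combined" log format, reads the request line from the first double-quoted field instead of matching `GET ` anywhere in the line, and the `top_paths` name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "combined" access log lines on stdin and print
# "<count> <path>" for GET requests, most requested first.
top_paths() {
  awk -F'"' '
    {
      # $2 is the request line, e.g.: GET /path?query HTTP/1.1
      n = split($2, req, " ")
      if (n >= 2 && req[1] == "GET") {
        path = req[2]
        sub(/\?.*/, "", path)   # drop the query string
        count[path]++
      }
    }
    END { for (p in count) print count[p], p }
  ' | sort -k1,1nr -k2,2
}

# Usage (same input files as above):
#   cat updates.jenkins*/access*.log.20241003000000 | top_paths | head -n 5
```

Because only the request field is parsed, a referrer or user agent containing the string `GET ` does not affect the extracted path, so small differences from the numbers above are possible.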
dduportal commented 1 month ago

> As the log demonstrates, the HTML files discussed in this issue are completely irrelevant for traffic.
>
> The most popular URL that this issue is about is accessed just 24 times across the 4 logs:
>
>   24 /download/plugins/htmlpublisher/
>
> Compared to:
>
> 508498 /updates/hudson.tasks.Maven.MavenInstaller.json.html
> 387857 /updates/hudson.tasks.Ant.AntInstaller.json.html
> 339259 /updates/hudson.plugins.gradle.GradleInstaller.json.html
> 334649 /updates/hudson.tools.JDKInstaller.json.html
>
> Methodology (prove me wrong):
>
> cat updates.jenkins*/access*.log.20241003000000 | fgrep 'GET ' | sed 's|.*GET ||g' | sed -E 's|\?.*||g' | sed -E 's| .*||g' > access-combined.log.20241003000000
> sort access-combined.log.20241003000000 > access-combined.log.20241003000000.sorted
> uniq -c access-combined.log.20241003000000.sorted > access-combined.log.20241003000000.sorted.uniqed
> sort -nr access-combined.log.20241003000000.sorted.uniqed > access-combined.log.20241003000000.sorted.uniqed.sorted

Yes, I had the same results before generating the GoAccess report. I fail to understand the relationship with the current issue: the domain change when serving files from mirrors leads to wrong hyperlinks in the generated pages. What did I miss?

daniel-beck commented 1 month ago

> Yes, but we are losing track of the initial problem: using absolute URLs in the links of these specific HTML files, because the mirror system architecture ends up with these files being served by a domain other than updates.jenkins.io due to redirections.

I wonder whether this is necessary. Seems like mirrors make sense for anything that's actual "content" (the stuff being downloaded), not glorified directory indexes.

> I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if wget --recursive is used?
>
> What did I miss?

This came from https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2384923753 / https://github.com/jenkins-infra/helpdesk/issues/4311#issuecomment-2416879452

Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

dduportal commented 1 month ago

> I'm not sure I understand the relationship with the access logs or usage types: we clearly understand the problem for these specific files. Unless you want to check the usage for actions (blockers or optimizations) if wget --recursive is used? What did I miss?
>
> This came from #4311 (comment) / #4311 (comment)
>
> Basically the numbers you presented did not align with what I expected usage to look like. Looking at the actual logs shows reality lines up with my expectations :)

Oh I see, thanks for clarifying. We agree, then, on the result from the current production.

Let me compile my thoughts and analysis on the Cloudflare part:

@smerle33 proposed using a non-Cloudflare mirror as a safety net if things go south with CF. It would use a custom web server (or two) that we manage, hosted in DigitalOcean (we have 4-5 Tb of free bandwidth and 15k credits valid until the end of the year), so we can check the access logs in detail. The cost is OK for another brownout (assuming 2 to 3 Tb of downloads over 24h), but we'll need to be careful if we add it permanently.

daniel-beck commented 4 weeks ago

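I met with @dduportal to move this topic along. Outcome:

* He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating `RedirectMatch` to `RewriteRule` in the uc2 `.htaccess` file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.

* I look into making URLs in `--download-links-directory` and `--latest-links-directory` absolute instead of relative, independent of the outcome of your task. This is implemented in [Use absolute URLs for links from download indexes update-center2#810](https://github.com/jenkins-infra/update-center2/pull/810)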

dduportal commented 4 weeks ago

> I met with @dduportal to move this topic along. Outcome:
>
> * He's looking into continuing to serve download link/index files from updates.jenkins.io, probably involving migrating `RedirectMatch` to `RewriteRule` in the uc2 `.htaccess` file due to how weird Apache is, if that's reasonably straightforward to accomplish. This prevents users from linking/bookmarking to "implementation detail" hostnames.
>
> * I look into making URLs in `--download-links-directory` and `--latest-links-directory` absolute instead of relative, independent of the outcome of your task. This is implemented in [Use absolute URLs for links from download indexes update-center2#810](https://github.com/jenkins-infra/update-center2/pull/810)

Following this summary, I've opened the PR https://github.com/jenkins-infra/update-center2/pull/812 to focus on the second solution.

With the use of RewriteRule for the "fallback" rule (tested successfully), we can add a rewrite condition to test the absence of a file: that would allow us to serve the /downloads/**/*html files from Apache, since they are only low volume, and would solve the HTTP/404 links without requiring absolute links.

dduportal commented 3 weeks ago

Update:

dduportal commented 3 weeks ago

# Before the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 307 
date: Tue, 22 Oct 2024 09:54:37 GMT
content-type: text/html; charset=iso-8859-1
location: https://mirrors.updates.jenkins.io/uctest.json?uctest
strict-transport-security: max-age=2592000; includeSubDomains; preload

# After the change
$ curl -I "https://azure.updates.jenkins.io/foo/update-center.json?uctest"
HTTP/2 200 
date: Tue, 22 Oct 2024 09:55:06 GMT
content-type: application/json
content-length: 3
last-modified: Tue, 22 Oct 2024 09:54:46 GMT
etag: "3-6250dc26ce6f7"
accept-ranges: bytes
strict-transport-security: max-age=2592000; includeSubDomains; preload
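
To compare such responses at a glance, a small helper can reduce the output of `curl -I` to the status code and the redirect target. This is a sketch only: the `summarize_headers` name is hypothetical, it expects a single header block on standard input (so no `-L`), and it prints `-` when there is no `location` header.

```shell
#!/bin/sh
# Hypothetical helper: read one HTTP response header block on stdin and
# print "<status> <location or ->".
summarize_headers() {
  tr -d '\r' | awk '
    NR == 1 { status = $2 }                           # e.g. "HTTP/2 307"
    tolower($1) == "location:" { location = $2 }
    END { print status, (location == "" ? "-" : location) }
  '
}

# Usage (requires network access):
#   curl -sI "https://azure.updates.jenkins.io/foo/update-center.json?uctest" | summarize_headers
```

Applied to the two responses above, this would print `307 https://mirrors.updates.jenkins.io/uctest.json?uctest` before the change and `200 -` after it.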