Closed lemeurherve closed 1 year ago
At one point we enabled --delete on the publish, so it deleted any file that wasn't in the build.
I think the plugins Jenkinsfile does it.
Amongst several changes proposed by @zbynek this weekend (❤️), here is a PR to add --delete to blobxfer: https://github.com/jenkins-infra/jenkins.io/pull/6677
I've created a snapshot of the concerned Azure File Storage. Next steps:
[x] Activate daily backups (and check the associated costs in one and two weeks)
Comparison between the fileshare (what's on Fastly) and a clean build, obtained with diff -r -q /cleanbuild/ /fileshare/
Only in clean build:
All but one of those can be explained by case-insensitive storage and case-sensitive Git: renaming in Git doesn't move them in storage. This should not be a problem because both the lowercase and uppercase URLs work for them. The last one has a non-ASCII character in its filename but seems to have an identical name in storage and Git, so hopefully it is also OK.
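To double-check the "case only" explanation, something like the following could be run on the two listings. This is only a minimal sketch: the function name and the example paths are hypothetical, and it assumes both sides are available as plain lists of relative paths.

```python
def case_only_renames(only_in_build, fileshare_paths):
    """Map each path reported as 'only in clean build' to a fileshare
    path that differs from it by letter case only (if there is one)."""
    by_folded = {}
    for path in fileshare_paths:
        # Keep the first fileshare path seen for each case-folded form.
        by_folded.setdefault(path.casefold(), path)

    matches = {}
    for path in only_in_build:
        stored = by_folded.get(path.casefold())
        if stored is not None and stored != path:
            matches[path] = stored
    return matches


# Hypothetical paths, for illustration only.
print(case_only_renames(
    ["doc/Example/index.html", "doc/new-page/index.html"],
    ["doc/example/index.html", "doc/other/index.html"],
))
```

Anything left out of the returned mapping would be a path that is really missing from the fileshare rather than a case-only rename.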
Only in fileshare:
The removed pipeline step docs mostly belong to suspended plugins. The extension docs are probably missing because of https://github.com/jenkins-infra/helpdesk/issues/3746
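For anyone wanting to reproduce the "only in ..." lists without diff, here is a rough Python equivalent. It is a sketch, not what was actually run for this comparison: the function name is made up, it only reports entries present on one side (diff -r -q also reports files whose content differs), and it passes ignore=[] because filecmp skips a few directory names such as "tags" by default.

```python
import filecmp
import os


def only_in_each(left, right):
    """Return (only_in_left, only_in_right) as sorted relative paths."""
    only_left, only_right = [], []

    def walk(cmp, prefix):
        only_left.extend(os.path.join(prefix, n) for n in cmp.left_only)
        only_right.extend(os.path.join(prefix, n) for n in cmp.right_only)
        # Recurse only into directories that exist on both sides;
        # a directory missing on one side is reported once, by name.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    # ignore=[] disables filecmp's default ignore list.
    walk(filecmp.dircmp(left, right, ignore=[]), "")
    return sorted(only_left), sorted(only_right)
```

Usage would be along the lines of only_in_each("/cleanbuild", "/fileshare"), with the second list being the candidates for deletion.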
> The extension docs are probably missing because of https://github.com/jenkins-infra/helpdesk/issues/3746
Should #3746 be resolved first then, so we don't lose the corresponding doc?
@lemeurherve I don't think this should be blocked by the extension indexer. ATM those documents exist but are not linked from the index, so users are reporting them as missing anyway.
I agree. The unreferenced extension pages are still included in the Algolia search index (the credentials plugin, for example), but those pages are not always reachable even from the Algolia search.
@kmartens27 is doing a detailed review of the pages that will be deleted in order to be confident that there are no surprises "lurking" in that list. I'd like to allow him a little more time to check those items before this is implemented. A day or two should be sufficient.
They will only show up in the Algolia search if they are still linked from somewhere else on jenkins.io
I used ./lychee ./build/_site --config .lychee.toml -f markdown -o lychee.md to produce https://gist.github.com/halkeye/8f5f0e7d5a558b825845f98cab361e59
I'm not entirely sure why I did it. I think I thought there would be obvious glaring missing links to do with this, but it's still worth seeing.
I was able to review the list of items (minus extensions) and found that there are only 3 files that are being found/used on Jenkins.io
[ ] /doc/pipeline/steps/warnings - used in PHP solutions page
[ ] /images/evergreen - technically used in image:/images/evergreen/magician_256.png (BLOG POST FROM SEPTEMBER 2018)
[ ] /projects/evergreen - technically used in blog posts from 2019/2018 (it is using a longer URL that includes /projects/evergreen)
Everything else was not found/unavailable and should be okay to delete/remove completely.
Evergreen has been dead for years, I think it's safe to kill it. Pretty sure the magician is already on assets, so it could be updated to point there. Or just downloaded from jenkins.io and saved to the repo.
> [ ] /doc/pipeline/steps/warnings - used in PHP solutions page
That page only links https://www.jenkins.io/doc/pipeline/steps/warnings-ng/ which is OK, the old warnings plugin was removed.
Links to evergreen from blogposts were mostly replaced by links to GitHub: https://github.com/jenkins-infra/jenkins.io/pull/5315/files -- if you can still see some "live" links they should be replaced the same way.
Thanks a lot everyone for these reviews!
Since everything seems OK, I'm planning to create a backup and activate the --delete flag this Thursday, 19th of October.
Update: I'm late on this issue, as I've been busy with other tasks. I'd like to be sure the backup restore works well before merging https://github.com/jenkins-infra/jenkins.io/pull/6677. I'll keep this ticket updated.
I've restored a backup of jenkins.io in another file share, and ensured I could access its files and content.
I've triggered a fresh backup, and will activate the --delete option at around 13~14h UTC, announcements incoming.
cc @zbynek @kmartens27 @MarkEWaite
This status PR is so aptly numbered 😁
Output of the trusted.ci job:
14:54:35 2023-11-02 13:54:35.713 INFO - blobxfer start time: 2023-11-02 13:54:35.713622+00:00
14:54:35 2023-11-02 13:54:35.741 DEBUG - initializing 4 MD5 processes
14:54:35 2023-11-02 13:54:35.747 DEBUG - spawning 16 disk threads
14:54:35 2023-11-02 13:54:35.756 DEBUG - spawning 32 transfer threads
14:55:53 2023-11-02 13:55:52.997 DEBUG - 0 files 0.0000 MiB filesize, lmt_ge, or no overwrite skipped
14:55:53 2023-11-02 13:55:52.997 DEBUG - 8175 local files processed, waiting for upload completion of approx. 370.4003 MiB
14:55:53 2023-11-02 13:55:52.999 DEBUG - attempting to delete extraneous blobs/files from: prodjenkinsio;core.windows.net;jenkinsio
14:56:37 2023-11-02 13:56:37.766 INFO - deleted 0 extraneous blobs/files
14:56:37 2023-11-02 13:56:37.766 INFO - elapsed upload + verify time and throughput of 0.0098 GiB: 77.136 sec, 1.0424 Mbps (0.130 MiB/s)
14:56:37 2023-11-02 13:56:37.766 INFO - blobxfer end time: 2023-11-02 13:56:37.766739+00:00 (elapsed: 122.053 sec)
Looking at the example links to outdated pages from https://github.com/jenkins-infra/jenkins.io/issues/6676, I noticed that these links now return a 403 error:
documentation changes within repo are not being updated on jenkins.io live, at least within /doc/book/system-administration.
to observe, compare; https://github.com/jenkins-infra/jenkins.io/blob/master/content/doc/book/system-administration/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-nginx.adoc to; https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-nginx/
or https://github.com/jenkins-infra/jenkins.io/blob/master/content/doc/book/system-administration/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-haproxy.adoc to; https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-haproxy/
The correct corresponding "new" links are https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-nginx/ and https://www.jenkins.io/doc/book/system-administration/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-haproxy/
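Since both examples are pages that moved one level deeper while keeping the same last path segment, candidate redirects could probably be guessed automatically. A minimal sketch, assuming we have the list of removed URLs and the list of URLs in the current build (the function name is hypothetical, and any suggestion would still need a human review):

```python
def suggest_redirects(removed_urls, current_urls):
    """Suggest old -> new redirects when exactly one current URL
    ends with the same last path segment as a removed URL."""
    def slug(url):
        return url.rstrip("/").rsplit("/", 1)[-1]

    by_slug = {}
    for url in current_urls:
        by_slug.setdefault(slug(url), []).append(url)

    suggestions = {}
    for old in removed_urls:
        candidates = by_slug.get(slug(old), [])
        # Skip ambiguous slugs (e.g. several pages with the same name).
        if len(candidates) == 1:
            suggestions[old] = candidates[0]
    return suggestions


base = "/doc/book/system-administration"
print(suggest_redirects(
    [f"{base}/reverse-proxy-configuration-nginx/"],
    [f"{base}/reverse-proxy-configuration-with-jenkins/reverse-proxy-configuration-nginx/"],
))
```

Pages that were removed outright, with no same-named page elsewhere, simply get no suggestion.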
Same for the outdated links from:
The total number of files in the Azure File Share decreased from 21051 to 15856.
As the outdated links now all return a 403 error when pointing to (now empty) folders, I've disabled the trusted.ci.jenkins.io publication job in order to revert the --delete blobxfer option and restore a backup, which gives us more time to prepare the next steps:
Deleted files list:
404 would be better than 403, thanks for looking into that. Maybe it's enough to show the 404 page for the removed content and let visitors use the search instead of creating redirects, at least for most cases?
If I remember correctly it's the nginx config throwing an error with try_files on the directory, or Apache not having the index option enabled.
Or maybe not. It was a long time ago we tried anything. Notes probably forever lost on IRC.
> If I remember correctly it's the nginx config throwing an error with try_files on the directory
You remember correctly buddy: it matches our analysis \o/
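To spell out why the removed pages answer 403 rather than 404, here is a deliberately simplified model of how a static web server typically resolves a path when directory listing is disabled. It is not the actual jenkins.io nginx configuration, and the function name and example paths are assumptions for illustration only.

```python
def status_for(path, files, dirs, index_name="index.html"):
    """Simplified model: 200 for a file or a directory with an index,
    403 for a directory without one, 404 when nothing exists."""
    if path in files:
        return 200
    directory = path.rstrip("/")
    if directory in dirs:
        if f"{directory}/{index_name}" in files:
            return 200
        # The folder is still there but its index.html was deleted.
        return 403
    return 404


files = {"/doc/index.html"}
dirs = {"/doc", "/hangout"}
print(status_for("/doc/", files, dirs))      # index present
print(status_for("/hangout/", files, dirs))  # empty folder left behind
print(status_for("/hangout/", files, {"/doc"}))  # folder removed too
```

In this model, deleting the files but keeping the folder is what turns an expected 404 into a 403.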
I think we fixed it on stories and/or plugins if you want to grab from there.
I'll just delete empty folders to get rid of the 403 errors.
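For reference, the empty folder cleanup can be sketched as below. This assumes the file share is mounted at a local path, which is an assumption on my side; the function name is hypothetical and this is not necessarily how the folders were actually removed.

```python
import os


def remove_empty_dirs(root):
    """Remove empty directories under root, deepest first, and
    return the list of removed paths. The root itself is kept."""
    removed = []
    # topdown=False visits children before their parents, so a folder
    # that only contained empty folders becomes empty in turn.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Returning the removed paths makes it easy to log them, or to do a dry run first by commenting out the os.rmdir call (in which case parents of empty folders would not be reported).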
I propose to reactivate the --delete option, to clean up empty folders to get 404 errors instead of 403, and then to list the first block "blog, doc, etc." of pages in a jenkins.io issue so they can be evaluated and eventually redirected case by case later.
WDYT @zbynek @halkeye @kmartens27 @MarkEWaite @dduportal?
LGTM for me on this plan (don't forget to announce it!)
+1 from me as well
That makes sense to me, +1 for me!
Deletion activated again and empty folders deleted.
TODO: jenkins.io issue.
Update: we have 2 cases of "moved pages" caught by users:
=> the "configuration for reverse proxy" pages sound like they will need redirections.
Opened a pull request to take care of the /doc/ content listed in https://github.com/jenkins-infra/helpdesk/issues/3360#issuecomment-1791059171:
/blog/2020/05/18/read-only-jenkins-announcement: index.html
/blog/2020/06/04/digester-removal: index.html
/blog/2020/07/30/winsw-yaml-support-2: index.html
/blog/2022/01/06/gsoc-2022: index.html
/blog/2022/11/14/hacktoberfest-recap: index.html
/blog/2022/11/17/jenkins-election-candidates: index.html
/blog/2023/03/29/android-and-jenkins: index.html
/blog/2023/04/10/jenkins-newsletter: index.html
/blog/2023/07/16/third-party-repository-detection-probe: index.html
/blog/2023/09/09/incremental-build-detection-probe: index.html
Not sure it's worth adding a redirection for them.
/doc/developer: .htaccess: not needed anymore ✅
/doc/developer/404: index.html: not needed anymore ✅
/doc/developer/architecture/security: index.html: already redirects to https://www.jenkins.io/doc/developer/security/ ✅
/hangout: index.html: no need for redirection, was redirecting to a Google Hangout (video conference) ✅
/node/page: 104.html
/node/page: 105.html
/node/page: 106.html
/node/page: 107.html
/node/page: 108.html
/node/page: 109.html
/node/page: 110.html
/node/tags/essentials: index.html
/node/tags/freemium: index.html
/node/tags/jenkinsessentials: index.html
I don't know if something should be done for these.
/projects/blueocean/blueocean: index.html
/projects/blueocean/roadmap: data.json
/projects/evergreen: index.html
No longer actively maintained, no redirection needed?
/projects/gsoc/2019/project-ideas/discard-builds-step-plugin: index.html
/projects/gsoc/2020/project-ideas/artifactory-rest-plugin copy: index.html
/projects/gsoc/2020/project-ideas/jx-consolidate-addons-and-apps: index.html
/projects/gsoc/2023/project-ideas/JCasC-drift-detector: index.html
/projects/gsoc/2023/project-ideas/agent_reconnections_exponential_backoff: index.html
/projects/gsoc/2023/project-ideas/automating-plugin-buildmetadata-updates: index.html
/projects/gsoc/2023/project-ideas/displaying_plugin_health_scores: index.html
I think these project ideas don't need to be referenced anymore.
/security/gift: index.html: addressed in https://github.com/jenkins-infra/jenkins.io/pull/6817 ✅
/slides.yaml: moved to content/_data/indexpage/carousel.yml ✅
/supporters.yaml: moved to content/_data/indexpage/supporters.yml ✅
/user-handbook.pdf: removed in #2374 ✅
@kmartens27 @MarkEWaite could you please take a look at my comment above and tell me if a jenkins.io issue should be opened to add more redirections than the ones in https://github.com/jenkins-infra/jenkins.io/pull/6817?
If not, I'll close this issue once the PR is merged.
Closing as the work looks done, thanks y'all!
As noted in https://github.com/jenkins-infra/jenkins.io/pull/5940#discussion_r1091558214, when a page is removed from the repository, it's not removed from the source website www.origin.jenkins.io cached by Fastly.
We need to find a way to ensure these pages are removed and not indexed anymore.