dduportal commented 2 years ago

Service

ci.jenkins.io

Summary

As part of #2733, the subject of hosting a caching proxy for ci.jenkins.io builds (at least; maybe also for trusted.ci, release.ci and infra.ci) has been re-triggered in https://groups.google.com/g/jenkins-infra/c/laSsgPOH9qs.

This issue tracks the work related to deploying this service.

Why

Protect the CI builds of Jenkins contributors from slowness or outages of the external JFrog repository (https://repo.jenkins-ci.org)
- Note that this would only be partial protection in case of an outage: items not already cached would not be available at all
Decrease the outbound bandwidth of JFrog's repository: caching items would reduce it, which is part of the "fairness" expected with any open source sponsorship like this one

What

We want each build run by ci.jenkins.io (and eventually trusted.ci and release.ci) that involves Maven (and eventually Gradle) to use our caching proxy service instead of hitting repo.jenkins-ci.org directly.

As per https://maven.apache.org/settings.html#mirrors, we should be able to use the User-level settings.xml for Maven.
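As a rough illustration of what such a user-level settings.xml (read by Maven from `~/.m2/settings.xml`) could contain, here is a minimal sketch that renders one with a single mirror. The mirror id `artifact-caching-proxy` and the URL `https://repo.example.org/public/` are placeholders, not values decided in this issue, and `mirrorOf` defaults to `*` only for the sake of the example.

```python
import xml.etree.ElementTree as ET

SETTINGS_NS = "http://maven.apache.org/SETTINGS/1.0.0"


def render_settings(mirror_id: str, mirror_url: str, mirror_of: str = "*") -> str:
    """Return the text of a Maven settings.xml that declares a single mirror."""
    settings = ET.Element("settings", xmlns=SETTINGS_NS)
    mirror = ET.SubElement(ET.SubElement(settings, "mirrors"), "mirror")
    ET.SubElement(mirror, "id").text = mirror_id
    ET.SubElement(mirror, "url").text = mirror_url
    # mirrorOf selects which repository ids are redirected to this mirror.
    ET.SubElement(mirror, "mirrorOf").text = mirror_of
    ET.indent(settings)  # pretty-print; requires Python 3.9+
    return ET.tostring(settings, encoding="unicode")


if __name__ == "__main__":
    # Placeholder id and URL, for illustration only.
    print(render_settings("artifact-caching-proxy", "https://repo.example.org/public/"))
```

Generating the file from a template like this keeps the proxy URL a single parameter, which matters once there is one proxy per cloud or region.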

There are different methods to provide this settings.xml to the build:

Add it to the agent images in jenkins-infra/packer-images (assuming we have finished the "docker and VMs" tasks, ref. https://github.com/jenkins-infra/packer-images/issues/282 for Linux and https://github.com/jenkins-infra/packer-images/issues/285)
Use the Jenkins plugin "config-file-provider", which supports Pipeline: https://plugins.jenkins.io/config-file-provider/#plugin-content-using-the-configuration-files-in-jenkins-pipelines , so we could set it up in jenkins-infra/pipeline-library (easier to opt out of and faster to disable in case of outage)

The main challenge is to provide multiple caching proxies, one in each cloud region that we use. The rationale is that with a single proxy we would have to pay for cross-cloud and/or cross-region bandwidth, which we do not want. We could either:

Use agent template labels to specify which cloud provider and region the agent is running on. Not sure if config-file-provider can detect the agent labels. Or could the pipeline library code retrieve the current node's labels?
Host a mirror system, like get.jenkins.io, that would redirect requests to the closest proxy, or fall back to repo.jenkins-ci.org otherwise

Definition of Done

[x] Refresh jenkins-infra/docker-repo-proxy to have a versioned and up-to-date Docker image for the caching proxy
[x] Add a Helm chart in jenkins-infra/helm-charts to install repo-proxy
[x] Add a service in jenkins-infra/kubernetes-management to host the service
[x] Communicate about the new service and list what needs to be done around Maven configuration on ci.jenkins.io (ping @timja @jglick @MarkEWaite if you can help by refreshing our memory on the "ways" to use such a caching proxy in Maven builds for ci.j: controller config file, agent config, network config, pipeline library update, all of the above, other?)

How

See the associated PRs as they come.

jglick commented 2 years ago

First of all read #938 (reverted by #2047); I am not sure offhand which infra repo had the actual proxy configuration that you could use as a starting point. You would need to do a bit of digging. I recall it being nginx configured with a simple LRU cache of 2xx results, i.e., successful retrieval of release or *-SNAPSHOT artifacts or metadata XML files from public URLs. I suppose the K8s equivalent would be a StatefulSet with a cache volume.
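To make the "simple LRU cache of 2xx results" idea concrete, here is a minimal in-memory sketch of that policy. It is not the nginx configuration (nginx bounds its cache by size and inactivity rather than by entry count); the class name, the `fetch` callback and `max_entries` are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (HTTP status, body)


class LruProxyCache:
    """Cache successful upstream responses, evicting the least recently used."""

    def __init__(self, fetch: Callable[[str], Response], max_entries: int = 1000):
        self._fetch = fetch          # hypothetical callback that queries the upstream
        self._max = max_entries
        self._entries: "OrderedDict[str, Response]" = OrderedDict()

    def get(self, path: str) -> Response:
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            return self._entries[path]
        status, body = self._fetch(path)
        if 200 <= status < 300:              # only successful retrievals are stored
            self._entries[path] = (status, body)
            if len(self._entries) > self._max:
                self._entries.popitem(last=False)  # drop the least recently used
        return status, body
```

Errors such as 404 are passed through but never stored, so an artifact that is published later is not masked by a stale negative answer.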

> what are the "ways" to use such a proxy caching in maven builds

At a first approximation, revert https://github.com/jenkins-infra/pipeline-library/pull/135 + https://github.com/jenkins-infra/pipeline-library/pull/216 + https://github.com/jenkins-infra/pipeline-library/pull/219 (but keeping some positive things from those PRs, such as removal of obsolete JDK 7 support).

dduportal commented 2 years ago

Many thanks for the pointers @jglick !

We've started refreshing https://github.com/jenkins-infra/docker-repo-proxy (https://github.com/jenkins-infra/docker-repo-proxy/pull/5), which has the behavior you describe, so we are heading in the right direction! (I'm currently trying this with a local build of a plugin before trying to deploy to production.)

Sounds like, with the information you gave, we have enough to have a first version soon.

jglick commented 2 years ago

Oh https://github.com/jenkins-infra/docker-repo-proxy, I see.

If you get the service running, I can help draft a pipeline-library PR to use it. Just specify the URL. (Or would we have two URLs, one public via ingress and one cluster-internal for efficiency?) Not sure how we test such PRs prior to use; I guess you can override the version in a @Library annotation in some draft plugin PR.

timja commented 2 years ago

yeah you can access it via @Library('pipeline-library@refs/pull/number') or just push an origin branch

I was wondering if we would have a mirror per cloud? and then determine which cloud we were running on? to minimise bandwidth use but I guess that can be added on top

dduportal commented 2 years ago

Putting this on pause (not enough bandwidth in the team for now) + JFrog works again as expected.

jglick commented 2 years ago

Slow again today AFAICT.

lemeurherve commented 2 years ago

I don't know if it's related but for the record, there is a maintenance in progress: https://github.com/jenkins-infra/helpdesk/issues/2806#issuecomment-1060862749

https://status.jfrog.io/incidents/j4726008yccx

jglick commented 2 years ago

#2849

lemeurherve commented 2 years ago

Working on this, we realized we didn't need a custom nginx image as only its configuration was modified.

Consequently, I'm archiving jenkins-infra/docker-repo-proxy.

lemeurherve commented 2 years ago

~~Note: we'll probably use https://plugins.jenkins.io/config-file-provider/ in order to have specific settings.xml for each provider/region.~~ I'll create an env var with the provider/region at the agent initialization so we can use it in the shared pipeline to choose the correct settings.xml (Ex: repo.azure.jenkins.io, repo.aws.jenkins.io, repo.do.jenkins.io), like what was done before https://github.com/jenkins-infra/pipeline-library/pull/216/files
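A minimal sketch of that selection step: the three hostnames are the ones given above, while the environment variable name `ARTIFACT_CACHING_PROXY_PROVIDER` and the function name are hypothetical, chosen only for this example.

```python
import os
from typing import Mapping, Optional

# Hostnames taken from the comment above; the keys are assumed provider names.
PROXY_HOSTS = {
    "azure": "repo.azure.jenkins.io",
    "aws": "repo.aws.jenkins.io",
    "do": "repo.do.jenkins.io",
}


def proxy_host_for_agent(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the proxy hostname for the agent's provider, or None if unknown."""
    # ARTIFACT_CACHING_PROXY_PROVIDER is a hypothetical variable set at agent startup.
    provider = env.get("ARTIFACT_CACHING_PROXY_PROVIDER", "").strip().lower()
    return PROXY_HOSTS.get(provider)
```

Returning `None` for an unknown or missing provider leaves the caller free to skip the proxy and use the default Maven settings.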

lemeurherve commented 2 years ago

Regarding https://github.com/jenkins-infra/digitalocean/pull/63, I've manually added a do.jenkins.io NS record in jenkins.io DNS zone on Azure, pointing to DigitalOcean nameservers:

![image](https://user-images.githubusercontent.com/91831478/189347246-de85b5e2-7c1b-4c3c-b76d-8099be0614d8.png)

To be reimported as code with https://github.com/jenkins-infra/helpdesk/issues/2924 & https://github.com/jenkins-infra/helpdesk/issues/2981

lemeurherve commented 2 years ago

Initially we wanted to protect access to these proxies by adding basic authentication and IP whitelisting.

Unfortunately, whitelisting all the IPs used by the different agents will need some work, as currently (for example) every VM agent has its own IP.

We'll need to manage the network resources ourselves, with a non-default network setup, in order to control the public IPs.

For now I'll keep only the basic auth.

timja commented 2 years ago

is it a problem if people can access it? could be useful for debugging for developers.

dduportal commented 2 years ago

> is it a problem if people can access it? could be useful for debugging for developers.

Yes it is: we are paying for the outbound bandwidth and the storage of this new service, and it's not cheap (currently, without the proxy, we pay 2 to 3k€ per month for outbound bandwidth on AWS, and also on Azure).

Also, we must decrease the outbound bandwidth on repo.jenkins-ci.org (JFrog) by a factor of 5 for JFrog to continue sponsoring us: the main pain point is people using our infra as a free public mirror, which we do not intend it to be.

(PS : GitHub is drunk: I posted a comment and it edited your message 🤔 . I've edited it back)

timja commented 2 years ago

I mean is it a problem if people can access these mirrors for debugging? it's not like we would be advertising them.

dduportal commented 2 years ago

> I mean is it a problem if people can access these mirrors for debugging? it's not like we would be advertising them.

Yep, it is still a problem as the URLs are stored in public code so any bot or abusive user could use it as a "free" mirror. Adding a user/password auth seems a nice proposal by @lemeurherve : it avoids the "allow/deny list of IP", and we can debug if we have access to the Kubernetes cluster (as the auth is only for the ingress: a port-forward to the service would bypass the auth).
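On the client side, Maven picks up credentials for a mirror from a `<server>` entry in settings.xml whose `<id>` matches the mirror's `<id>`. A minimal sketch of adding such an entry follows; the id, username and password used in the usage line are placeholders, and on ci.jenkins.io the real values would presumably come from Jenkins credentials rather than literals.

```python
import xml.etree.ElementTree as ET


def add_server_credentials(settings: ET.Element, server_id: str,
                           username: str, password: str) -> ET.Element:
    """Append a <server> entry; server_id must equal the <id> of the mirror."""
    servers = settings.find("servers")
    if servers is None:
        servers = ET.SubElement(settings, "servers")
    server = ET.SubElement(servers, "server")
    ET.SubElement(server, "id").text = server_id
    ET.SubElement(server, "username").text = username
    ET.SubElement(server, "password").text = password
    return server


if __name__ == "__main__":
    root = ET.Element("settings")
    # Placeholder values, for illustration only.
    add_server_credentials(root, "artifact-caching-proxy", "ci-user", "change-me")
    print(ET.tostring(root, encoding="unicode"))
```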

lemeurherve commented 2 years ago

Created a CNAME record in jenkins.io DNS zone via Azure portal from repo.aws.jenkins.io to a0b8dc2af4aa74c9f8c27f542db939f1-1791101266.us-east-2.elb.amazonaws.com (the load balancer url I've obtained from the installation of ingress-nginx on cik8s)

dduportal commented 2 years ago

Status:

1 instance deployed in Azure (prodpublick8s)
1 instance in a new DO cluster (public facing)
WIP: 1 instance in a new EKS cluster (public facing)

Todo:

Once all deployed: first "test" with a set of selected plugins (opt-in in the pipeline library)
If tests are working as expected: switch to opt-out

dduportal commented 2 years ago

Additionally:

Let's mirror every repositories, not only repo.jenkins-ci.org's public and incrementals (settings.xml update => <mirrorOf>*</mirrorOf>)
- Increase volume size
Authentication: proxies require client-side authentication

jglick commented 2 years ago

> mirror every repositories

Test carefully, e.g. https://github.com/jenkinsci/stapler/pull/404#issuecomment-1238327013 / #3115

dduportal commented 2 years ago

> > mirror every repositories
>
> Test carefully, e.g. jenkinsci/stapler#404 (comment) / #3115

Thanks for the pointers, really useful for us to test!

Please note that, in its current state and first version, it would only be a "caching proxy": if you are able to make a given Maven project work, then it will be OK, as it is not repo.jenkins-ci.org itself but a layer in between that is able to reach the internet without going through repo.jenkins-ci.org and its mirroring.

dduportal commented 2 years ago

Status:

Instances deployed on AKS and DO
EKS work in progress (issue on the LoadBalancer part for the new)

lemeurherve commented 1 year ago

Now that every provider has a proxy configured and running, and the functionality has been integrated into the shared pipeline library as opt-in, I've opened PRs on the following plugins, as advised by @MarkEWaite, to check it in situ:

embeddable-build-status-plugin
nodelabelparameter-plugin
schedule-build-plugin
elastic-axis-plugin
implied-labels-plugin
platformlabeler-plugin
priority-sorter-plugin
testng-plugin-plugin

These PRs activate the use of an Artifact Caching Proxy, which caches the requests made to repo.jenkins-ci.org (sponsored by JFrog), in order to reduce our bandwidth consumption and be more resilient.

Apart from an additional build log entry showing the proxy provider configured for Maven depending on the agent location, there shouldn't be any change for the maintainers of these plugins.

There will be another PR to remove these changes as soon as the functionality has been approved and switched to opt-out.

badges:

[![embeddable-build-status-plugin](https://ci.jenkins.io/job/Plugins/job/embeddable-build-status-plugin/job/master/badge/icon?subject=embeddable-build-status-plugin)](https://ci.jenkins.io/job/Plugins/job/embeddable-build-status-plugin/job/master/) [![nodelabelparameter-plugin](https://ci.jenkins.io/job/Plugins/job/nodelabelparameter-plugin/job/master/badge/icon?subject=nodelabelparameter-plugin)](https://ci.jenkins.io/job/Plugins/job/nodelabelparameter-plugin/job/master/) [![schedule-build-plugin](https://ci.jenkins.io/job/Plugins/job/schedule-build-plugin/job/master/badge/icon?subject=schedule-build-plugin)](https://ci.jenkins.io/job/Plugins/job/schedule-build-plugin/job/master/) [![elastic-axis-plugin](https://ci.jenkins.io/job/Plugins/job/elastic-axis-plugin/job/master/badge/icon?subject=elastic-axis-plugin)](https://ci.jenkins.io/job/Plugins/job/elastic-axis-plugin/job/master/) [![implied-labels-plugin](https://ci.jenkins.io/job/Plugins/job/implied-labels-plugin/job/master/badge/icon?subject=implied-labels-plugin)](https://ci.jenkins.io/job/Plugins/job/implied-labels-plugin/job/master/) [![platformlabeler-plugin](https://ci.jenkins.io/job/Plugins/job/platformlabeler-plugin/job/master/badge/icon?subject=platformlabeler-plugin)](https://ci.jenkins.io/job/Plugins/job/platformlabeler-plugin/job/master/) [![priority-sorter-plugin](https://ci.jenkins.io/job/Plugins/job/priority-sorter-plugin/job/master/badge/icon?subject=priority-sorter-plugin)](https://ci.jenkins.io/job/Plugins/job/priority-sorter-plugin/job/master/) [![testng-plugin-plugin](https://ci.jenkins.io/job/Plugins/job/testng-plugin-plugin/job/master/badge/icon?subject=testng-plugin-plugin)](https://ci.jenkins.io/job/Plugins/job/testng-plugin-plugin/job/master/)

dduportal commented 1 year ago

Moving this issue in "infra-team-sync-next" because work is done on https://github.com/jenkins-infra/helpdesk/issues/2844 to solve https://github.com/jenkins-infra/helpdesk/issues/3221.

dduportal commented 1 year ago

ACP is now working as expected (see #3221 and #3302)
Improvement to the ACP configuration (Nginx) to improve diagnosis: https://github.com/jenkins-infra/helm-charts/pull/399
Enabling cache of SNAPSHOTs: https://github.com/jenkins-infra/helm-charts/pull/401
New publick8s cluster handling a new Azure ACP aimed to be the "default" one: #3351

Next steps (in order):

Define "as code" on ci.jenkins.io the global node property controlling the available providers
- Revert the "hotfix" I pushed today, as pointed out by @lemeurherve
- Write a runbook to operate the ACP
Focus on reviewing, deploying, testing and documenting (e.g. communicating to users) the PR https://github.com/jenkins-infra/pipeline-library/pull/552 to provide an "opt-out"
Check the numbers (metrics, build times and statuses) after a few days to decide whether or not to switch to "all plugins"
Fix the performance issues in the Azure ACP
- Diagnose whether they are related to having 2 replicas, to the network topology or to something else
- Fix the issue
- Set ci.jenkins.io to use it when spawning Azure agents
If using 2 replicas is NOT an issue, then do it for AWS and DO to avoid breaking end users' setups when operating clusters

dduportal commented 1 year ago

Update on today's team work by @lemeurherve, @smerle33 and me on the ACP tasks:

Azure ACP debugging topic. TL;DR: now it works © (but we don't know why it was so slow)
- Scaling down to 1 replica: https://github.com/jenkins-infra/kubernetes-management/pull/3534
- Next build of jenkins-infra-test-plugin with Azure VM + Azure ACP took more than 1h!!
- Deletion of the statefulset PVC (scaled to zero, pv/pvc deletion, scaled back to 1) to force-recreate an empty service: build was ~3min (and then below the minute).
- Scaled to 2 replicas: performance is now on the menu \o/ We can set the ACP configuration back to the nominal setup (Azure agents -> Azure ACP, AWS agents -> AWS ACP and DO agents -> DO ACP, AAAAAAAND Azure ACP as the default fallback)
Next steps:
- ACP setup to nominal configuration:
  - Revert hotfix puppet defaulting to aws - https://github.com/jenkins-infra/jenkins-infra/pull/2621
  - Revert pipeline-library defaulting to aws - https://github.com/jenkins-infra/pipeline-library/pull/573
  - PR to add the global env var $ARTIFACT_CACHING_AVAILABLE_PROVIDERS on ci.jenkins.io
  - PR to define/update the 3 settings.xml files on ci.jenkins.io to mirror everything - https://github.com/jenkins-infra/jenkins-infra/pull/2622
  - PR to set ACP to 2 replicas (for HA when operating clusters) everywhere - https://github.com/jenkins-infra/kubernetes-management/pull/3536#pullrequestreview-1275657364
- Preparing the "opt-in using ACP by default for all plugins":
  - PR on pipeline-library to check for "skip-artifact-caching-proxy" label - https://github.com/jenkins-infra/pipeline-library/pull/552
  - Write a runbook to operate ACP on ci.jenkins.io (how to switch on/off, how to enable/disable providers)
- Improvement for sustainability:
  - PR puppet + pipeline-library to add a new global env var defining the "default fallback" ACP (instead of having a raw value in the pipeline library, which led me to hotfixes)
    - https://github.com/jenkins-infra/pipeline-library/pull/574
  - PRs on the ACP helm-chart:
    - Fix the access log format (missing space)
    - Add Kubernetes anti-affinity since we target replicas
    - Add nodepool tolerations to avoid spawning ACP pods on system pools (due to smaller machines and resource constraints)
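The decision logic described in these steps (skip label, list of available providers, default fallback) could be sketched as below. This is not the pipeline-library implementation: the function name and the comma-separated format assumed for `$ARTIFACT_CACHING_AVAILABLE_PROVIDERS` are guesses made for illustration. Only the label name, the variable name and Azure as the default fallback come from this thread.

```python
from typing import Iterable, Optional

SKIP_LABEL = "skip-artifact-caching-proxy"  # label name from pipeline-library PR 552


def select_provider(agent_provider: str,
                    available_providers: str,
                    pr_labels: Iterable[str],
                    default_provider: str = "azure") -> Optional[str]:
    """Pick the ACP provider for a build, or None to bypass the proxy.

    available_providers mimics $ARTIFACT_CACHING_AVAILABLE_PROVIDERS and is
    ASSUMED to be a comma-separated list such as "azure,aws,do".
    """
    if SKIP_LABEL in set(pr_labels):
        return None  # explicit opt-out for this pull request
    available = [p.strip().lower() for p in available_providers.split(",") if p.strip()]
    if agent_provider.lower() in available:
        return agent_provider.lower()  # nominal case: agent and proxy in the same cloud
    if default_provider in available:
        return default_provider        # fallback when the local proxy is disabled
    return None                        # no usable proxy: go straight to the upstream
```

Keeping the list of enabled providers in one global variable means an operator can take a single proxy out of rotation, or disable the feature entirely with an empty value, without touching the library code.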

lemeurherve commented 1 year ago

Reopening to include more builds like jenkins, bom, etc. (List to be completed)

jglick commented 1 year ago

I also noticed in e.g. https://ci.jenkins.io/job/Core/job/jenkins/job/master/4585/flowGraphTable/ that Windows tests take more than twice as long as Linux tests, accounting for the majority of clock time. Using a repository cache should reduce the overhead time for a branch (time spent downloading deps & building rather than running tests), which would make it more practical to aggressively apply https://plugins.jenkins.io/parallel-test-executor/ (currently used only in acceptance-test-harness and kubernetes-plugin AFAICT). CC @jtnord @Vlatombe

lemeurherve commented 1 year ago

> > mirror every repositories
>
> Test carefully, e.g. jenkinsci/stapler#404 (comment) / #3115

We forgot about this comment, resulting in #3382, fixed by https://github.com/jenkins-infra/jenkins-infra/pull/2630 & https://github.com/jenkinsci/stapler/pull/441

Is there a way to identify similar cases of artifacts not published in Maven Central?

MarkEWaite commented 1 year ago

All the successful plugin bill of materials jobs run over the weekend were run with the artifact caching proxy disabled. When the artifact caching proxy is enabled for plugin bill of materials jobs, there is a high overall failure rate of the job. The failure often does not become visible until 90 minutes or more into the job.

Some examples are visible at:

https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-1888/ with log files copied to https://home.markwaite.net/~mwaite/artfact-caching-proxy-failures/PR-1888/ as requested by https://github.com/jenkinsci/bom/pull/1888
https://ci.jenkins.io/job/Tools/job/bom/job/master/1548/

basil commented 1 year ago

In particular, search for repo.do.jenkins.io from the bottom of each log upwards. You'll see a bunch of I/O errors, socket read timeouts, "Premature end of Content-Length delimited message body" errors, etc.
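A small sketch of that bottom-up search, for anyone going through several downloaded logs. The error patterns are assumptions derived from the messages listed above ("Read timed out" is the usual Java socket timeout message), not an exhaustive list, and the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns, based on the error kinds mentioned above.
ERROR_HINTS = re.compile(
    r"I/O error|Read timed out|Premature end of Content-Length",
    re.IGNORECASE,
)


def proxy_errors(lines: Iterable[str],
                 host: str = "repo.do.jenkins.io",
                 limit: int = 20) -> List[Tuple[int, str]]:
    """Return (line number, text) for error lines mentioning host, last first."""
    hits: List[Tuple[int, str]] = []
    for number, line in reversed(list(enumerate(lines, start=1))):
        if host in line and ERROR_HINTS.search(line):
            hits.append((number, line.rstrip("\n")))
            if len(hits) == limit:
                break
    return hits
```

Called on an open log file, for example `proxy_errors(open("build.log", encoding="utf-8", errors="replace"))`, it returns the most recent matches first, which is usually where the failing transfer is.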

jglick commented 1 year ago

MNG-714 would be helpful. I was hoping to use this trick but it did not seem to work. Created

```xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
    <mirrors>
        <!-- Send every repository through the proxy, except repo.jenkins-ci.org (declared below) -->
        <mirror>
            <id>proxy</id>
            <url>https://repo.do.jenkins.io/public/</url>
            <mirrorOf>*,!repo.jenkins-ci.org</mirrorOf>
        </mirror>
    </mirrors>
    <profiles>
        <!-- Direct access to repo.jenkins-ci.org, bypassing the mirror, as the intended fallback -->
        <profile>
            <id>fallback</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <repositories>
                <repository>
                    <id>repo.jenkins-ci.org</id>
                    <url>https://repo.jenkins-ci.org/public/</url>
                </repository>
            </repositories>
            <pluginRepositories>
                <pluginRepository>
                    <id>repo.jenkins-ci.org</id>
                    <url>https://repo.jenkins-ci.org/public/</url>
                </pluginRepository>
            </pluginRepositories>
        </profile>
    </profiles>
</settings>
```

where the mirror is expected to fail (since I am providing no authentication) and ran with

```sh
docker run --rm -ti --entrypoint bash -v /tmp/settings.xml:/usr/share/maven/conf/settings.xml maven:3-eclipse-temurin-17 -c 'git clone --depth 1 https://github.com/jenkinsci/build-token-root-plugin /src && cd /src && mvn -Pquick-build install'
```

but it fails immediately and does not fall back. additional-identities-plugin, which does not use an extension from Central, builds OK but does not use the proxy.
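For readers unfamiliar with the `mirrorOf` syntax used in that experiment, here is a simplified sketch of how such a pattern selects repositories. It only handles `*`, exact ids and `!id` exclusions (Maven also supports forms such as `external:*`, left out here), so treat it as an approximation of Maven's matching, not a reimplementation.

```python
def mirror_applies(mirror_of: str, repo_id: str) -> bool:
    """Simplified check of whether a mirrorOf pattern covers a repository id."""
    matched = False
    for pattern in (p.strip() for p in mirror_of.split(",")):
        if pattern == f"!{repo_id}":
            return False   # an explicit exclusion wins over the wildcard
        if pattern == repo_id:
            return True    # an explicit inclusion
        if pattern == "*":
            matched = True  # wildcard; keep scanning for a later exclusion
    return matched


if __name__ == "__main__":
    pattern = "*,!repo.jenkins-ci.org"
    for repo in ("central", "repo.jenkins-ci.org"):
        print(repo, mirror_applies(pattern, repo))
```

With `*,!repo.jenkins-ci.org`, requests for `central` are sent to the proxy while `repo.jenkins-ci.org` is contacted directly. That matches what is reported above: anything resolved through `central` depends on the (failing) proxy with no fallback, whereas a build whose artifacts all come from `repo.jenkins-ci.org` never touches it.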

lemeurherve commented 1 year ago

After clearing the cache of the DigitalOcean provider, a BOM build exclusively on DigitalOcean finished with success: https://ci.jenkins.io/job/Tools/job/bom/job/master/1564/

The fact that the BOM builds failed only on DO, each time with "Premature end of Content-Length delimited message body", and passed after clearing the cache on this provider makes me think the error came from corrupted cache data.

I'll look for a way to clear the cache for a specific artifact, or else reduce the cache retention, currently set to one month.

lemeurherve commented 1 year ago

@MarkEWaite @basil could you try your next BOM builds without the skip-artifact-caching-proxy label please?

jglick commented 1 year ago

Can try https://github.com/jenkinsci/bom/pull/1916

jglick commented 1 year ago

or https://github.com/jenkinsci/bom/pull/1907

jglick commented 1 year ago

FYI https://issues.apache.org/jira/browse/MNG-7708 (probably not relevant if the cache errors were persistent).

dduportal commented 1 year ago

Closing as the "unreliable" behavior (which is BOM-only) is tracked in https://github.com/jenkins-infra/helpdesk/issues/3481

jenkins-infra / helpdesk

(Re) Introduce an artifact caching proxy for ci.jenkins.io #2752
