jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

(Re) Introduce an artifact caching proxy for ci.jenkins.io #2752

Closed dduportal closed 1 year ago

dduportal commented 2 years ago

Service

ci.jenkins.io

Summary

As part of #2733, the subject of hosting a caching proxy for ci.jenkins.io builds (at least; maybe also for trusted.ci, release.ci and infra.ci) has been raised again in https://groups.google.com/g/jenkins-infra/c/laSsgPOH9qs.

This issue tracks the work related to deploying this service.

Why

What

We want each build run by ci.jenkins.io (and eventually trusted.ci and release.ci) that involves Maven (and eventually Gradle) to use our caching proxy service instead of directly hitting repo.jenkins-ci.org.

As per https://maven.apache.org/settings.html#mirrors, we should be able to use the User-level settings.xml for Maven.
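
A minimal sketch of what such a user-level settings.xml (conventionally `~/.m2/settings.xml`) could contain, rendered from Python for illustration. The mirror id, the example URL and the `mirrorOf` value of `*` are assumptions, not the configuration that was deployed.

```python
from xml.sax.saxutils import escape

# Minimal settings.xml with a single <mirror> entry.
SETTINGS_TEMPLATE = """<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <mirror>
      <id>{mirror_id}</id>
      <url>{mirror_url}</url>
      <mirrorOf>{mirror_of}</mirrorOf>
    </mirror>
  </mirrors>
</settings>
"""


def render_settings(mirror_id, mirror_url, mirror_of="*"):
    """Return the text of a settings.xml that routes repositories to one mirror."""
    return SETTINGS_TEMPLATE.format(
        mirror_id=escape(mirror_id),
        mirror_url=escape(mirror_url),
        mirror_of=escape(mirror_of),
    )


if __name__ == "__main__":
    # Hypothetical id and URL, for illustration only.
    print(render_settings("caching-proxy", "https://proxy.example.org/public/"))
```

Writing the result to the user-level settings file keeps the change out of each project's pom.xml, which is the point of using a mirror here.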

There are different methods to provide this settings.xml to the build:

The main challenge is to provide multiple caching proxies, one in each cloud region that we use. The rationale is that with only a single proxy, we would have to pay for cross-cloud and/or cross-region bandwidth, which we do not want. We could either:

Definition of Done

How

See the associated PRs as they come.

jglick commented 2 years ago

First of all read #938 (reverted by #2047); I am not sure offhand which infra repo had the actual proxy configuration that you could use as a starting point. You would need to do a bit of digging. I recall it being nginx configured with a simple LRU cache of 2xx results, i.e., successful retrieval of release or *-SNAPSHOT artifacts or metadata XML files from public URLs. I suppose the K8s equivalent would be a StatefulSet with a cache volume.
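
To make the caching policy concrete, here is a toy Python model of "an LRU cache of 2xx results". It only illustrates the policy; it is not the nginx configuration, and the class name and capacity are hypothetical.

```python
from collections import OrderedDict


class Lru2xxCache:
    """Toy model: keep at most `capacity` successful responses, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, url):
        """Return the cached body for `url`, or None on a cache miss."""
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url, status, body):
        """Store a response, but only if its status is 2xx."""
        if not 200 <= status < 300:
            return  # errors and redirects are never cached
        self._entries[url] = body
        self._entries.move_to_end(url)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry
```

Not caching non-2xx responses matters for Maven: a 404 for an artifact that has not been published yet should not be remembered once the artifact appears upstream.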

what are the "ways" to use such a proxy caching in maven builds

At a first approximation, revert https://github.com/jenkins-infra/pipeline-library/pull/135 + https://github.com/jenkins-infra/pipeline-library/pull/216 + https://github.com/jenkins-infra/pipeline-library/pull/219 (but keeping some positive things from those PRs, such as removal of obsolete JDK 7 support).

dduportal commented 2 years ago

Many thanks for the pointers @jglick !

We've started refreshing https://github.com/jenkins-infra/docker-repo-proxy (https://github.com/jenkins-infra/docker-repo-proxy/pull/5), which has the behavior you describe, so we are heading in the right direction! (I'm currently trying this with a local build of a plugin before trying to deploy to production.)

With the information you gave, it sounds like we have enough to have a first version soon.

jglick commented 2 years ago

Oh https://github.com/jenkins-infra/docker-repo-proxy, I see.

If you get the service running, I can help draft a pipeline-library PR to use it. Just specify the URL. (Or would we have two URLs, one public via ingress and one cluster-internal for efficiency?) Not sure how we test such PRs prior to use; I guess you can override the version in a @Library annotation in some draft plugin PR.

timja commented 2 years ago

yeah you can access it via @Library('pipeline-library@refs/pull/number') or just push an origin branch

I was wondering if we would have a mirror per cloud? and then determine which cloud we were running on? to minimise bandwidth use but I guess that can be added on top

dduportal commented 2 years ago

Putting this on pause (the team does not have enough bandwidth for now); also, JFrog works again as expected.

jglick commented 2 years ago

Slow again today AFAICT.

lemeurherve commented 2 years ago

I don't know if it's related but for the record, there is a maintenance in progress: https://github.com/jenkins-infra/helpdesk/issues/2806#issuecomment-1060862749

https://status.jfrog.io/incidents/j4726008yccx

jglick commented 2 years ago

#2849

lemeurherve commented 2 years ago

Working on this, we realized we didn't need a custom nginx image as only its configuration was modified.

Consequently, I'm archiving jenkins-infra/docker-repo-proxy.

lemeurherve commented 2 years ago

Note: we'll probably use https://plugins.jenkins.io/config-file-provider/ in order to have specific settings.xml for each provider/region. I'll create an env var with the provider/region at the agent initialization so we can use it in the shared pipeline to choose the correct settings.xml (Ex: repo.azure.jenkins.io, repo.aws.jenkins.io, repo.do.jenkins.io), like what was done before https://github.com/jenkins-infra/pipeline-library/pull/216/files
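
A minimal sketch of that selection step, written in Python for illustration (the shared pipeline itself is Groovy). The hostnames come from the examples above; the environment variable name and the "return None to skip the proxy" behavior are assumptions.

```python
import os

# Hostnames taken from the examples in the comment above.
PROXY_HOSTS = {
    "azure": "repo.azure.jenkins.io",
    "aws": "repo.aws.jenkins.io",
    "do": "repo.do.jenkins.io",
}

# Hypothetical variable name: the real one is whatever agent initialization sets.
PROVIDER_ENV_VAR = "AGENT_CLOUD_PROVIDER"


def proxy_host_for_agent(environ=None):
    """Return the proxy hostname for the agent's provider, or None if unknown."""
    if environ is None:
        environ = os.environ
    provider = environ.get(PROVIDER_ENV_VAR, "").strip().lower()
    # Unknown or missing provider: the caller is assumed to fall back to
    # the default settings.xml, i.e. no proxy.
    return PROXY_HOSTS.get(provider)
```

The returned hostname would then pick the matching managed settings.xml, one per provider.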

lemeurherve commented 2 years ago

Regarding https://github.com/jenkins-infra/digitalocean/pull/63, I've manually added a do.jenkins.io NS record in the jenkins.io DNS zone on Azure, pointing to the DigitalOcean nameservers:

![image](https://user-images.githubusercontent.com/91831478/189347246-de85b5e2-7c1b-4c3c-b76d-8099be0614d8.png)

To be reimported as code with https://github.com/jenkins-infra/helpdesk/issues/2924 & https://github.com/jenkins-infra/helpdesk/issues/2981

lemeurherve commented 2 years ago

We initially wanted to protect access to these proxies by adding basic authentication and IP whitelisting.

Unfortunately, whitelisting all the IPs used by the different agents will need some work, as currently (for example) every VM agent has its own IP.

We'll need to manage the network resources ourselves and use a non-default network setup in order to control the public IPs.

For now I'll keep only the basic auth.

timja commented 2 years ago

is it a problem if people can access it? could be useful for debugging for developers.

dduportal commented 2 years ago

is it a problem if people can access it? could be useful for debugging for developers.

Yes it is: we are paying for the outbound bandwidth and the storage of this new service, and it's not cheap (currently, without the proxy, outbound bandwidth costs us 2 to 3k€ per month on AWS, and also on Azure).

Also, we must decrease the outbound bandwidth on repo.jenkins-ci.org (JFrog) by a factor of 5 for JFrog to continue sponsoring us. The main pain point is people using our infra as a free public mirror, which we do not intend it to be.

(PS : GitHub is drunk: I posted a comment and it edited your message 🤔 . I've edited it back)

timja commented 2 years ago

I mean is it a problem if people can access these mirrors for debugging? it's not like we would be advertising them.

dduportal commented 2 years ago

I mean is it a problem if people can access these mirrors for debugging? it's not like we would be advertising them.

Yep, it is still a problem, as the URLs are stored in public code, so any bot or abusive user could use them as a "free" mirror. Adding user/password auth, as proposed by @lemeurherve, seems like a good approach: it avoids maintaining an allow/deny list of IPs, and we can still debug if we have access to the Kubernetes cluster (as the auth is only on the ingress, a port-forward to the service bypasses it).

lemeurherve commented 2 years ago

Created a CNAME record in jenkins.io DNS zone via Azure portal from repo.aws.jenkins.io to a0b8dc2af4aa74c9f8c27f542db939f1-1791101266.us-east-2.elb.amazonaws.com (the load balancer url I've obtained from the installation of ingress-nginx on cik8s)

dduportal commented 2 years ago

Status:

Todo:

dduportal commented 2 years ago

Additionally:

jglick commented 2 years ago

mirror every repositories

Test carefully, e.g. https://github.com/jenkinsci/stapler/pull/404#issuecomment-1238327013 / #3115

dduportal commented 2 years ago

mirror every repositories

Test carefully, e.g. jenkinsci/stapler#404 (comment) / #3115

Thanks for the pointers, really useful for us to test!

Please note that, in its current state and first version, it will only be a "caching proxy": if a given Maven project builds today, it will keep working, since the proxy is not repo.jenkins-ci.org itself but a layer in between that can reach the internet without going through repo.jenkins-ci.org and its mirroring.

dduportal commented 2 years ago

Status:

lemeurherve commented 1 year ago

Now that every provider has a proxy configured and running, and that the functionality has been integrated into the shared pipeline library as opt-in, I've opened PRs on the following plugins suggested by @MarkEWaite to check it in situ:

These PRs enable an Artifact Caching Proxy that caches requests made to repo.jenkins-ci.org (sponsored by JFrog), in order to reduce our bandwidth consumption and be more resilient.

Apart from an additional build log entry with the proxy provider configured for Maven depending on the agent location, there shouldn't be any change for any maintainer of these plugins.

Another PR will remove these changes once the functionality has been approved and switched to opt-out.

badges: [![embeddable-build-status-plugin](https://ci.jenkins.io/job/Plugins/job/embeddable-build-status-plugin/job/master/badge/icon?subject=embeddable-build-status-plugin)](https://ci.jenkins.io/job/Plugins/job/embeddable-build-status-plugin/job/master/) [![nodelabelparameter-plugin](https://ci.jenkins.io/job/Plugins/job/nodelabelparameter-plugin/job/master/badge/icon?subject=nodelabelparameter-plugin)](https://ci.jenkins.io/job/Plugins/job/nodelabelparameter-plugin/job/master/) [![schedule-build-plugin](https://ci.jenkins.io/job/Plugins/job/schedule-build-plugin/job/master/badge/icon?subject=schedule-build-plugin)](https://ci.jenkins.io/job/Plugins/job/schedule-build-plugin/job/master/) [![elastic-axis-plugin](https://ci.jenkins.io/job/Plugins/job/elastic-axis-plugin/job/master/badge/icon?subject=elastic-axis-plugin)](https://ci.jenkins.io/job/Plugins/job/elastic-axis-plugin/job/master/) [![implied-labels-plugin](https://ci.jenkins.io/job/Plugins/job/implied-labels-plugin/job/master/badge/icon?subject=implied-labels-plugin)](https://ci.jenkins.io/job/Plugins/job/implied-labels-plugin/job/master/) [![platformlabeler-plugin](https://ci.jenkins.io/job/Plugins/job/platformlabeler-plugin/job/master/badge/icon?subject=platformlabeler-plugin)](https://ci.jenkins.io/job/Plugins/job/platformlabeler-plugin/job/master/) [![priority-sorter-plugin](https://ci.jenkins.io/job/Plugins/job/priority-sorter-plugin/job/master/badge/icon?subject=priority-sorter-plugin)](https://ci.jenkins.io/job/Plugins/job/priority-sorter-plugin/job/master/) [![testng-plugin-plugin](https://ci.jenkins.io/job/Plugins/job/testng-plugin-plugin/job/master/badge/icon?subject=testng-plugin-plugin)](https://ci.jenkins.io/job/Plugins/job/testng-plugin-plugin/job/master/)
dduportal commented 1 year ago

Related:

dduportal commented 1 year ago

Moving this issue in "infra-team-sync-next" because work is done on https://github.com/jenkins-infra/helpdesk/issues/2844 to solve https://github.com/jenkins-infra/helpdesk/issues/3221.

dduportal commented 1 year ago

Next steps (in order):

dduportal commented 1 year ago

Update from today's team work by @lemeurherve, @smerle33 and me on the ACP tasks:

lemeurherve commented 1 year ago

Reopening to include more builds like jenkins, bom, etc. (List to be completed)

jglick commented 1 year ago

I also noticed in e.g. https://ci.jenkins.io/job/Core/job/jenkins/job/master/4585/flowGraphTable/ that Windows tests take more than twice as long as Linux tests, accounting for the majority of clock time. Using a repository cache should reduce the overhead time for a branch (time spent downloading deps & building rather than running tests), which would make it more practical to aggressively apply https://plugins.jenkins.io/parallel-test-executor/ (currently used only in acceptance-test-harness and kubernetes-plugin AFAICT). CC @jtnord @Vlatombe

lemeurherve commented 1 year ago

mirror every repositories

Test carefully, e.g. jenkinsci/stapler#404 (comment) / #3115

We forgot about this comment, resulting in #3382, fixed by https://github.com/jenkins-infra/jenkins-infra/pull/2630 & https://github.com/jenkinsci/stapler/pull/441

Is there a way to identify similar cases of artifacts not published in Maven Central?

MarkEWaite commented 1 year ago

All the successful plugin bill of materials jobs run over the weekend were run with the artifact caching proxy disabled. When the artifact caching proxy is enabled for plugin bill of materials jobs, there is a high overall failure rate of the job. The failure often does not become visible until 90 minutes or more into the job.

Some examples are visible at:

basil commented 1 year ago

In particular, search for repo.do.jenkins.io from the bottom of each log upwards. You'll see a bunch of I/O errors, socket read timeouts, "Premature end of Content-Length delimited message body" errors, etc.

jglick commented 1 year ago

MNG-714 would be helpful. I was hoping to use this trick but it did not seem to work. Created

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
    <mirrors>
        <mirror>
            <id>proxy</id>
            <url>https://repo.do.jenkins.io/public/</url>
            <mirrorOf>*,!repo.jenkins-ci.org</mirrorOf>
        </mirror>
    </mirrors>
    <profiles>
        <profile>
            <id>fallback</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <repositories>
                <repository>
                    <id>repo.jenkins-ci.org</id>
                    <url>https://repo.jenkins-ci.org/public/</url>
                </repository>
            </repositories>
            <pluginRepositories>
                <pluginRepository>
                    <id>repo.jenkins-ci.org</id>
                    <url>https://repo.jenkins-ci.org/public/</url>
                </pluginRepository>
            </pluginRepositories>
        </profile>
    </profiles>
</settings>

where the mirror is expected to fail (since I am providing no authentication) and ran with

docker run --rm -ti --entrypoint bash -v /tmp/settings.xml:/usr/share/maven/conf/settings.xml maven:3-eclipse-temurin-17 -c 'git clone --depth 1 https://github.com/jenkinsci/build-token-root-plugin /src && cd /src && mvn -Pquick-build install'

but it fails immediately and does not fall back. additional-identities-plugin, which does not use an extension from Central, builds OK but does not use the proxy.
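
For reference, a simplified Python model of how a `<mirrorOf>` value such as `*,!repo.jenkins-ci.org` is matched against repository ids. It covers only `*`, exact ids and `!id` exclusions (not `external:*` and similar patterns), and the function name is hypothetical. A repository that matches is replaced by the mirror rather than kept as a fallback, which is consistent with the failure observed above.

```python
def is_mirrored(repo_id, mirror_of):
    """Return True if `repo_id` is routed to the mirror for this <mirrorOf> value.

    Simplified model: supports `*`, exact repository ids and `!id` exclusions.
    """
    matched = False
    for pattern in (p.strip() for p in mirror_of.split(",")):
        if pattern.startswith("!") and pattern[1:] == repo_id:
            return False  # an explicit exclusion overrides the wildcard
        if pattern == repo_id:
            return True  # an explicit id is mirrored
        if pattern == "*":
            matched = True  # keep scanning: a later exclusion may still apply
    return matched
```

With the settings above, `central` is routed to the mirror while `repo.jenkins-ci.org` is contacted directly.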

lemeurherve commented 1 year ago

After clearing the cache of the DigitalOcean provider, a BOM build exclusively on DigitalOcean finished with success: https://ci.jenkins.io/job/Tools/job/bom/job/master/1564/

The fact that the BOM builds failed only on DO, with "Premature end of Content-Length delimited message body" each time, and passed after clearing the cache on this provider makes me think the error came from corrupted cache data.

I'll look into either finding a way to clear the cache for a specific artifact or reducing the cache retention, currently set to one month.
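
One possible way to target a single artifact, sketched in Python under several assumptions: that the proxy is nginx with an on-disk cache, that the cache directory uses `levels=1:2`, and that the cache key is nginx's default (`$scheme$proxy_host$request_uri`). None of these are confirmed here. nginx names each cache file after the MD5 of its cache key, with the level directories taken from the end of that hash, so the file to delete can be computed.

```python
import hashlib
from pathlib import PurePosixPath


def cache_file_path(cache_dir, cache_key, levels=(1, 2)):
    """Return the path nginx would use to store the entry for `cache_key`.

    `levels` mirrors the levels= parameter of proxy_cache_path; (1, 2) is an
    assumption, not the value used on the actual proxies.
    """
    digest = hashlib.md5(cache_key.encode("utf-8")).hexdigest()
    path = PurePosixPath(cache_dir)
    end = len(digest)
    for width in levels:
        # Each level directory is cut from the end of the hash, right to left.
        path /= digest[end - width:end]
        end -= width
    return str(path / digest)


if __name__ == "__main__":
    # Hypothetical cache directory and key (scheme + upstream host + request URI).
    key = "httpsrepo.jenkins-ci.org/public/org/example/foo/1.0/foo-1.0.pom"
    print(cache_file_path("/var/cache/nginx", key))
```

Removing that one file would drop the artifact from the cache without clearing everything else on the provider.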

lemeurherve commented 1 year ago

@MarkEWaite @basil could you try your next BOM builds without the skip-artifact-caching-proxy label please?

jglick commented 1 year ago

Can try https://github.com/jenkinsci/bom/pull/1916

jglick commented 1 year ago

or https://github.com/jenkinsci/bom/pull/1907

jglick commented 1 year ago

FYI https://issues.apache.org/jira/browse/MNG-7708 (probably not relevant if the cache errors were persistent).

dduportal commented 1 year ago

Closing as the "unreliable" behavior (which is BOM-only) is tracked in https://github.com/jenkins-infra/helpdesk/issues/3481