odpi / ci-management


Publish egeria docker images to dockerhub using odpi organization #44

Closed planetf1 closed 4 years ago

planetf1 commented 5 years ago

As part of the egeria build process we create docker images for a variety of components. For example the core egeria code is built at https://github.com/odpi/egeria/tree/master/open-metadata-resources/open-metadata-deployment/docker/egeria

Currently these are either built locally, or, for general consumption, I have published them using my Docker repo 'planetf1'. For example see https://cloud.docker.com/u/planetf1/repository/docker/planetf1/egeria-egeriavdc

We need to ensure that anyone can pull these images for use - including as part of deploying our helm chart such as https://github.com/odpi/egeria/tree/master/open-metadata-resources/open-metadata-deployment/charts/vdc

There is already an odpi organization on dockerhub - see https://hub.docker.com/r/odpi/ci

Can we make use of this as part of our build process? Since the images take a while to build, this is an optional extension to the regular build process:

mvn clean install -Ddocker -Ddocker.repo=odpi

should suffice
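For illustration, publishing under the odpi organization effectively amounts to retagging and pushing the locally built image along these lines (the image name and version here are hypothetical, based on the examples above, not the actual build output):

```shell
# Hypothetical coordinates; the real image names come from the egeria build.
ORG="odpi"
IMAGE="egeria-egeriavdc"
VERSION="latest"
TAG="${ORG}/${IMAGE}:${VERSION}"
echo "target image: ${TAG}"
# docker tag "${IMAGE}:${VERSION}" "${TAG}"   # retag the locally built image
# docker push "${TAG}"                        # push to the odpi org on Dockerhub
```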

However, it would be interesting to see what 'best practice' is in this space - for example, how do other Linux Foundation projects tackle this?

Another approach is to configure dockerhub itself to perform the image build for us.

I should also note that we are posting egeria code, and that of other projects we use, though all are Apache 2.0 licensed.

planetf1 commented 5 years ago

We noticed today that an egeria PR is kicking off the docker build and making it a blocking check. It failed - not due to a coding error, but seemingly due to infrastructure problems.

See https://github.com/odpi/egeria/pull/1100

Can we just run the docker build after master merge for now?

I think for a PR check we need to refactor the docker build somewhat - for example by not publishing to docker hub, but to a staging repo. Also whilst most changes in egeria itself would require the egeria docker build to be done, defining the conditions under which the other builds are needed is a little harder (and needs more thought about the overall pipeline)

The current structure of the docker builds means

schannamallu commented 5 years ago

@planetf1

egeria PR is kicking off the docker build and making it a blocking check. It failed - not due to a coding error, but seemingly due to infrastructure problems.

Yes, I agree with you.

Can we just run the docker build after master merge for now?

We can move the job from verify to merge, but there is a slight problem: during the build it calls a script, 'maven-patch-release.sh', which always makes the build fail. Also, the merge job uses the stage profile to transfer builds to Nexus; in our case that is DockerHub.

So we thought 'verify' was the correct one to use; unfortunately the builds are failing with node connection timeouts.

I think for a PR check we need to refactor the docker build somewhat - for example by not publishing to docker hub, but to a staging repo. Also whilst most changes in egeria itself would require the egeria docker build to be done, defining the conditions under which the other builds are needed is a little harder (and needs more thought about the overall pipeline)

We will do that; let me check the Nexus repo connectivity for storing docker images.

schannamallu commented 5 years ago

@planetf1 I can see almost 20 docker builds in the queue. This is creating unwanted load/performance issues. Can I delete the docker-related jobs for now and recreate them after clearing issues like node failures and long run times, or can I disable all docker projects until we have a solution? We should come up with a solution; I will wait for your reply.

schannamallu commented 5 years ago

@planetf1, as per your comment above, we observed that the docker verify job is triggering for each PR, which is flooding Jenkins with verify jobs. I have moved the docker jobs from the verify job to stage jobs, so they will not trigger for every PR.

#107

So can you confirm this works for you? If you agree, I will merge this PR. CC: @jwagantall

planetf1 commented 5 years ago

Yes -- moving out of the verify process is essential, as it is impacting members of our team working on other PRs. That will be an immediate relief; then we can review any further fine tuning. So please can you merge the ci-management PR?

schannamallu commented 5 years ago

@planetf1 Okay, I will merge #107

schannamallu commented 5 years ago

@planetf1, Merged. Can you check once? I observe a couple of jobs are failing. Can you check whether it is because of a code issue?

planetf1 commented 5 years ago

https://jenkins.odpi.org/view/egeria-docker/job/egeria-ranger-maven-docker-stage-master/5/ - https://jenkins.odpi.org/view/egeria-docker/job/egeria-ranger-maven-docker-stage-master/5/

Both of the above failed on apparent infrastructure issues - it looks like a timeout or resource issue, as the builder just went away.

https://jenkins.odpi.org/view/egeria-docker/job/egeria-apache-atlas-maven-docker-stage-master/5/console

This failed because it couldn't resolve www.apache.org to retrieve artifacts needed for the build. Since this is such a well-known site, this also very much looks like an infrastructure issue.

So I think all the problems ARE NOT issues with our code, but rather the infra?

schannamallu commented 5 years ago

@planetf1 I will check with the infra team and will update you. Thank you.

schannamallu commented 5 years ago

@planetf1, today I am observing that all the docker jobs are failing with the Nexus access error below. Is it related to some Nexus problem, or to new changes in the code?

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO]
[INFO] -------------< org.odpi.egeria:open-metadata-docker-gaian >-------------
[INFO] Building Docker image - Gaian 1.1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.236 s
[INFO] Finished at: 2019-06-17T16:55:19Z
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-enforcer-plugin:jar:3.0.0-M1: Could not transfer artifact org.apache.maven.plugins:maven-enforcer-plugin:pom:3.0.0-M1 from/to releases (${env.NEXUS_URL}/content/repositories/releases/): Cannot access ${env.NEXUS_URL}/content/repositories/releases/ with type default using the available connector factories: BasicRepositoryConnectorFactory: Cannot access ${env.NEXUS_URL}/content/repositories/releases/ using the registered transporter factories: WagonTransporterFactory: Unsupported transport protocol -> [Help 1]
org.apache.maven.plugin.PluginResolutionException: Plugin org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-enforcer-plugin:jar:3.0.0-M1
    at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve (DefaultPluginDependenciesResolver.java:117)
    at org.apache.maven.plugin.internal.DefaultMavenPluginManager.getPluginDescriptor (DefaultMavenPluginManager.java:182)
    at org.apache.maven.plugin.internal.DefaultMavenPluginManager.getMojoDescriptor (DefaultMavenPluginManager.java:286)
    at org.apache.maven.plugin.DefaultBuildPluginManager.getMojoDescriptor (DefaultBuildPluginManager.java:244)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleMappingDelegate.calculateLifecycleMappings (DefaultLifecycleMappingDelegate.java:116)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateLifecycleMappings (DefaultLifecycleExecutionPlanCalculator.java:265)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateMojoExecutions (DefaultLifecycleExecutionPlanCalculator.java:217)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateExecutionPlan (DefaultLifecycleExecutionPlanCalculator.java:126)
    at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateExecutionPlan (DefaultLifecycleExecutionPlanCalculator.java:144)
    at org.apache.maven.lifecycle.internal.builder.BuilderCommon.resolveBuildPlan (BuilderCommon.java:97)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:111)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)

planetf1 commented 5 years ago

Those errors look as if there is something wrong with the scripts/variables (note the unexpanded ${env.NEXUS_URL} in the error) - the maven-enforcer-plugin has been there a long time and is used very early on in the process.

We have been making some pom changes, but any changes there should affect the regular egeria builds too - and those seem fine.

schannamallu commented 5 years ago

@planetf1, I executed the build manually on a cloud node and got the timeout error. As checked internally, we are pulling snapshots from upstream - this could be the cause of the timeout issue. Do we have any option not to pull the snapshots from upstream in the project? That may be helpful in this case.

schannamallu commented 5 years ago

@planetf1, please let me know if there is any update.

planetf1 commented 5 years ago

I am able to build locally on a macOS laptop, and on a cloud-hosted Ubuntu system, though it takes a long time (about an hour) for Ranger, a little less for Atlas (20 minutes?).

Currently we pull Ranger from master (GitHub) - this could likely be defaulted to the last release instead, but I don't see that this would substantially change the build time or dependencies. The Ranger team define which specific versions of dependencies they use, but in general they aren't snapshots.

Can you clarify what you mean by pulling snapshots from upstream?

Also I note there is an effort to split the build to try and improve the process - though I suspect the base build may still be problematic. see https://github.com/odpi/egeria/pull/1178

planetf1 commented 5 years ago

Have also asked for recommendations from ranger team: https://lists.apache.org/thread.html/ae514723cb9c47ac00d55d07470dedd0c3813f80bfc6d9fbd0fa98b6@%3Cdev.ranger.apache.org%3E

schannamallu commented 5 years ago

I am able to build locally on a macOS laptop, and on a cloud-hosted Ubuntu system, though it takes a long time (about an hour) for Ranger, a little less for Atlas (20 minutes?).

Currently we pull Ranger from master (GitHub) - this could likely be defaulted to the last release instead, but I don't see that this would substantially change the build time or dependencies. The Ranger team define which specific versions of dependencies they use, but in general they aren't snapshots.

Can you clarify what you mean by pulling snapshots from upstream?

@planetf1 By "upstream" I mean from Nexus. We thought the build time might decrease if we placed a settings.xml in the Maven configuration. I believe the Atlas project already has such settings, but its build time is the same as usual.

Also I note there is an effort to split the build to try and improve the process - though I suspect the base build may still be problematic. see odpi/egeria#1178

Yes, the split process will make our job simpler, and it seems the build also won't take as long.

planetf1 commented 5 years ago

I think it would be an ok optimization to use the nexus repo for the maven artifacts (just as many orgs will use their own internal maven repo as a cache). That would just be additional settings for the build process though with the git source just referring to default repos.

Fundamentally though the atlas build takes a long time. it's hard to get away from that. As @cmgrote works with atlas more he may identify a better approach - maybe we can cut down on what modules to build.

schannamallu commented 5 years ago

@planetf1 I solved the node termination problem. It's a big relief for me.

I ran a Ranger job for testing. https://jenkins.odpi.org/view/egeria-docker/job/egeria-ranger-maven-docker-stage-master/30/

What I observe is:

  1. The build consumes a couple of hours - in fact, after more than 8 hours the build was still running, which is odd, so I stopped it.

I think it's better to add a settings.xml to the Maven invocation in the Dockerfile for ranger and in Docker.build for apache-atlas:

---settings.xml---

<?xml version="1.0" encoding="UTF-8"?>
<!-- SPDX-License-Identifier: Apache-2.0 -->
<!-- Copyright Contributors to the ODPi Egeria project. -->
<settings>
 <profiles>
   <profile>
     <id>odpi</id>
     <repositories>
       <repository>
         <id>odpi-snapshots</id>
         <name>ODPi Snapshots</name>
         <url>https://nexus.odpi.org/content/groups/public/</url>
       </repository>
     </repositories>
   </profile>
 </profiles>

 <activeProfiles>
   <activeProfile>odpi</activeProfile>
 </activeProfiles>
</settings>

Please let me know of any possibilities to speed up the builds. cc: @cmgrote
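A minimal sketch of how that settings.xml could be wired into the image build (the file locations here are assumptions; the actual Dockerfiles in the egeria repo may lay things out differently):

```dockerfile
# Copy the settings.xml above into the default Maven config location,
# so every mvn invocation inside the image picks up the ODPi mirror.
COPY settings.xml /root/.m2/settings.xml

# Alternatively, pass it explicitly per invocation:
# RUN mvn -s /root/.m2/settings.xml clean install
```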

planetf1 commented 5 years ago

I think https://github.com/odpi/egeria/issues/1251 would fix the ranger issue?

cmgrote commented 5 years ago

I think odpi/egeria#1251 would fix the ranger issue?

For sure a change we should make, as it gives us a consistent start point for the build images (so the splits that are in progress will very rarely ever need to do the expensive re-build of the "build" images). However, the initial build of those "build" images will still be time-consuming because we're still building from source rather than just deploying some pre-built binaries...

schannamallu commented 5 years ago

I think odpi/egeria#1251 would fix the ranger issue?

For sure a change we should make, as it gives us a consistent start point for the build images (so the splits that are in progress will very rarely ever need to do the expensive re-build of the "build" images). However, the initial build of those "build" images will still be time-consuming because we're still building from source rather than just deploying some pre-built binaries...

@cmgrote, even though we split the build, we are still building the base images fairly often, and I assume Jenkins may take a long time (hours) to build those base images.

So I thought adding a settings.xml would help Jenkins speed up the process.

cmgrote commented 4 years ago

@schannamallu apologies, I'm not sure what the settings.xml does in regards to the Jenkins process (?) What we'd ideally have is some way of avoiding Jenkins going through the first Docker.build image construction unless there is a change to that Docker.build file (if there is no change to this file, because we're building the docker image from downloaded content that will be static (not dynamically-changing code from a Git repo), there will be no change to the resulting image -- so no reason to re-build it).

The only repeated re-building we should be doing is the Dockerfile piece of the split: this should pick up the pre-built (or very rarely re-built) image that results from the Docker.build process directly from Docker hub as its first step. This is a potentially significant download, but not any actual "work".

(In fact, even for these non-Egeria images I think we should only bother re-running the build if the Dockerfile itself changes.)

Leads to a couple of questions:

  1. Is there some way we can configure Jenkins to only run these Docker builds if some limited set of files in the source tree are changed, rather than on every commit / merge?
  2. I've also noticed that currently the apache-atlas image build is consistently failing, which appears to be down to network timeouts against www.apache.org -- which I can only assume is because we've somehow been blocked / blacklisted (the site always seems to be up and immediately responsive from my end)... Any ideas there?
schannamallu commented 4 years ago

@cmgrote I am working on that; I will update you soon.

cmgrote commented 4 years ago

I'm wondering if it would help if we more cleanly separated out the build images from the runtime images? For example: put them (the Docker.build) into their own separate directories, with their own pom.xml build structure, etc.

(This would probably help our own organisation of the images as well, I'm thinking, so am already considering making the change.)

schannamallu commented 4 years ago

@schannamallu apologies, I'm not sure what the settings.xml does in regards to the Jenkins process (?) What we'd ideally have is some way of avoiding Jenkins going through the first Docker.build image construction unless there is a change to that Docker.build file (if there is no change to this file, because we're building the docker image from downloaded content that will be static (not dynamically-changing code from a Git repo), there will be no change to the resulting image -- so no reason to re-build it).

The only repeated re-building we should be doing is the Dockerfile piece of the split: this should pick up the pre-built (or very rarely re-built) image that results from the Docker.build process directly from Docker hub as its first step. This is a potentially significant download, but not any actual "work".

(In fact, even for these non-Egeria images I think we should only bother re-running the build if the Dockerfile itself changes.)

Leads to a couple of questions:

  1. Is there some way we can configure Jenkins to only run these Docker builds if some limited set of files in the source tree are changed, rather than on every commit / merge?
  2. I've also noticed that currently the apache-atlas image build is consistently failing, which appears to be down to network timeouts against www.apache.org -- which I can only assume is because we've somehow been blocked / blacklisted (the site always seems to be up and immediately responsive from my end)... Any ideas there?

Please find below comments:

  1. Yes, it is possible - we can track certain files or directories using regular expressions; for example see https://github.com/lfit/releng-global-jjb/blob/master/jjb/lf-ci-jobs.yaml#L123

  2. I asked the Apache team to unblock the Jenkins IP (199.204.45.224) in case they had blocked it. From their logs they replied: "199.204.45.224 2 month ago Too many repositoy.a.o visits (76080 >= 75000)". So I can say we are exceeding their limits.

My suggestions are:

  1. Releng maintains a Nexus repo to avoid such issues; we can host the Apache packages (tar.gz, keys) in the Nexus repo and download them during the build.
  2. If a new version appears on apache.org, we need to upload the new tar.gz to Nexus.
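As a sketch of the path-based triggering mentioned in point 1 (following the releng-global-jjb convention linked above; the job name, parameter layout, and path pattern here are assumptions, not the actual ODPi configuration):

```yaml
# Illustrative JJB fragment: trigger the atlas docker job only when files
# under its own docker directory change. Names and paths are hypothetical.
- project:
    name: egeria-apache-atlas-maven-docker
    gerrit_trigger_file_paths:
      - compare-type: REG_EXP
        pattern: 'open-metadata-resources/open-metadata-deployment/docker/apache-atlas/.*'
```

GitHub-triggered jobs have analogous included-region settings, so the same idea should carry over whichever trigger type the jobs use.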
cmgrote commented 4 years ago

Yes, it is possible we can track the certain files or directories using a regex expressions

Excellent! I'm just doing some re-factoring so we should have clearly-delineated directories to be able to base the change triggering on - I'll come back on this one once that refactoring is done. Just to confirm: some of the images (like the docker/egeria/Dockerfile) we would want to re-build on all commits, but most (eg. docker/apache-atlas/Dockerfile) we only want to re-build if the files under docker/apache-atlas/... change - can we set up that level of granularity?

Releng is maintaining nexus repo to avoid such issues, we can host the apache packages (tar.gz ,Keys) in the nexus repo and download the packages during build.

Fine by me -- the Apache project releases against which we would build are fairly infrequent at the moment (multiple months in between), so I think this should be fairly easy to handle even manually for now...

schannamallu commented 4 years ago

@cmgrote @planetf1 The docker jobs are successful now.

https://jenkins.odpi.org/view/egeria-docker/

So I am closing the issue, since it has been open for months. In case of any issues, please create a new ticket.