opensearch-project / opensearch-build-libraries

Apache License 2.0

[BUG] Incremental build copies the entire distribution folder from S3 without cleaning up component zips that require a rebuild #455

Closed peterzhuamazon closed 1 month ago

peterzhuamazon commented 2 months ago

In the current 3.0.0 build, alerting succeeded in building the zip but failed on the next step, publishing the zip to Maven: https://github.com/opensearch-project/alerting/issues/1599

This means the alerting build was marked as failed, so the new plugin zip was not copied to <distribution>/builds/opensearch/plugins, even though it actually built successfully: https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10015/pipeline/69

However, due to the incremental build, even though alerting requires a rebuild, the entire distribution folder, including the older alerting plugin zip, is copied from S3.

Since the new alerting zip did not override the old one, the old alerting zip was pushed to S3 again, treated as the new alerting zip, and recorded in the build manifest.

This ends up creating a Docker image with an old alerting zip installed, which causes jar hell: https://github.com/opensearch-project/security-analytics/actions/runs/9899572936/job/27348744950?pr=1156#step:6:124


#8 1.202 -> Installing file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 1.203 -> Downloading file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.175 -> Failed installing file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.175 -> Rolling back file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.183 -> Rolled back file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.183 Exception in thread "main" java.lang.IllegalStateException: failed to load plugin opensearch-security-analytics due to jar hell
#8 2.183    at org.opensearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:693)
#8 2.183    at org.opensearch.plugins.InstallPluginCommand.jarHellCheck(InstallPluginCommand.java:861)
#8 2.183    at org.opensearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:829)
#8 2.183    at org.opensearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:874)
#8 2.184    at org.opensearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:276)
#8 2.184    at org.opensearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:250)
#8 2.184    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
#8 2.184    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
#8 2.184    at org.opensearch.cli.MultiCommand.execute(MultiCommand.java:104)
#8 2.184    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
#8 2.184    at org.opensearch.cli.Command.main(Command.java:101)
#8 2.184    at org.opensearch.plugins.PluginCli.main(PluginCli.java:66)
#8 2.184 Caused by: java.lang.IllegalStateException: jar hell!
#8 2.184 class: com.google.common.annotations.Beta
#8 2.184 jar1: /usr/share/opensearch/plugins/opensearch-job-scheduler/guava-32.1.3-jre.jar
#8 2.184 jar2: /usr/share/opensearch/plugins/opensearch-alerting/guava-32.0.1-jre.jar
#8 2.184    at org.opensearch.bootstrap.JarHell.checkClass(JarHell.java:316)
#8 2.184    at org.opensearch.bootstrap.JarHell.checkJarHell(JarHell.java:215)
#8 2.184    at org.opensearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:675)
#8 2.184    ... 11 more
#8 ERROR: process "/bin/sh -c /usr/share/opensearch/bin/opensearch-plugin install --batch file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip" did not complete successfully: exit code: 1

This happened even though the underlying alerting issue was fixed a month ago: https://github.com/opensearch-project/alerting/pull/1571

We should remove the zips of plugins that require a rebuild right after the initial S3 copy, to avoid this invalid recording: https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/distribution-build-opensearch/runs/10015/nodes/69/steps/484/log/?start=0
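A minimal sketch of that cleanup, with hypothetical paths and helper names (not the actual opensearch-build code), assuming plugin zips follow the `opensearch-<component>-<version>.zip` naming convention:

```python
import glob
import os


def prune_rebuild_zips(dist_dir: str, components_to_rebuild: list[str]) -> list[str]:
    """Delete cached plugin zips for components flagged for rebuild,
    so a failed rebuild cannot silently reuse a stale copy from S3."""
    removed = []
    plugins_dir = os.path.join(dist_dir, "builds", "opensearch", "plugins")
    for component in components_to_rebuild:
        # Assumed naming convention: opensearch-<component>-<version>.zip
        pattern = os.path.join(plugins_dir, f"opensearch-{component}-*.zip")
        for path in glob.glob(pattern):
            os.remove(path)
            removed.append(path)
    return removed
```

Run right after the S3 copy and before the build step, any component that then fails to rebuild simply leaves no zip behind, instead of a stale one that the recorder picks up.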

Thanks.

peterzhuamazon commented 2 months ago

Hi @zelinh, would you mind taking a look, since you implemented the incremental build?

Thanks

zelinh commented 2 months ago

I think that, according to our build workflow, a newly built plugin should override the previous artifacts by default. This might be a Windows-specific issue.

peterzhuamazon commented 2 months ago

This is not a Windows issue but rather a conflict with --continue-on-error.

zelinh commented 2 months ago

Yes. I discussed this with @peterzhuamazon offline, and it is a conflict with the --continue-on-error argument in the build workflow. When a non-essential plugin fails, the build workflow continues, so there is no new artifact to override the old one. The assemble workflow then assembles the old artifacts and causes the issue.
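The failure mode can be sketched in a few lines (hypothetical names and zip strings, not the actual opensearch-build logic):

```python
def collect_bundle_artifacts(cache: dict[str, str],
                             build_results: dict[str, bool],
                             continue_on_error: bool) -> dict[str, str]:
    """Sketch of the conflict described above: the recorder keeps whatever
    zip is in the folder, so a failed rebuild under --continue-on-error
    leaves the stale cached zip in the bundle."""
    bundle = dict(cache)  # entire distribution folder is copied from S3 first
    for component, succeeded in build_results.items():
        if succeeded:
            bundle[component] = f"{component}-new.zip"  # fresh zip overrides cache
        elif not continue_on_error:
            raise RuntimeError(f"build failed for {component}")
        # else: failure is ignored, so the stale cached zip stays and is recorded
    return bundle
```

With `continue_on_error=True` and a failed `alerting` build, the returned bundle still contains the old cached `alerting` zip, which is exactly what was recorded into the build manifest here.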

prudhvigodithi commented 2 months ago

> In current 3.0.0, alerting succeed on building the zip, but failed on the next step of publishing zip to maven

@zelinh @peterzhuamazon With this, the alerting plugin should be treated as a failure, right? I can see `Error building alerting` in the logs, so how is the incremental build fetching it? Even with --continue-on-error, it should just continue building the next plugin in the manifest. The incremental build should not even pull the previously failed alerting plugin, right?

zelinh commented 2 months ago

> In current 3.0.0, alerting succeed on building the zip, but failed on the next step of publishing zip to maven
>
> @zelinh @peterzhuamazon With this the alerting plugin should be treated as failure right and I can see this in the logs Error building alerting, but how come the incremental build fetching this? Even though with continue-on-error, it should continue to build the next plugin in the manifest. The incremental should not even pull the past failed alerting plugin right?

The first step of an incremental build is to download the previously built artifacts to the local workspace and then start building incrementally based on the previous manifest. I believe that is where the previously successful artifacts were introduced.

prudhvigodithi commented 2 months ago

> The first step of running incremental build is to download the previous built artifacts into local; and then start building incrementally based on the previous manifest. So I believe that's where the previous successful built artifacts was introduced.

Right, but it should not even pull in the case of alerting, right, Zelin? Since the alerting build failed (even though the artifacts were pushed to S3), the incremental build should not pull alerting at all, right?

peterzhuamazon commented 2 months ago

> The first step of running incremental build is to download the previous built artifacts into local; and then start building incrementally based on the previous manifest. So I believe that's where the previous successful built artifacts was introduced.
>
> Right, it should not even pull here in case for alerting right Zelin? Since the alerting build failed (even though the artifacts are pushed to s3) the incremental should not even pull the alerting right?

It pulled because the S3 copy happens beforehand and the cache is not removed, so the recorder just records whatever is in the tar folder. The old cached zip was mistakenly treated as new because the override never happened.

prudhvigodithi commented 2 months ago

Got it, thanks @peterzhuamazon and @zelinh. I might be missing something here, but if the past build is red (failed) for a component (in this case alerting), it should not even look at the cache/recorder; it should blindly start a new build on the next run with the incremental flag.

peterzhuamazon commented 2 months ago

> Got it thanks @peterzhuamazon and @zelinh, I might be missing something here, but if the past build is red (failed) for a component (here in this case for alerting), it should not even look for cache/recorder etc, it should blindly start the new build next time with incremental flag.

This is more of a side effect of Jenkins, as we use the S3 copy function to copy the entire tar folder from S3 to the agent. We need to look into cleaning up the plugins that are labeled for rebuild instead of blindly copying the entire tar folder over.

Thanks.

zelinh commented 1 month ago

I am able to reproduce this issue when enabling both --continue-on-error and --incremental. I will look into this and see how we can make them compatible in this case.

zelinh commented 1 month ago

After discussing offline with @peterzhuamazon, I believe this behavior is by design. When new commits come in:

- With only --incremental enabled, a failing plugin stops the build from creating a new bundle, since the build workflow fails.
- With both --incremental and --continue-on-error enabled, a non-essential failing plugin is ignored and the workflow continues building the other plugins. The new bundle is created with the previously successful plugin build(s) where applicable. We also cut a build-failure ticket to the respective plugin repo and mark the distribution-build job Unstable. In this situation, a plugin with new commits is still included in the bundle if it passed before, but the previously built artifact may not include its latest commits. Plugin teams need to actively fix any build errors based on the GitHub issues we cut to them.

Closing this issue for now.