opensearch-project / opensearch-build

🧰 OpenSearch / OpenSearch-Dashboards Build Systems

[Bug]: plugin that failed to build is still assembled #5091

Open ruanyl opened 1 month ago

ruanyl commented 1 month ago

Describe the bug

Checking this pipeline at the build.sh step: https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10377/pipeline/151 The build of security-analytics failed:

2024-10-10 02:27:14 ERROR    ERROR: Command 'bash /tmp/tmps7v24_y5/security-analytics/scripts/build.sh -v 3.0.0 -p linux -a x64 -s false -o builds' returned non-zero exit status 1.
2024-10-10 02:27:14 ERROR    Error building security-analytics, retry with: ./build.sh manifests/3.0.0/opensearch-3.0.0.yml --component security-analytics

However, the plugin was still installed in the assemble.sh step, in https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10377/pipeline/963:

2024-10-10 02:35:50 INFO     Installing security-analytics
2024-10-10 02:35:50 INFO     Executing "/tmp/tmpmiej8_16/opensearch-3.0.0/bin/opensearch-plugin install --batch file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip" in /tmp/tmpmiej8_16/opensearch-3.0.0
-> Installing file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip
-> Downloading file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip
-> Installed opensearch-security-analytics with folder name opensearch-security-analytics

Shouldn't the plugin be excluded if it failed to build?

I'm now hitting a runtime issue when running the 2.18.0 and 3.0.0 Docker images, which looks related:

[2024-10-10T03:45:39,749][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] fatal error in thread [main], exiting
java.util.ServiceConfigurationError: org.apache.lucene.codecs.Codec: Provider org.opensearch.securityanalytics.correlation.index.codec.correlation950.CorrelationCodec950 could not be instantiated
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586) ~[?:?]
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:813) ~[?:?]
    at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:729) ~[?:?]
    at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1403) ~[?:?]
    at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:68) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
    at org.apache.lucene.codecs.Codec.reloadCodecs(Codec.java:136) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
    at org.opensearch.plugins.PluginsService.reloadLuceneSPI(PluginsService.java:767) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.plugins.PluginsService.loadBundle(PluginsService.java:719) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.plugins.PluginsService.loadBundles(PluginsService.java:545) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.plugins.PluginsService.<init>(PluginsService.java:197) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.node.Node.<init>(Node.java:524) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.node.Node.<init>(Node.java:451) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-3.0.0.jar:3.0.0]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-3.0.0.jar:3.0.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-3.0.0.jar:3.0.0]
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/codecs/lucene99/Lucene99Codec
    at org.opensearch.securityanalytics.correlation.index.codec.correlation950.CorrelationCodec950.<clinit>(CorrelationCodec950.java:14) ~[?:?]
    at java.base/jdk.internal.misc.Unsafe.ensureClassInitialized0(Native Method) ~[?:?]
    at java.base/jdk.internal.misc.Unsafe.ensureClassInitialized(Unsafe.java:1160) ~[?:?]

To reproduce

Run the 2.18.0 and 3.0.0 OpenSearch Docker images.

Expected behavior

No response

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

peterzhuamazon commented 1 month ago

This is the same issue I described here:

There are multiple things happening here:

  1. In the build workflow, if you enable incremental builds, a previously successful copy of the artifacts is pulled from S3.
  2. In an ideal scenario, if any plugin rebuild fails, the run should stop there, preventing the previously successful copy from being used when the current build failed.
  3. In reality, continue-on-error was enabled alongside incremental, so if a plugin build fails, the workflow moves on to the next plugin without failing the whole pipeline.
  4. This creates a weird scenario: pluginA failed, but its previous good copy is on disk due to incremental, which causes the build recorder to record it into the build manifest; and the build recorder stays in action because the pipeline does not fail, thanks to continue-on-error.
  5. The assemble workflow starts by parsing the build manifest, and will treat the past successful version as the current successful version for a plugin that failed the current build and should have been marked as a failure.
  6. Another edge case: someone didn't include pluginA in the 2.18.0 input manifest, but because pluginA has a good copy from the previous 2.17.1 build, it is still pulled into the build from S3 by incremental and treated as a success by the build recorder and the assemble workflow.

Involving @zelinh again to see if there is any better way to solve this. We should probably remove the zips that are not in the input manifest, as well as the zips that are meant to be rebuilt, to avoid the cache polluting new builds.
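A rough sketch of that pruning idea, just to make it concrete (the function name, the per-component folder layout under the build directory, and the argument names are all hypothetical, not the actual incremental-build code):

```python
import os
import shutil


def prune_stale_artifacts(build_dir, input_components, rebuild_components):
    """Delete cached per-component output folders that must not leak into this build.

    Anything not listed in the input manifest, or anything scheduled for a rebuild,
    is removed so a failed rebuild cannot silently fall back to a stale copy.
    """
    for entry in os.listdir(build_dir):
        path = os.path.join(build_dir, entry)
        if not os.path.isdir(path):
            continue
        if entry not in input_components or entry in rebuild_components:
            print(f"Pruning cached artifacts for '{entry}' from {path}")
            shutil.rmtree(path)
```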

Thanks.

gaiksaya commented 1 month ago

I believe it was by design to include the previously built component (using incremental) if the build for that plugin's new commit is failing. We can still produce a complete bundle using the previous commit, which is very much a trait of nightly-built artifacts. Logging of the failure needs to be better so that it's clear what is actually being installed. If SA failed to build and a previous copy is being installed, that is expected and should be okay, but the user needs to be informed. Incremental and continue-on-error can go hand in hand: we do not want to fail the entire workflow for a single component, but we also want to install a previous copy if one exists.
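One way the "inform the user" part could look, roughly (hypothetical helper and manifest fields, not the actual build recorder API):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def record_component(manifest: dict, name: str, built_now: bool,
                     previous_build_id: Optional[str] = None) -> None:
    """Record a component into the build manifest, flagging carried-over copies.

    When a component failed to build in this run and a previously built copy is
    reused, mark it explicitly and log a warning so the assemble step (and anyone
    reading the logs) knows what is actually being installed.
    """
    entry = {"name": name, "carried_over": not built_now}
    if not built_now:
        entry["from_build"] = previous_build_id
        logger.warning("Component %s failed to build; reusing artifact from build %s",
                       name, previous_build_id)
    manifest.setdefault("components", []).append(entry)
```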

Also adding @dblock to get some suggestions on what the better approach should be.

ruanyl commented 1 month ago

@peterzhuamazon Thanks!

We should probably remove the zips that are not in the input manifest, as well as the zips that are meant to be rebuilt, to avoid the cache polluting new builds.

  1. Will the build manifest contain the zips of the plugins that failed to build?
  2. Aren't the zips uploaded to S3 "versioned" by the build number? I can see the base URL is https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/3.0.0/10377/linux/x64 — I guess this is where the zips are stored? If so, how does it resolve to an earlier zip when the current build fails?
prudhvigodithi commented 1 month ago

Some more discussion related to the same topic is part of this issue https://github.com/opensearch-project/opensearch-build-libraries/issues/455.

ruanyl commented 1 month ago

Hi @gaiksaya, thanks!

I believe it was by design to include the previously built component (using incremental) if the build for that plugin's new commit is failing.

I kind of get the point of doing this, since we can still produce a complete bundle. But a broken Docker image feels worse than missing certain features. Perhaps in most cases using a previously built component won't result in a runtime error, and that's why it's by design? By the way, shall we also publish a Docker image tag with the build number, so that people can easily revert when they encounter an issue?

gaiksaya commented 1 month ago

[Triage] Previous discussion of this behavior: https://github.com/opensearch-project/opensearch-build-libraries/issues/455#issuecomment-2286891453. Nightly artifacts are expected to be unstable/broken; that's how we catch issues and raise them with component teams. We are working on adding smoke tests at the distribution level that would detect whether a given artifact is valid or not. A long-term plan could be to put those artifacts under something like /valid per version. Adding @zelinh, who is working on the smoke-testing framework.
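For reference, a very small smoke check along those lines could simply compare what a manifest claims against what the running cluster reports (sketch only, under the assumption of a local cluster on :9200 with security disabled; the actual smoke-test framework will look different, and manifest component names such as "security-analytics" may need mapping to installed plugin names such as "opensearch-security-analytics"):

```python
import json
import urllib.request

import yaml  # PyYAML


def installed_plugins(endpoint="http://localhost:9200"):
    """Plugin names reported by the running cluster via the _cat/plugins API."""
    with urllib.request.urlopen(f"{endpoint}/_cat/plugins?format=json") as resp:
        return {row["component"] for row in json.load(resp)}


def expected_plugins(manifest_path):
    """Component names listed in the manifest."""
    with open(manifest_path) as f:
        return {c["name"] for c in yaml.safe_load(f).get("components", [])}


missing = expected_plugins("manifests/3.0.0/opensearch-3.0.0.yml") - installed_plugins()
if missing:
    raise SystemExit(f"Components not reported by the running cluster: {missing}")
```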

ruanyl commented 1 month ago

Nightly artifacts are expected to be unstable/broken.

Thanks @gaiksaya, that's a fair point. When pushing the Docker image tag, does it make sense to also push a tag with the build number? That would help revert to a previous valid version. Or is there any suggestion on how to revert now?

gaiksaya commented 1 month ago

What is the use case here? Where are the Docker images being used?

ruanyl commented 1 month ago

@gaiksaya I'm using the Docker image from https://hub.docker.com/r/opensearchstaging/opensearch. We use 3.0.0 (main) or the current 2.18.0 (2.x) to set up clusters as development/testing/demo environments for OSD features on the main/2.x branches.

gaiksaya commented 1 month ago

I would recommend using the validation workflow present in this repo to make sure the artifacts you are deploying are valid. We use a similar one in the nightly playgrounds workflow. However, I recently encountered a bug related to OSD: https://github.com/opensearch-project/opensearch-build/issues/5117

dblock commented 4 weeks ago

Related, https://github.com/opensearch-project/opensearch-build/issues/5130.

I think, as a consumer of any Docker staging build, I'd like to know:

  1. Which plugins built successfully and were included in the build.
  2. Which plugins are in the situation described in this issue, i.e. didn't build but a previous version was included.
  3. Overall, whether this is a complete build without failures, meaning a potential beta/demo/release candidate (a rough sketch of reading this off a build manifest follows the list).
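If the build manifest carried a per-component status, such as the carried-over flag discussed earlier in the thread (hypothetical today, not an existing manifest field), a consumer could answer all three questions directly from it. A minimal sketch, assuming that flag exists:

```python
import yaml  # PyYAML


def summarize(manifest_path):
    """Summarize a build manifest that has a hypothetical 'carried_over' flag."""
    with open(manifest_path) as f:
        components = yaml.safe_load(f).get("components", [])
    built = [c["name"] for c in components if not c.get("carried_over")]
    carried = [c["name"] for c in components if c.get("carried_over")]
    print("Built in this run:           ", ", ".join(built) or "(none)")
    print("Carried over from past build:", ", ".join(carried) or "(none)")
    print("Complete, failure-free build:", "yes" if not carried else "no")
```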