opensearch-project / opensearch-build

🧰 OpenSearch / OpenSearch-Dashboards Build Systems
Apache License 2.0

[Bug]: CCR plugin remoteIntegTest fails in deb and rpm distributions because multiple clusters do not form #4610

Open nisgoel-amazon opened 7 months ago

nisgoel-amazon commented 7 months ago

### Describe the bug

From the 2.12 release onwards, the Cross Cluster Replication plugin's remoteIntegTest has been failing. During release activity, the tests fail with java.net.ConnectException: Connection refused. The cause is that these integration tests need multiple clusters, but the pre-setup step creates only one cluster at a time. The logs show that when the 2nd cluster comes up, the opensearch-build package removes the previously created cluster. https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/integ-test/runs/7981/nodes/122/steps/765/log/?start=0

In the log above, we can see that after the 1st cluster returns 200 and before the 2nd cluster is created, the pre-removal script in the Debian distribution removes the 1st cluster. These lines are printed in that log file:

Removing opensearch (2.13.0) ...
Running OpenSearch Pre-Removal Script
Stop existing opensearch.service
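
One way to confirm this symptom after the setup step is to probe each cluster's HTTP port and see which ones still accept connections. The following is a minimal sketch, not part of opensearch-build: the cluster names and the ports 9200 and 9201 are assumptions chosen for illustration, and a closed port here corresponds to the `Connection refused` error the tests report.

```python
import socket
from typing import Dict, Tuple


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False


def probe_clusters(endpoints: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each (hypothetical) cluster name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, (host, port) in endpoints.items()}


if __name__ == "__main__":
    # Assumed endpoints for two clusters on one host; adjust to the real setup.
    status = probe_clusters({
        "cluster-1": ("127.0.0.1", 9200),
        "cluster-2": ("127.0.0.1", 9201),
    })
    for name, reachable in status.items():
        print(f"{name}: {'up' if reachable else 'connection refused or unreachable'}")
```

If the first cluster was removed when the second was installed, only one of the two entries would be expected to report as up.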

### To reproduce

We can reproduce this by running this command:


./test.sh integ-test manifests/2.13.0/opensearch-2.13.0-test.yml --paths opensearch=https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.13.0/latest/linux/x64/deb --component cross-cluster-replication

### Expected behavior

The above command should create all of the required clusters before running the cross-cluster integration tests.

bbarani commented 7 months ago

Thanks for opening an issue. We will look into it when we have bandwidth. CCR is a unique use case, so I would appreciate it if your team could contribute the fix as before. We prioritize closing gaps for generic use cases but need your team's support to close specialized use cases. Let us know if you need any help.

peterzhuamazon commented 4 months ago

After discussion with Nandan Kumar, we will go ahead and ignore the CCR test for deb and rpm; he will raise a PR.

peterzhuamazon commented 2 months ago

Hi @nisgoel-amazon, is there any progress on getting CCR testing on remote clusters working for deb and rpm?

Thanks.

nisgoel-amazon commented 2 months ago

@peterzhuamazon This needs an infra-side change. We need help from the infra team to understand why multiple clusters are not coming up on the same node in deb and rpm. We have analysed why the CCR repo tests are failing on deb and rpm. Can you help us scope the effort for this issue?

Then I think @ankitkala can assign someone to pick up the change.

nisgoel-amazon commented 2 months ago

@peterzhuamazon can you confirm one thing: as of today, can we create a multi-node cluster on the same node in deb and rpm? That is, OpenSearch processes running on different ports to form a cluster on a single node in deb and rpm?

peterzhuamazon commented 2 months ago

Not unless you heavily modify the existing deb/rpm package; you can't run multiple instances of it on a single host. You have to run them on multiple hosts, which would probably require a CDK to set things up just for CCR on deb/rpm.

If you modify the pkg, it defeats the purpose of integTest, because you are testing something that the customer will not use in the same way.

nisgoel-amazon commented 2 months ago

No, this would not defeat the purpose of the integ test, since we need 2 clusters to run the CCR plugin. It doesn't matter whether we run the 2 clusters on different hosts or configure them on different ports on the same host.

We do the same thing in the win and tar distributions too, and that serves our purpose.
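
To make the same-host layout concrete, here is a minimal sketch of how per-cluster settings could be generated so that two clusters do not collide on ports or data directories. It is an illustration under assumptions, not the setup the CCR tests or opensearch-build actually use: the cluster names `leader` and `follower`, the base ports, and the `/tmp/ccr-test` directory are all hypothetical. The setting keys themselves (`cluster.name`, `http.port`, `transport.port`, `path.data`, `path.logs`, `discovery.type`) are standard OpenSearch settings.

```python
from typing import Dict, Union

Setting = Union[str, int]


def cluster_settings(name: str, index: int, base_dir: str = "/tmp/ccr-test") -> Dict[str, Setting]:
    """Build settings for one single-node cluster, offset by `index` to avoid clashes."""
    return {
        "cluster.name": name,
        "node.name": f"{name}-node-0",
        # Each cluster gets its own HTTP and transport port (assumed base ports).
        "http.port": 9200 + index,
        "transport.port": 9300 + index,
        # Separate data and log directories keep the clusters independent.
        "path.data": f"{base_dir}/{name}/data",
        "path.logs": f"{base_dir}/{name}/logs",
        "discovery.type": "single-node",
    }


def render_yaml(settings: Dict[str, Setting]) -> str:
    """Render flat settings as simple `key: value` lines for opensearch.yml."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    for i, cluster in enumerate(["leader", "follower"]):
        print(f"# opensearch.yml for {cluster}")
        print(render_yaml(cluster_settings(cluster, i)))
```

With a tar distribution, each rendered file would go into its own extracted copy of the distribution. The deb/rpm packages install one system-wide copy with a single service, which is why this approach does not carry over to them directly.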

Can you suggest how we can set up CDK to run CCR on deb/rpm?

peterzhuamazon commented 2 months ago

You misunderstand. Our current integTest framework specifically runs every test on 1 host, which you cannot do for CCR on deb and rpm.

If you want it to work for CCR, you have to:

  1. Either heavily modify the deb/rpm pkg so that multiple instances can run on the same host
  2. Or implement multi-host support in our integTest architecture. Using CDK is just an example, since our opensearch-build code is designed to run on a single host.
  3. Or create a specific Jenkins workflow and modify opensearch-build so that you can deploy to multiple Jenkins agents/containers, retrieve the IP of each agent/container, and test remotely.

The reason I suggest CDK is that it makes it easy to retrieve separate host IPs so you can run the tests remotely. I am still not sure what changes would be needed to make this happen, as the CCR team has more expertise in how the CCR tests work.

Happy to discuss this further on a call.

Thanks.

nisgoel-amazon commented 2 months ago

I had a word with @peterzhuamazon on this one; we have multiple ways to fix this problem. Peter suggested setting up our own infra via CDK and then changing opensearch-build to pass those node IPs to our remote tests.
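
As a rough illustration of the "pass node IPs to the remote test" idea, the sketch below turns a list of host IPs into named cluster endpoints that a test runner could consume. Everything here is hypothetical: `ClusterEndpoint`, `endpoints_from_ips`, the `leader`/`follower` names, the port 9200 and the `https` scheme are assumptions for illustration, and none of it is an existing opensearch-build interface.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ClusterEndpoint:
    """One remote cluster, identified by a name and the host it runs on."""
    name: str
    host: str
    port: int = 9200        # assumed HTTP port
    scheme: str = "https"   # assumed scheme

    @property
    def url(self) -> str:
        # IPv6 literals must be bracketed inside a URL.
        host = f"[{self.host}]" if ":" in self.host else self.host
        return f"{self.scheme}://{host}:{self.port}"


def endpoints_from_ips(
    ips: Sequence[str],
    names: Sequence[str] = ("leader", "follower"),
) -> List[ClusterEndpoint]:
    """Pair each IP with a cluster name, validating the inputs first."""
    if len(ips) != len(names):
        raise ValueError(f"expected {len(names)} IPs, got {len(ips)}")
    endpoints = []
    for name, ip in zip(names, ips):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        endpoints.append(ClusterEndpoint(name=name, host=ip))
    return endpoints


if __name__ == "__main__":
    # Example IPs only; real values would come from the provisioned hosts.
    for endpoint in endpoints_from_ips(["10.0.0.11", "10.0.0.12"]):
        print(endpoint.name, endpoint.url)
```

Validating the IPs up front means a provisioning mistake surfaces before any test runs, rather than as a connection error partway through.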