redhat-best-practices-for-k8s / certsuite-sample-workload

This repository provides the infrastructure to create a Certsuite test partner Pod.

deploy-partner-pods.sh should check if partner pod is already created #40

Closed: ramperher closed this issue 3 years ago

ramperher commented 3 years ago

In deploy-partner-pods.sh, the namespace is only created if it is not already present, but there is no equivalent check for the partner pod.
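
For reference, the behavior described above amounts to something like the following sketch (paraphrased from the log further down, not the actual script; the manifest name partner-pod.yaml is illustrative):

# Namespace creation is guarded (this check exists today).
if ! oc get namespace "$TNF_PARTNER_NAMESPACE" > /dev/null 2>&1; then
  oc create namespace "$TNF_PARTNER_NAMESPACE"
else
  echo "namespace $TNF_PARTNER_NAMESPACE already exists, no reason to recreate"
fi

# The partner deployment, however, is applied unconditionally.
oc apply -n "$TNF_PARTNER_NAMESPACE" -f partner-pod.yaml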

Having said that, I am experiencing problems with the containerized version of test-network-function (v3.0.0), which seems to use the scripts in this repository to create the namespace and partner pod.

In my case, I created both the namespace and the partner pod before executing the test-network-function container. When deploy-partner-pods.sh ran inside the container, the namespace was not created again because it already existed, but the same check did not apply to the partner pod, which was created again.

Here is the relevant log. Note that I was in a disconnected environment, so the second creation of the partner pod failed because the image could not be pulled from quay.io.

-k /home/dciteam/clusterconfigs-cluster5/kubeconfig
-t /var/lib/dci-pipeline/test_network_function/test-network-function
-o /var/lib/dci-pipeline/test_network_function/test-network-function
-f operator platform-alteration observability lifecycle networking access-control diagnostic 
The $TNF_CONTAINER_CLIENT environment variable is not set; defaulting to use: podman
Mounting 1 kubeconfig volume(s):
-v /home/dciteam/clusterconfigs-cluster5/kubeconfig:/usr/tnf/kubeconfig/config:Z
+ podman run --rm --network host -v /home/dciteam/clusterconfigs-cluster5/kubeconfig:/usr/tnf/kubeconfig/config:Z -v /var/lib/dci-pipeline/test_network_function/test-network-function:/usr/tnf/config:Z -v /var/lib/dci-pipeline/test_network_function/test-network-function:/usr/tnf/claim:Z -e KUBECONFIG=/usr/tnf/kubeconfig/config -e TNF_MINIKUBE_ONLY=false -e TNF_NON_INTRUSIVE_ONLY=True -e TNF_DISABLE_CONFIG_AUTODISCOVER=false -e TNF_PARTNER_NAMESPACE= -e TNF_OC_DEBUG_IMAGE_ID=registry.dfwt5g.lab:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f5ce898fbad3671fecd6f797130f950fb1abdbaf0d8154c22e2b59c74e3a918 -e LOG_LEVEL=debug -e PATH=/usr/bin:/usr/local/oc/bin quay.io/testnetworkfunction/test-network-function:latest ./run-cnf-suites.sh -o /usr/tnf/claim -f operator platform-alteration observability lifecycle networking access-control diagnostic -s
make: Entering directory '/usr/tnf-partner/src'
bash ./deploy-partner-pods.sh
namespace tnf already exists, no reason to recreate
deployment.apps/partner configured
pod/partner-8646fb4668-jdcnc condition met
error: timed out waiting for the condition on pods/partner-68cf756959-6cz96
make: *** [Makefile:8: install-partner-pods] Error 1
make: Leaving directory '/usr/tnf-partner/src'

The result was that I had two partner pods running in the tnf namespace. Then, when the suite was about to execute, it printed the following error:

time="2021-09-23T12:04:48Z" level=fatal msg="failed to identify a single test orchestrator container: expected exactly one container, got 2 for label test-network-function.com/generic=orchestrator"

So, in the end, I was not able to execute the suites because there were two partner pods running, and I cannot rely on the one provided by the test-network-function container because we are working in a disconnected environment.

The proposed solution is to add a check in deploy-partner-pods.sh that avoids recreating the partner pod if one is already present in the namespace (e.g. by looking for pods with the test-network-function.com/generic=orchestrator label); a sketch follows.
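
A minimal sketch of such a check, assuming the orchestrator label from the error above and the TNF_PARTNER_NAMESPACE variable visible in the container environment:

# Hypothetical guard: skip the apply when a labeled partner pod already exists.
existing=$(oc get pods -n "$TNF_PARTNER_NAMESPACE" \
  -l test-network-function.com/generic=orchestrator --no-headers 2>/dev/null)
if [ -n "$existing" ]; then
  echo "partner pod already exists in $TNF_PARTNER_NAMESPACE, no reason to recreate"
else
  oc apply -n "$TNF_PARTNER_NAMESPACE" -f partner-pod.yaml   # manifest name illustrative
fi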

jc-rh commented 3 years ago

The intent is to allow multiple partner pods as long as they are in different namespaces. The v3.0.0 container image you used is our last released version. To test with the latest, use the image tag option with run-tnf-container.sh:

-i quay.io/testnetworkfunction/test-network-function:unstable

jc-rh commented 3 years ago

As to the partner pod image, yes, it would have to be mirrored locally first for a disconnected environment. But you need that regardless of whether you run the tnf binary or the container, right?
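
For example, the mirroring could be done with skopeo along these lines (the partner image name is an assumption; registry.dfwt5g.lab:5000 is the local registry visible in the log above):

# Copy the partner image from quay.io into the local disconnected registry.
skopeo copy \
  docker://quay.io/testnetworkfunction/cnf-test-partner:latest \
  docker://registry.dfwt5g.lab:5000/testnetworkfunction/cnf-test-partner:latest

The deployment's image field would then be pointed at the local copy.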

ramperher commented 3 years ago

Hi Jun. I understand your point, but the issue is not related to that at all. The problem comes from exactly what you are saying: "the intent is to allow multiple partner pods as long as they are in different namespaces". Let me explain my situation, because maybe I explained myself poorly.

What I mean is: before running run-tnf-container.sh, I had already created a partner pod in the desired namespace. However, run-tnf-container.sh calls deploy-partner-pods.sh, and that script creates a new partner pod in the same namespace without checking whether one already exists. It is of course fine to allow multiple partner pods in different namespaces, but in this case I already have one partner pod in a namespace, and the script tries to create a new one in that same namespace.

Because of this, two things are happening in my scenario:

- Two partner pods end up in the same namespace, which makes the suite fail with the "expected exactly one container, got 2" error shown above.
- The duplicate partner pod tries to pull its image from quay.io, which cannot work in our disconnected environment, so it never reaches Running state.

That's why I was suggesting that deploy-partner-pods.sh include a check that avoids recreating the partner pod when one is already present in the targeted namespace; that would fix this issue.

Alternatively, I don't know if something similar to the TNF_PARTNER_SRC_DIR variable could be added here, so that, if the variable is not set, the script runs without updating the infrastructure (and deploy-partner-pods.sh would install nothing). It is just a suggestion that comes to mind now, of course.
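
A rough sketch of that opt-out (TNF_PARTNER_DEPLOY is a hypothetical variable name, used only to illustrate the idea):

# Hypothetical guard at the top of deploy-partner-pods.sh: if the variable
# is unset, leave any existing infrastructure untouched and exit cleanly.
if [ -z "${TNF_PARTNER_DEPLOY:-}" ]; then
  echo "TNF_PARTNER_DEPLOY not set, skipping partner pod deployment"
  exit 0
fi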

jc-rh commented 3 years ago

I get the second issue now for the container case. We will add a way of overriding the default image repo location. On the first issue, we do limit partner pods to one per namespace right now. Maybe the one you have in the tnf namespace was created a long time ago, when it was still a standalone pod? If its name is just "partner" without the hash suffix, that would be the case, and you can simply delete it. We changed it to a replica-1 deployment, so now it should be named something like partner-68cf756959-kr4k7.
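
In that case the leftover standalone pod could be removed with a one-liner (assuming the tnf namespace):

oc delete pod partner -n tnf --ignore-not-found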

ramperher commented 3 years ago

In my case, I am using the same partner deployment you have defined here. The only difference is that I point the image at our local registry, as we are in a disconnected environment.

The thing is, this partner deployment is created before running the containerized test suite. Then, when the suite runs, the partner deployment is applied again. I guess that what oc apply tries to do is create the new partner pod and, once it reaches Running status, delete the previous one, in order to guarantee service availability. However, as I am in a disconnected environment, the partner pod launched by the containerized CNF Cert Suite cannot pull its image, so it never reaches Running status, and two partner pods remain in the same namespace.
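
As an illustration only (not something the project has adopted): switching the deployment to the Recreate update strategy would delete the old pod before creating the new one, avoiding the two-pod window at the cost of brief downtime:

# Clear rollingUpdate while switching the strategy type; the two are mutually exclusive.
oc patch deployment partner -n tnf --type merge \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'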

In any case, I think that including the kind of logic you already used in previous commits should also fix this problem. Alternatively, the check could consider only running partner pods and discard the ones that are not in a Running state (in my case, two partner pods are detected in the same namespace, but one of them is not Running).
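
Such a filter could look like this sketch (the label comes from the fatal error above):

# List only partner pods that are actually Running; non-Running duplicates are ignored.
oc get pods -n tnf \
  -l test-network-function.com/generic=orchestrator \
  --field-selector=status.phase=Running --no-headers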

ramperher commented 3 years ago

With your last change included in https://github.com/test-network-function/cnf-certification-test-partner/pull/41, I tested the containerized tnf without deploying the partner pod beforehand and can confirm it works: the partner pod is created correctly. So you can close this issue if you want (or I can do it tomorrow). Thanks for your support!