Closed pohly closed 2 years ago
@pohly do you have Prometheus metrics enabled in your operator? If so, is the Prometheus stack set up in the cluster? Just checking whether that is the cause of the issue. Also, could you please share the logs from the InstallPlan if it exists in the cluster?
Hi @pohly,
Could you please let us know?
a) Have you enabled the metrics in your project? If yes, have you installed Prometheus on the cluster? b) Could you provide the project/link for us to try to reproduce your scenario? c) Could you please share the details of the deployment so we can check whether there is a reason for the failure as well?
a) Have you enabled the metrics in your project? If yes, have you installed Prometheus on the cluster?
The operator supports metrics collection and has a "metrics" port - see the manual deployment YAML. Is that what you mean by "enabled the metrics"?
Prometheus is not installed.
b) Could you provide the project/link for us to try to reproduce your scenario?
The project is https://github.com/intel/pmem-csi. First bring up a Kubernetes cluster without OLM installed and set KUBECONFIG.
Then run:
git clone https://github.com/intel/pmem-csi.git
cd pmem-csi
git checkout v1.0.1
sed -i -e 's/OLM_VERSION=v0.18.3/OLM_VERSION=v0.19.1/' test/start-stop-olm.sh
make _work/bin/operator-sdk-1.6.1 operator-generate-bundle
test/start-stop-olm.sh start # you can ignore errors about `_work/pmem-govm/ssh.0`, that's just for diagnostics
kubectl create ns pmem-csi
TEST_BUILD_PMEM_REGISTRY=localhost:5001 TEST_PMEM_REGISTRY=172.17.42.1:5001 TEST_LOCAL_REGISTRY=172.17.42.1:5001 TEST_LOCAL_REGISTRY_SKIP_TLS=true test/start-operator.sh -olm
This needs a Docker registry. You can use some external one like quay.io. In this example, I am running one on port 5001 of my build machine, which can be reached via 172.17.42.1 from inside the cluster. It doesn't use TLS.
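For reference, a throwaway local registry without TLS can be started with Docker; the port mapping mirrors the 5001 used above, and the container name is arbitrary:

```shell
# Run a disposable, TLS-less Docker registry on host port 5001.
# "local-registry" is just an illustrative name.
docker run -d --name local-registry -p 5001:5000 registry:2
```

Images pushed to `localhost:5001/...` from the build machine are then reachable from inside the cluster via the host's bridge address (172.17.42.1 in the example above).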
Once test/start-operator.sh fails, you need to clean up before trying again:
test/stop-operator.sh -olm
c) Could you please share the details of the deployment so we can check whether there is a reason for the failure as well?
Which details do you need?
Also, could you please share the logs from the InstallPlan if it exists in the cluster?
This is the installplans.operators.coreos.com CRD, right? There is no object of that kind after the failure and also none while operator-sdk is running.
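For reference, the absence of such objects can be confirmed with a cluster-wide listing along these lines:

```shell
# List any InstallPlan and Subscription objects in all namespaces;
# an empty result means OLM never got as far as creating one.
kubectl get installplans.operators.coreos.com --all-namespaces
kubectl get subscriptions.operators.coreos.com --all-namespaces
```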
Hi @pohly,
Following are some comments inline:
The operator supports metrics collection and has a "metrics" port - see the manual deployment YAML. Is that what you mean with "enabled the metrics"?
You enable the metrics in config/default/kustomization.yaml (https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v3/memcached-operator/config/default/kustomization.yaml#L24-L25) in the standard layout. However, looking at your project, you deviated from the proposed layout. Also, please be aware of: https://sdk.operatorframework.io/docs/faqs/#can-i-customize-the-projects-initialized-with-operator-sdk
Prometheus is not installed.
If the Operator is integrated with OLM and the bundle has a PodMonitor
or a ServiceMonitor
(which will be the case when you enable the metrics), the complete InstallPlan
will fail on a cluster that does not have these CRDs / the Prometheus Operator installed. In this case, you might want to declare the requirement as an OLM dependency or make it clear to the Operator's consumers.
If your bundle ships the ServiceMonitor, such as https://github.com/operator-framework/operator-sdk/blob/master/testdata/go/v3/memcached-operator/bundle/manifests/memcached-operator-controller-manager-metrics-monitor_monitoring.coreos.com_v1_servicemonitor.yaml#L1-L2, then the InstallPlan would fail.
That would happen with any CRD/API required for the operator to work that does not exist on the cluster.
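Declaring such a requirement as an OLM dependency can be sketched like this; the file path follows the standard bundle layout, and the GVK values correspond to the ServiceMonitor example above:

```shell
# Sketch: declare the ServiceMonitor API as an OLM dependency so that
# dependency resolution reports the missing requirement up front instead
# of the InstallPlan failing while applying the bundle manifests.
mkdir -p bundle/metadata
cat > bundle/metadata/dependencies.yaml <<'EOF'
dependencies:
  - type: olm.gvk
    value:
      group: monitoring.coreos.com
      kind: ServiceMonitor
      version: v1
EOF
```

With this in place, OLM tries to satisfy the GVK from the available catalogs and fails resolution with an explicit message when nothing provides it.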
Could you please provide the bundle that you are using, and ensure that the Operator image is published in a public place, so that we are able to easily reproduce the issue by running operator-sdk run bundle with it?
This project existed before we started adding an operator, which determined the layout. We understand that this is not how the SDK is normally meant to be used, but rewriting the entire project wasn't ideal either. Thanks for any assistance that you can provide despite the unusual approach.
Looking at what you said about enabling metrics my conclusion is that we don't enable those.
Here's the bundle content:
I pushed the image to docker.io/pohly/pmem-csi-bundle:v1.0.1.
Hi @pohly,
Would it be possible to just add a zip/dir with the bundle content? Otherwise, we need to copy and paste and manually generate it to try to test and see if we can help you out.
Would it be possible to just add a zip/dir with the bundle content? Otherwise, we need to copy and paste and manually generate it to try to test and see if we can help you out.
Do you still need that when the generated image is available (see docker.io/pohly/pmem-csi-bundle:v1.0.1)?
But I can of course also attach the original bundle files: bundle-1.0.1.tar.gz
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
@camilamacedo86 is there anything further that I can do to investigate this?
We've had less luck lately in our periodic CI runs with OLM 0.18.3 because of a new failure (operatorhubio-catalog-lbdrz 0/1 CrashLoopBackOff). This didn't occur earlier. We would update to a newer version, but this issue here is a blocker.
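When a catalog pod crash-loops like that, its previous container logs and the pod events usually show why; the inspection could look roughly like this (pod name taken from the message above):

```shell
# Logs from the last crashed container of the catalog pod,
# plus the pod's events (image pulls, probe failures, OOM kills, ...).
kubectl -n olm logs operatorhubio-catalog-lbdrz --previous
kubectl -n olm describe pod operatorhubio-catalog-lbdrz
```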
Was a subscription actually created? I'm not super familiar with run bundle, but I assume that it's basically generating a catalog source at runtime and then creating a subscription pointing to the catalog it generated. If no InstallPlan was generated, it could be a resolution error or a problem getting content from that catalog.
Was a subscription actually created?
I don't know. How do I check?
Our CI jobs capture the output of all pods, perhaps that would help? Unfortunately the ones with the more recent OLM expired. Let me kick one off once more...
I don't know. How do I check?
The only real way to install an operator is to create a Subscription resource -- that's the entrypoint API used to install an operator with OLM. I'm not an SDK developer (I work on OLM), but I am assuming that this run bundle error is happening because that Subscription is failing. So, if we look at the status of the Subscription, it may give us a better idea of why the install is failing.
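Concretely, that check could look like this; the namespace and subscription name are taken from this thread, so adjust them as needed:

```shell
# Show the Subscription's full status, which usually explains
# why no InstallPlan was created.
kubectl -n pmem-csi get subscription pmem-csi-operator-v100-0-0-sub -o yaml

# Or just the status conditions:
kubectl -n pmem-csi get subscription pmem-csi-operator-v100-0-0-sub \
  -o jsonpath='{.status.conditions}'
```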
They provide the bundle files here: https://github.com/operator-framework/operator-sdk/issues/5410#issuecomment-983007339
@rashmigottipati @jmrodri could we try to check this one? I think we need to try to reproduce the issue and check the Subscription status to find out what is wrong here.
WDYT about adding this one to a milestone to be checked with the latest release?
I tried with operator-sdk 1.19.1 and OLM 0.20.0, with Kubernetes 1.19 and 1.22. The "install plan is not available for the subscription pmem-csi-operator-v100-0-0-sub" occurred for both.
Attached is the log output from https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1071/2/artifact/joblog-jenkins-pmem-csi-PR-1071-2-test-1.22.log
Beware that some output gets dumped repeatedly, for example:
operatorhubio-catalog-6rbbt/registry-server@pmem..ker1: time="2022-04-14T17:05:17Z" level=info msg="serving registry" configs=/configs port=50051
operatorhubio-catalog-6rbbt/registry-server@pmem..ker1: time="2022-04-14T17:05:18Z" level=info msg="shutting down..." configs=/configs port=50051
I also tried with operator-sdk 1.18.0 and OLM 0.20.0. That worked once (Kubernetes 1.22) and failed once (1.19). It seems to be a bit random, but usually it fails reliably.
Update: I've been successful when installing OLM 0.20.0 on a fresh test cluster. The failure only seems to occur when the cluster has been in use for a while, i.e. several other tests not involving OLM ran earlier.
We have the same issue with operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
I could also reproduce with: (SDK master branch)
Following the steps
$ operator-sdk run bundle quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6
INFO[0014] Successfully created registry pod: quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
INFO[0014] Created CatalogSource: hive-operator-catalog
INFO[0014] OperatorGroup "operator-sdk-og" created
INFO[0014] Created Subscription: hive-operator-v2-5-3508-6cb94c6-sub
FATA[0120] Failed to run bundle: install plan is not available for the subscription hive-operator-v2-5-3508-6cb94c6-sub: timed out waiting for the condition
And then, by checking the bundle logs: (kubectl logs pod/quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6)
$ kubectl logs pod/quay-io-operatorhubio-hive-operator-v2-5-3508-6cb94c6
time="2022-05-11T00:46:00Z" level=warning msg="\x1b[1;33mDEPRECATION NOTICE:\nSqlite-based catalogs and their related subcommands are deprecated. Support for\nthem will be removed in a future release. Please migrate your catalog workflows\nto the new file-based catalog format.\x1b[0m"
time="2022-05-11T00:46:00Z" level=info msg="adding to the registry" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]"
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional dependencies file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=info msg="Could not find optional properties file" file=bundle_tmp1603466453/metadata load=annotations with=./bundle_tmp1603466453
time="2022-05-11T00:46:01Z" level=error msg="permissive mode disabled" bundles="[quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6]" error="error loading bundle into db: FOREIGN KEY constraint failed"
Error: error loading bundle into db: FOREIGN KEY constraint failed
Usage:
opm registry add [flags]
Also, we found the same issue when using operator-sdk run bundle-upgrade, see: https://github.com/k8s-operatorhub/community-operators/runs/6364587418?check_suite_focus=true#step:3:7120 (More info: https://github.com/k8s-operatorhub/community-operators/issues/1195 )
It seems to be an issue associated with OPM.
We found a way to reproduce the issue with OPM alone, without the SDK. So, we raised an issue to get it fixed in OPM: https://github.com/operator-framework/operator-registry/issues/952
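For the record, a minimal reproduction without the SDK, using the deprecated sqlite-based opm commands that appear in the registry pod log above, might look like:

```shell
# Add the same bundle to a fresh sqlite index database;
# with the affected opm versions this is expected to fail with
# "error loading bundle into db: FOREIGN KEY constraint failed".
opm registry add \
  -b quay.io/operatorhubio/hive-operator:v2.5.3508-6cb94c6 \
  -d test-index.db
```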
This issue appears to be the scenario clarified and tracked via https://github.com/operator-framework/operator-sdk/issues/5773. To avoid duplication and centralize the info, it seems we can close this one in favour of https://github.com/operator-framework/operator-sdk/issues/5773.
Note that some workarounds were also proposed in https://github.com/operator-framework/operator-sdk/issues/5773. Please check whether the proposed workarounds can help you out. If your problem turns out not to be the same scenario, we would ask you to re-open this issue.
Thank you for your attention and collaboration.
That is not the root cause of the failure that I ran into with PMEM-CSI. If I install OLM on a fresh test cluster, running the bundle works. If I do the same thing after the cluster has been in use for a while, it fails for the same bundle.
I can't tell from the log files (see my earlier comments) what might be going wrong. As I have a workaround (run OLM tests first), I am not going to reopen this issue unless it pops up again.
Bug Report
I also reported this in https://github.com/operator-framework/operator-lifecycle-manager/issues/2454 but as this might also be an issue in operator-sdk, let me also file an issue here.
What did you do?
operator-sdk olm install
operator-sdk run bundle
What did you expect to see?
The operator should start to run.
What did you see instead? Under which circumstances?
This only happens with OLM 0.19.1. The same commands work when installing OLM 0.18.3 with
operator-sdk olm install --version=v0.18.3
. UPDATE: there is some randomness involved and it may depend on cluster load and/or state, see https://github.com/operator-framework/operator-sdk/issues/5410#issuecomment-1099581077 and https://github.com/operator-framework/operator-sdk/issues/5410#issuecomment-1105476692.
Environment
Operator type:
/language go
Kubernetes cluster type:
kubeadm in VMs with Kubernetes 1.21.1
$ operator-sdk version
operator-sdk version: "v1.15.0", commit: "f6326e832a8a5e5453d0ad25e86714a0de2c0fc8", kubernetes version: "1.21", go version: "go1.16.10", GOOS: "linux", GOARCH: "amd64"
$ go version
go version go1.17.2 linux/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-27T08:53:39Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Additional context
I encountered this in PMEM-CSI, tracked there as https://github.com/intel/pmem-csi/issues/1050
More diagnostics:
Note the odd "AllCatalogSourcesHealthy: False". The catalog-operator pod here might be responsible for it (not sure) and reports an error (
E1123 15:36:52.688776 1 queueinformer_operator.go:290] sync {"update" "default/pmem-csi-operator-v100-0-0-sub"} failed: Operation cannot be fulfilled on subscriptions.operators.coreos.com "pmem-csi-operator-v100-0-0-sub": the object has been modified; please apply your changes to the latest version and try again
). This repeats a few times but then stops. Deleting that pod doesn't help; the recreated one has the same problem.
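To dig into the "AllCatalogSourcesHealthy: False" condition, the catalog sources and their backing registry pods can be inspected with something along these lines (catalog source name assumed from the operatorhubio pods above):

```shell
# List catalog sources and check their connection state,
# then look at the registry pods serving them.
kubectl -n olm get catalogsources.operators.coreos.com
kubectl -n olm describe catalogsource operatorhubio-catalog
kubectl -n olm get pods
```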
For comparison, here is the output with OLM 0.18.3. It has the same update error, so that might be a red herring: