Prometheus installer package does not appear to be working (vSphere)

cormachogan commented 3 years ago

Bug Report

Working my way through the various TCE package installations, I wanted to go through the steps to set up Prometheus. I deployed a management cluster on vSphere, and then created a workload cluster (1 cp, 1 worker).

% tanzu cluster list --include-management-cluster
  NAME          NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES       PLAN
  tce-workload  default     running  1/1           1/1      v1.20.4+vmware.1  <none>      dev
  vcsa06-octoc  tkg-system  running  1/1           1/1      v1.20.4+vmware.1  management  dev

First tried installing Prometheus without changing any of the configuration. No errors reported, but no objects (namespace, pods, svcs) were created in my cluster.

I then exported the Prometheus config to review it. It had the required namespace and replica entries as per the docs - https://quirky-franklin-8969be.netlify.app/docs/latest/prometheus-config/. The only thing I noticed was that StorageClassName was not set, so I tried deploying Prometheus both with the default settings and with StorageClassName set to "default". Neither attempt resulted in any Prometheus objects getting created in the cluster.

To verify that the cluster was working successfully, I installed the fluent-bit package. This successfully created objects on the cluster so (a) I am possibly missing a pre-req step for Prometheus to successfully deploy (which might mean a doc update) or (b) there is an issue with the Prometheus package deployment.

As an aside, there seems to be no way to monitor a package deployment. We really need to have some way of monitoring what is happening during a package install to make troubleshooting possible.
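One possible way to get some visibility, sketched here under assumptions: if the package install is backed by a kapp-controller App resource (`apps.kappctrl.k14s.io`), its status can be read with plain kubectl. The helper names `list_apps` and `app_status` are made up for illustration, and the `status.friendlyDescription` field is assumed to be populated by the kapp-controller version in use.

```shell
# List every kapp-controller App resource in the cluster, to find the one
# that backs the package being installed (assumes the App CRD is present).
list_apps() {
  kubectl get apps.kappctrl.k14s.io --all-namespaces
}

# Print the one-line reconcile summary recorded on a single App resource.
# $1 = App name, $2 = namespace; both are supplied by the caller.
app_status() {
  app="$1"
  namespace="$2"
  kubectl get apps.kappctrl.k14s.io "$app" \
    --namespace "$namespace" \
    --output jsonpath='{.status.friendlyDescription}'
}
```

For example, `list_apps` followed by `app_status <app-name> <namespace>` with the values it reports. If the summary is empty, dumping the whole resource with `--output yaml` and reading its `status` section is the fallback.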

Expected Behavior

I expected objects related to prometheus including alert manager to appear in my workload cluster.

I monitored the kapp-controller logs during the deployment, but I could see nothing out of the ordinary, nor were any errors displayed.

Steps to Reproduce the Bug

On a vSphere based workload cluster, run:

tanzu package install prometheus.tce.vmware.com

Environment Details

TCE version
```
v0.5.0
```

tanzu version

version: v1.3.0
buildDate: 2021-06-03
sha: b261a8b

kubectl version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+vmware.1", GitCommit:"d475bbd9e7cd66c6db7069cb447766daada65e3b", GitTreeState:"clean", BuildDate:"2021-02-22T22:15:46Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Operating System (client): macOS Big Sur version 11.4

stmcginnis commented 3 years ago

I wonder if this is related to https://github.com/vmware-tanzu/tce/issues/754

cormachogan commented 3 years ago

> I wonder if this is related to #754

I don't think so. I'm not seeing anything created on my cluster. I did try with a default StorageClass, but I still don't see any attempt to create Prometheus objects on my cluster. I'm redeploying a new cluster now just to confirm the same behaviour.

cormachogan commented 3 years ago

Did a fresh cluster deployment - same behaviour unfortunately.

cormachogan commented 3 years ago

Found the issue - Prometheus is relying on an Ingress. Once I deployed the Contour package, Prometheus began to deploy. Probably worth adding some pre-req section to the docs stating that this is a requirement.

stmcginnis commented 3 years ago

Nice work tracking that down!

There is some design work going on right now to be able to define package dependencies. So basically having something like yum or apt where asking to install one package can be smart enough to know that it needs to pull in other packages to actually get something workable.

That work is happening in vmware-tanzu/tanzu-framework. Just mentioning it here as a breadcrumb for us to track that down and verify it meets the needs to address the issue called out here.

cormachogan commented 3 years ago

Yeah - that would be very useful @stmcginnis

jpmcb commented 3 years ago

Very useful thread! Thanks for tracking this down

cc: @LukeWinikates @akodali18 @hillrw3

cormachogan commented 3 years ago

FYI, Contour Ingress has a requirement on a LoadBalancer service, so you will also need to have something like the NSX ALB integrated with the workload cluster for this to work.

So the dependencies are Prometheus -> Ingress -> Load Balancer Service.

It would be useful to have this linked in some way, or at least prompts to say that there are dependencies.
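To illustrate the chain above, here is a minimal sketch that derives an install order from a list of dependency pairs using the standard `tsort` utility. The labels `load-balancer`, `contour`, and `prometheus` are illustrative names for the three layers, not package identifiers, and this is not how the tanzu CLI resolves anything today.

```shell
# Each line is "prerequisite dependent". The names are illustrative labels
# for the chain Prometheus -> Ingress -> Load Balancer Service.
deps='load-balancer contour
contour prometheus'

# tsort prints a topological order: every prerequisite appears before
# the item that depends on it.
install_order() {
  printf '%s\n' "$1" | tsort
}

install_order "$deps"
```

Because the chain is linear there is only one valid order, so the load balancer comes first, then Contour, then Prometheus.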

qnetter commented 3 years ago

We are not currently planning on supporting NSX ALB at the MVP release. We are working to get MetalLB included.

cormachogan commented 3 years ago

That will make things much easier - thanks for the update

LukeWinikates commented 3 years ago

@cormachogan if you set ingress.enabled to false when installing the prometheus package, does the install then succeed for you?

In theory, the dependency on an ingress is "soft" in that if you opt out of the ingress you shouldn't need any of the ingress-related dependencies. If you don't need to access prometheus outside of your cluster, it should be fine to opt out of the ingress.

The questions in my mind are:

- Should the ingress be disabled by default instead of enabled by default?
- What's the best way to document the conditional dependency? It seems like we should call it out at the top of the README.md.

LukeWinikates commented 3 years ago

We are planning to open a new PR that will:

- call out the contour dependency for the prometheus ingress to work
- make the ingress off by default (opt-in instead of opt-out), with instructions to use port-forward for any ad-hoc prometheus ui access needs.
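As a rough sketch of what the opt-out plus port-forward workflow could look like: the snippet below writes a values file with the ingress disabled and defines a helper for ad-hoc UI access. It assumes the dotted key `ingress.enabled` maps to nested YAML. The file name `prometheus-values.yaml` and the helper `prometheus_port_forward` are made up for illustration, and the namespace, service name, and ports are left as arguments because they depend on how the package is deployed. How the values file is passed to `tanzu package install` depends on the CLI version, so that step is left out.

```shell
# Write a values file that opts out of the ingress. The key comes from the
# discussion above ("ingress.enabled"); the file name is arbitrary.
cat > prometheus-values.yaml <<'EOF'
ingress:
  enabled: false
EOF

# Forward a local port to the Prometheus service for ad-hoc UI access.
# $1 = namespace, $2 = service name, $3 = local port, $4 = service port.
# All four are placeholders supplied by the caller.
prometheus_port_forward() {
  namespace="$1"
  service="$2"
  local_port="$3"
  service_port="$4"
  kubectl port-forward --namespace "$namespace" \
    "service/$service" "$local_port:$service_port"
}
```

With the forward running, the UI would be reachable at `http://localhost:<local port>` without any ingress or load balancer in the cluster.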

cormachogan commented 3 years ago

I think setting it to false is a good idea @LukeWinikates, as the current behaviour of just failing silently is not a good user experience. I think the idea of making it optional, and then adding the port forward instructions, is a good one as well.

I would add as much info as possible into the configuration file as well, ideally linking to the official docs at GA.

LukeWinikates commented 3 years ago

Thanks @cormachogan for opening this issue and thanks @jpmcb for @ing us so that we saw it in a timely fashion.

The one thing that I think would still benefit from some attention is @cormachogan's description of how the errors preventing the deployment from succeeding aren't surfaced in any obvious way. I've heard that feedback about kapp-controller before. I wonder if it would make sense for the tanzu package install command to do something like:

- block until the deployment has reconciled or failed (maybe with an optional flag? like --watch or --follow)
- print a message with a different command that the user can run to watch the reconciliation process that way instead. Something similar to running kubectl apply -f to update a deployment and then kubectl rollout status to watch it happen.
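Until something like that exists in the CLI, a similar effect might be achievable with `kubectl wait`. This is only a sketch: it assumes the install is represented by a kapp-controller App resource and that the App reports a `ReconcileSucceeded` condition in `status.conditions`. The helper name `wait_for_app` and the 5 minute default timeout are arbitrary choices.

```shell
# Block until the App backing a package install reports a successful
# reconcile, or until the timeout expires.
# $1 = App name, $2 = namespace, $3 = optional timeout (default 5m).
wait_for_app() {
  app="$1"
  namespace="$2"
  timeout="${3:-5m}"
  kubectl wait "apps.kappctrl.k14s.io/$app" \
    --namespace "$namespace" \
    --for=condition=ReconcileSucceeded \
    --timeout="$timeout"
}
```

Note that this only returns early on success. A failed reconcile would simply run into the timeout, so it is not a full substitute for a CLI flag that reports failure as well.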
