openshift / windows-machine-config-operator

Windows MCO for OpenShift that handles addition of Windows nodes to the cluster
Apache License 2.0

Service and Endpoints for the node exporters are not correctly configured #826

Open SSvilen opened 2 years ago

SSvilen commented 2 years ago

The windows node exporter is installed on all windows worker nodes, but the required Service and Endpoints resources are not created at all. There is a Service object created, but it is of type ClusterIP, which won't work in this case. The Service should be of type 'ExternalName' and the Endpoints should be updated by the operator on every node join/deletion operation. For instance:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: windows-exporter
  name: windows-exporter
  namespace: openshift-windows-machine-config-operator
spec:
  type: ExternalName
  ports:
    - name: metrics
      port: 9182
      protocol: TCP
      targetPort: 9182
  externalName: nodexporter
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    name: windows-exporter
  name: windows-exporter
  namespace: openshift-windows-machine-config-operator
subsets:
  - addresses:
      - ip: 1.1.1.1
        targetRef:
          kind: Node
          name: winmach-q84jj
          uid: ab8028e7-a0ed-4f83-89e5-b577be2231ed
      - ip: 1.1.1.1
        targetRef:
          kind: Node
          name: winmach-t5vgm
          uid: 1b710328-88d5-4142-a78f-dd414705cc19
    ports:
      - name: metrics
        port: 9182
        protocol: TCP

mansikulkarni96 commented 2 years ago

@SSvilen thanks for the provided information.

As you can see in manifests/windows-exporter_v1_service.yaml, the type is not set to ClusterIP. The Service and the Endpoints object must both be named windows-exporter, as that name is used to get the resources in the operator code.

I suspect monitoring is not enabled in the operator namespace. Please ensure the label openshift.io/cluster-monitoring=true is present on the openshift-windows-machine-config-operator namespace; it is required for WMCO to create monitoring resources in that namespace. If it is not enabled, you will see a message like "install the prometheus-operator to enable Prometheus configuration" in the WMCO logs. Community Operators have a checkbox to enable monitoring in the operator namespace. If you are building from source, you can set the label with oc label ns openshift-windows-machine-config-operator openshift.io/cluster-monitoring=true --overwrite. Let us know if that resolves the issue!

SSvilen commented 2 years ago

@mansikulkarni96,

Monitoring is enabled for the namespace. The problem is that the node exporter is installed directly on the Windows worker nodes and does not run as a pod, as it does on Linux-based nodes, so the Prometheus operator cannot properly discover the endpoint for that ServiceMonitor. I had to recreate the Service and manually create the Endpoints object, which in turn points to the Windows nodes. Or am I overthinking this?

mansikulkarni96 commented 2 years ago

@SSvilen Thanks for confirming that. The behaviour you see is expected: windows-exporter runs as a Windows service on the Windows nodes, which differs from its Linux counterpart for support reasons. The Prometheus operator should still be able to discover the endpoint; take a look at service_monitor.yaml to see how the relabelings are applied to make the endpoint discoverable. What you are expecting is exactly what the operator does: it updates the Endpoints object on every node join/deletion operation (more details in metrics.go if you are interested in the code base). If you could provide the Windows Machine Config Operator logs, the exact operator version, the OCP version, and the steps followed to reach this point, I should be able to help you out further.
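
The node join/deletion handling described above can be pictured with a small Python sketch. It is only an illustration of the idea, not the operator's Go code in metrics.go: the function names and the plain-dict node shape are assumptions, while the object name, namespace, and port come from the manifests earlier in this thread.

```python
from typing import Any, Dict, List, Optional

# Name, namespace, and port are taken from the manifests earlier in this thread.
EXPORTER_NAME = "windows-exporter"
NAMESPACE = "openshift-windows-machine-config-operator"
METRICS_PORT = 9182


def node_internal_ip(node: Dict[str, Any]) -> Optional[str]:
    """Return the first InternalIP in node.status.addresses, or None."""
    for addr in node.get("status", {}).get("addresses") or []:
        if addr.get("type") == "InternalIP" and addr.get("address"):
            return addr["address"]
    return None


def build_exporter_endpoints(nodes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build an Endpoints manifest with one address per Windows node that has an InternalIP."""
    addresses = []
    for node in nodes:
        meta = node.get("metadata", {})
        if (meta.get("labels") or {}).get("kubernetes.io/os") != "windows":
            continue  # only Windows nodes run windows-exporter
        ip = node_internal_ip(node)
        if ip is None:
            continue  # a node without an InternalIP cannot be scraped yet
        addresses.append(
            {"ip": ip, "targetRef": {"kind": "Node", "name": meta.get("name")}}
        )
    port = {"name": "metrics", "port": METRICS_PORT, "protocol": "TCP"}
    subsets = [{"addresses": addresses, "ports": [port]}] if addresses else []
    return {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": {"name": EXPORTER_NAME, "namespace": NAMESPACE},
        "subsets": subsets,
    }
```

Rebuilding the object from the current node list on every node event keeps it correct after both joins and deletions, at the cost of recomputing the whole address list each time.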

SSvilen commented 2 years ago

OK, I see what's happening.

controller.windowsmachine    invalid Machine    {"name": "winmach-t5vgm", "error": "no internal IP address associated",

and based on the code in metrics.go an internal IP address is expected.

The status field of the machine shows type 'InternalDNS'

status:
  addresses:
   - address: winmach-q84jj
     type: InternalDNS

I'm not sure why that is.
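
To make the mismatch concrete: judging by the error text, the check presumably scans the Machine's status.addresses for an entry of type InternalIP and fails when none is present. A rough Python sketch of that logic follows; the function name, exception class, and dict shape are illustrative assumptions, not the operator's actual Go code.

```python
from typing import Any, Dict


class NoInternalIPError(Exception):
    """Raised when a Machine reports no address of type InternalIP."""


def get_internal_ip(machine: Dict[str, Any]) -> str:
    """Return the first InternalIP listed in machine.status.addresses."""
    for addr in machine.get("status", {}).get("addresses") or []:
        if addr.get("type") == "InternalIP" and addr.get("address"):
            return addr["address"]
    # Same wording as the error seen in the operator log above.
    raise NoInternalIPError("no internal IP address associated")


# A Machine shaped like the status shown above: only an InternalDNS entry.
dns_only_machine = {
    "metadata": {"name": "winmach-q84jj"},
    "status": {"addresses": [{"address": "winmach-q84jj", "type": "InternalDNS"}]},
}

if __name__ == "__main__":
    try:
        get_internal_ip(dns_only_machine)
    except NoInternalIPError as err:
        print(f"invalid Machine: {err}")
```

With only an InternalDNS entry in the status, a lookup like this has nothing to return, which would match the "invalid Machine" message.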

mansikulkarni96 commented 2 years ago

@SSvilen can you provide details about the WMCO version, cloud provider, OCP version and the Windows Server version used for the VM? The support matrix is described in "Supported Cloud Providers based on OKD/OCP Version and WMCO version" and "Supported Windows Server versions".

SSvilen commented 2 years ago

@mansikulkarni96 ,

WMCO 3.1, OCP 4.8, Windows 20H2

It would also be helpful to have a bit more logging, for instance here. That would make troubleshooting easier.

mansikulkarni96 commented 2 years ago

@SSvilen logging info noted. According to your comment, the Windows worker node is present; was it added using WMCO? If yes, then the "no internal IP address associated" error should have resolved on its own, as the IP address is required not just for metrics but also for the SSH connection to the VM. Could you provide some more information for further debugging:

  1. Cloud provider information: is it vmware vSphere?
  2. Node configuration method used here, provide info from one of the two:
     - BYOH
     - machinesSet
  3. Full output of oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
  4. Windows MachineSet yaml/ configMap yaml depending on the Node configuration method used.
  5. Output of oc get network.operator cluster -o yaml

SSvilen commented 2 years ago

@mansikulkarni96 ,

> 1. Cloud provider information: is it vmware vSphere?

Yes.

> 2. Node configuration method used here, provide info from one of the two:
> - BYOH
> - machinesSet

machinesSet

network.txt operatorlogs.txt machineSet.txt

Thanks!

mansikulkarni96 commented 2 years ago

@SSvilen Thanks for providing the logs. From the operator logs I can see that the IP address cannot be found to configure the Windows machine into a node. You should see the same issue if you oc describe the Machine object it is trying to configure. I suspect it has to do with the golden image creation for vSphere. Please make sure you have followed all the steps described in vsphere-golden-image.md.

SSvilen commented 2 years ago

@mansikulkarni96,

ok thanks. We'll look at it again.

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

MattPOlson commented 2 years ago

> @mansikulkarni96,
>
> ok thanks. We'll look at it again.

I'm seeing the same issue on vsphere, did you ever figure anything out?

mansikulkarni96 commented 2 years ago

@MattPOlson Can you provide details about your setup, including those requested in this comment, so I can help you further?

SSvilen commented 2 years ago

> > @mansikulkarni96, ok thanks. We'll look at it again.
>
> I'm seeing the same issue on vsphere, did you ever figure anything out?

You need a working reverse DNS - during the addition of the windows worker node, the operator creates the endpoints.
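
One way to check the reverse DNS precondition mentioned here from any host on the same network is a PTR lookup on the node's IP. The Python sketch below is only a diagnostic aid under that assumption: the helper name reverse_name is made up for illustration, and the lookup function is injectable so the logic can be exercised without touching a real resolver.

```python
import socket
from typing import Callable, Optional


def reverse_name(ip: str, lookup: Callable = socket.gethostbyaddr) -> Optional[str]:
    """Return the hostname that a reverse lookup yields for ip, or None if it fails."""
    try:
        hostname, _aliases, _addresses = lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    return hostname
```

Calling reverse_name with a Windows node's IP should return that node's hostname when reverse DNS is set up; a None result points at a missing PTR record.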

MattPOlson commented 2 years ago

> @MattPOlson Can you provide details about your setup, including those requested in this comment, so I can help you further?

The cluster is running in vsphere and we are using machinesets to provision the servers. If I change the service to be of type 'ExternalName' and create an endpoint that includes the node, it works fine; it's just not happening automatically like it should.

MattPOlson commented 2 years ago

> > > @mansikulkarni96, ok thanks. We'll look at it again.
> >
> > I'm seeing the same issue on vsphere, did you ever figure anything out?
>
> You need a working reverse DNS - during the addition of the windows worker node, the operator creates the endpoints.

Reverse DNS lookup works fine in our network, the internal IP still isn't being populated on the machine so the endpoint isn't being created.

ping -a 10.33..

Pinging k8s-se-** [10.33..] with 32 bytes of data:
Reply from 10.33..: bytes=32 time=2ms TTL=121
Reply from 10.33..: bytes=32 time=2ms TTL=121

SSvilen commented 2 years ago

@MattPOlson,

what do the logs from the operator say when you add a new machine? Are they BYOH or do you provision with machine sets?

MattPOlson commented 2 years ago

> @MattPOlson,
>
> what do the logs from the operator say when you add a new machine? Are they BYOH or do you provision with machine sets?

It's throwing this error. I'm trying to figure out where/how in the code the operator gets the external IP address. They are provisioned as machine sets.

DEBUG controller.windowsmachine invalid Machine {"name": "k8s-se-platform-01-bq57b-win-lprdv", "error": "no internal IP address associated", "errorVerbose": "no internal IP address associated\ngithub.com/openshift/windows-machine-config-operator/controllers.getInternalIPAddress\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:523\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).isValidMachine\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:203\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).SetupWithManager.func2\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:114\nsigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Update\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/predicate/predicate.go:87\nsigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnUpdate\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:88\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}

mansikulkarni96 commented 2 years ago

@MattPOlson can you add the full WMCO log snippet? Those are just the initial debug logs, which should resolve themselves once the IP for the machine is available.

MattPOlson commented 2 years ago

@mansikulkarni96 sure. windows-machine-config-operator-8dc56cbb7-wfhdh-manager.log

MattPOlson commented 2 years ago

Any updates on this? I feel like this is either a legit issue or something isn't documented correctly as far as the setup goes. I looked through the code but I can't figure out why the internal IP still isn't being populated on the machine, which is why the endpoint isn't being created.

saifshaikh48 commented 2 years ago

@MattPOlson can I ask what OCP and WMCO version you are using? In the log you shared, I see some failures to watch/get the OperatorCondition k8s resource. The fix for this was backported to WMCO 3.1.1 and 4.0.1 for OCP 4.8 and 4.9 respectively.

MattPOlson commented 2 years ago

@saifshaikh48 sure:
operator: community-windows-machine-config-operator.v4.0.1
cluster: 4.9.0-0.okd-2022-02-12-140851

saifshaikh48 commented 2 years ago

Interesting, that version should have the proper permissions.

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

MattPOlson commented 2 years ago

This is still an issue in version 5.1.1. I have to update the endpoint manually to get any metrics back from the windows nodes.

/reopen

openshift-ci[bot] commented 2 years ago

@MattPOlson: You can't reopen an issue/PR unless you authored it or you are a collaborator.

sebsoto commented 2 years ago

I'll look into this today

/reopen

openshift-ci[bot] commented 2 years ago

@sebsoto: Reopened this issue.

MattPOlson commented 2 years ago

windows-machine-config-operator-76cd78c4f5-45kv9-manager.log

sebsoto commented 2 years ago

Seeing

1.657650479492904e+09   DEBUG   events  Warning {"object": {"kind":"Namespace","name":"openshift-windows-machine-config-operator","uid":"6fabb20a-a268-4c58-8fc7-30e887bb7dce","apiVersion":"v1","resourceVersion":"27258196"}, "reason": "labelValidationFailed", "message": "Cluster monitoring openshift.io/cluster-monitoring=true label is not enabled in openshift-windows-machine-config-operator namespace"}

and

1.6576521713493032e+09  INFO    metrics install the prometheus-operator to enable Prometheus configuration

in the logs, but the namespace has the correct openshift.io/cluster-monitoring=true label on it.

sebsoto commented 2 years ago

WMCO checks whether metrics are enabled on the namespace it is deployed in only at startup. If metrics are enabled or disabled while WMCO is running, WMCO ignores the change.

Thinking about two potential options to fix this:
1) WMCO watches the namespace and enables/disables its metrics functionality depending on the label
2) WMCO checks the namespace label anytime it needs to reconcile the endpoint object
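
To illustrate the difference, here is a minimal Python sketch that contrasts the startup-only check with option 2. It is not WMCO code: the class names and the get_namespace callback are hypothetical stand-ins for however the operator reads the Namespace object, and only the label key and value come from this thread.

```python
from typing import Any, Callable, Dict

MONITORING_LABEL = "openshift.io/cluster-monitoring"


def monitoring_enabled(namespace: Dict[str, Any]) -> bool:
    """True when the namespace carries openshift.io/cluster-monitoring=true."""
    labels = namespace.get("metadata", {}).get("labels") or {}
    return labels.get(MONITORING_LABEL) == "true"


class StartupCheckedMetrics:
    """Startup-only behaviour: the label is read once and the result is cached."""

    def __init__(self, get_namespace: Callable[[], Dict[str, Any]]):
        self._enabled = monitoring_enabled(get_namespace())

    def should_reconcile_endpoints(self) -> bool:
        return self._enabled  # later label changes are never seen


class ReconcileCheckedMetrics:
    """Option 2: the label is re-read every time the endpoints are reconciled."""

    def __init__(self, get_namespace: Callable[[], Dict[str, Any]]):
        self._get_namespace = get_namespace

    def should_reconcile_endpoints(self) -> bool:
        return monitoring_enabled(self._get_namespace())
```

In this sketch, labeling the namespace after startup changes the answer only for ReconcileCheckedMetrics. Option 1 would reach the same outcome through a watch instead of a read on each reconcile.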

mtnbikenc commented 2 years ago

/remove-lifecycle rotten

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sebsoto commented 1 year ago

This can be solved through https://issues.redhat.com/browse/WINC-545

sebsoto commented 1 year ago

/remove-lifecycle stale

sebsoto commented 1 year ago

/lifecycle frozen