redhat-cop / global-load-balancer-operator

A global load balancer operator for OpenShift
Apache License 2.0

Global Load Balancer Pod is crashing #26

Closed makaraju closed 3 years ago

makaraju commented 3 years ago

Hi @raffaelespazzoli ,

We are trying to use the Global Load Balancer operator in our environment. When we try to create a GlobalDNSRecord, the operator pod crashes. Please help us figure out the issue. We traced through the code and found that it is failing in getIPs. Please see the log below.

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1800cd4]

goroutine 263 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x19d1e80, 0x2a188c0)
    /usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*EndpointStatus).getIPs(0xc006ee9b00, 0xc006eecfd0, 0xc006ee7f98, 0xc0049bb320, 0x0, 0x0)
    /workspace/controllers/globaldnsrecord/endpointstatus.go:41 +0x174
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).getAWSTrafficPolicyDocument(0xc0001ae230, 0xc000634340, 0xc006eecfd0, 0xc002b18338, 0x0, 0x0, 0x0, 0x0)
    /workspace/controllers/globaldnsrecord/route53provider.go:520 +0x29aa
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).createAWSTrafficPolicy(0xc0001ae230, 0xc000634340, 0xc006eecfd0, 0xc002b18338, 0x0, 0x0, 0x1a7cac0, 0x17fe51f)
    /workspace/controllers/globaldnsrecord/route53provider.go:322 +0x85
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).ensureRoute53TrafficPolicy(0xc0001ae230, 0xc000634340, 0xc002b18338, 0xc006eecfd0, 0xc002b18338, 0x0, 0x0, 0xc00430f7c0)
    /workspace/controllers/globaldnsrecord/route53provider.go:201 +0xa45
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).createRoute53Record(0xc0001ae230, 0x1ecd620, 0xc000930270, 0xc000634340, 0xc006eca160, 0xc006eecfd0, 0xa, 0xc006ef62c0, 0x1d, 0xc006ec0db0)
    /workspace/controllers/globaldnsrecord/route53provider.go:45 +0x371
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).Reconcile(0xc0001ae230, 0x1ecd620, 0xc000930270, 0xc00034b160, 0x1d, 0xc00034b0e0, 0x20, 0xc000930270, 0x40a1bf, 0xc000030000, ...)
    /workspace/controllers/globaldnsrecord/globaldnsrecord_controller.go:166 +0xf17
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ba2140, 0x1ecd560, 0xc00035a000, 0x1a41f00, 0xc000ce05e0)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263 +0x317
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ba2140, 0x1ecd560, 0xc00035a000, 0x0)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1ecd560, 0xc00035a000)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0009db750)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc006eedf50, 0x1e8f720, 0xc0009300c0, 0xc00035a001, 0xc000844300)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0009db750, 0x3b9aca00, 0x0, 0x1, 0xc000844300)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1ecd560, 0xc00035a000, 0xc0009200a0, 0x3b9aca00, 0x0, 0x1cd4801)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1ecd560, 0xc00035a000, 0xc0009200a0, 0x3b9aca00)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:195 +0x4e7

Kindly let me know, if you need any information.

Regards, Hari

raffaelespazzoli commented 3 years ago

@makaraju thanks for reporting this issue. I think the cause is pretty clear and I know how to fix it. What I can't figure out is why not every installation is failing like that. So let me ask you a few questions:

  1. which version of the operator are you running?
  2. did you change the default log level?
  3. which dns option are you using?
  4. can you share the GlobalDNSRecord that is causing the issue?

raffaelespazzoli commented 3 years ago

A tentative fix is here: https://github.com/raffaelespazzoli/global-load-balancer-operator/tree/fix%2326. Would you be able to confirm that it works by building the operator from that branch? Please follow the local development instructions: https://github.com/redhat-cop/global-load-balancer-operator#running-the-operator-locally
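
For readers following along: the change made on that branch is not shown in this thread. The Go sketch below only illustrates the general pattern for avoiding this kind of panic, which is to check a pointer for nil before dereferencing it and to return an error instead. All type, field, and function names here (`endpointState`, `resolvedService`, `ingressIPs`) are hypothetical stand-ins, not the operator's real code.

```go
package main

import (
	"errors"
	"fmt"
)

// resolvedService is a hypothetical stand-in for the service object an
// endpoint lookup would return. It is not the operator's real type.
type resolvedService struct {
	IngressIPs []string
}

// endpointState is hypothetical: svc stays nil when the lookup found nothing.
type endpointState struct {
	clusterName string
	svc         *resolvedService
}

var errNoService = errors.New("endpoint has no resolved service")

// ingressIPs returns the endpoint's IPs, or an error when there is nothing
// to read. Without the nil checks, e.svc.IngressIPs would panic with
// "invalid memory address or nil pointer dereference" for a nil svc.
func (e *endpointState) ingressIPs() ([]string, error) {
	if e == nil || e.svc == nil {
		return nil, errNoService
	}
	return e.svc.IngressIPs, nil
}

func main() {
	unresolved := &endpointState{clusterName: "cluster-1"}
	if _, err := unresolved.ingressIPs(); err != nil {
		fmt.Println("skipping endpoint:", err)
	}

	resolved := &endpointState{
		clusterName: "cluster-2",
		svc:         &resolvedService{IngressIPs: []string{"203.0.113.10"}},
	}
	ips, _ := resolved.ingressIPs()
	fmt.Println("ips:", ips)
}
```

Returning an error lets the reconcile loop report the unresolved endpoint and retry later, rather than crashing the whole pod.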

makaraju commented 3 years ago

@raffaelespazzoli thanks for the quick response. Let me deploy and test again.

makaraju commented 3 years ago

Hi @raffaelespazzoli

It got past that error, but now the operator is unable to create the traffic policy. I am getting the error below.

2021-03-02T15:41:31.440-0600    ERROR   controllers.GlobalDNSRecord unable to create    {"network policy": "{\n  Document: \"{\\\"AWSPolicyFormatVersion\\\":\\\"2015-10-01\\\",\\\"RecordType\\\":\\\"A\\\",\\\"StartRule\\\":\\\"main\\\",\\\"Rules\\\":{\\\"main\\\":{\\\"RuleType\\\":\\\"multivalue\\\"}}}\",\n  Name: \"global-load-balancer-operator/route53-multivalue-global-record\"\n}", "error": "InvalidTrafficPolicyDocument: At least one endpoint must be declared.;main: Multivalue rules must specify at least two items.\n\tstatus code: 400, request id: 8896b985-bc02-4cb2-a16a-1da6fcfb8ffb"}
github.com/go-logr/zapr.(*zapLogger).Error
    /Users/hmakara/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
github.com/redhat-cop/global-load-balancer-operator/controllers/globaldnsrecord.(*GlobalDNSRecordReconciler).createAWSTrafficPolicy

I have a few questions.

  1. Do we need to create the traffic policy ourselves, or will the operator take care of it automatically?
  2. If it is created automatically, why doesn't the policy have any endpoints?

To answer your questions above:

  1. which version of the operator are you running? v1.0.0
  2. did you change the default log level? No
  3. which dns option are you using? AWS Route53
  4. can you share the GlobalDNSRecord that is causing the issue?
    apiVersion: redhatcop.redhat.io/v1alpha1
    kind: GlobalDNSRecord
    metadata:
      name: route53-multivalue-global-record
    spec:
      name: multivalue.<route53 domain name>
      endpoints:
      - clusterName: cluster-1
        clusterCredentialRef:
          name: glb-local
          namespace: global-load-balancer-operator
        loadBalancerServiceRef:
          name: argo-server
          namespace: argo
      - clusterName: cluster-2
        clusterCredentialRef:
          name: glb-remote
          namespace: global-load-balancer-operator
        loadBalancerServiceRef:
          name: argo-server
          namespace: argo
      ttl: 60
      loadBalancingPolicy: Multivalue
      globalZoneRef:
        name: route53-global-dns-zone

raffaelespazzoli commented 3 years ago

@makaraju can you share the

      name: argo-server
      namespace: argo

service?

"Do we need to create traffic policy or operator will take care automatically?" It will be created automatically.

"If it will create automatically then why the policy doesn't have any endpoints?" The operator thinks you don't have any endpoints; that's why I want to see the service.

makaraju commented 3 years ago

@raffaelespazzoli argo-server is the service name. This service is pointed to the routes.

raffaelespazzoli commented 3 years ago

can I see the yaml? what does it mean that the service is pointed to the routes? Can I also see those routes' yaml?

makaraju commented 3 years ago

Service yaml:

apiVersion: v1
kind: Service
metadata: 
  name: argo-server
  namespace: argo
spec: 
  ports:
    - name: web
      port: 2746
      targetPort: 2746
  selector:
    app: argo-server

routes yaml:

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: argo-wf-ui
  namespace: argo
spec:
  host: >-
    argo-wf-ui-argo.<cluster-name>.<route53 domain>
  to:
    kind: Service
    name: argo-server
    weight: 100
  port:
    targetPort: web
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
  wildcardPolicy: None

raffaelespazzoli commented 3 years ago

So this is the problem: the loadBalancerServiceRef field must reference a LoadBalancer service, not an internal cluster service. The operator uses that service to discover the externally exposed endpoint. In this case you need to reference the LoadBalancer service that fronts the routers. Alternatively, and perhaps more easily, you could configure global route autodiscovery and simply annotate that route as global. Also, the route is wrong: its host needs to be in the global domain you created, but it currently points to the cluster-local domain.
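
To illustrate why the earlier traffic policy came out with no endpoints, here is a minimal Go sketch of endpoint discovery. It assumes that only a service of type LoadBalancer carries externally reachable ingress addresses in its status, as in the Kubernetes Service API. The `svcInfo` and `lbIngress` types are simplified, hypothetical mirrors of those API fields, `externalAddresses` is not the operator's actual function, and the `example-router-lb` name and `lb.example.com` hostname are made up.

```go
package main

import "fmt"

// lbIngress mirrors the IP/Hostname pair found in a Service's
// status.loadBalancer.ingress list (simplified, hypothetical type).
type lbIngress struct {
	IP       string
	Hostname string
}

// svcInfo is a simplified, hypothetical view of a Kubernetes Service.
type svcInfo struct {
	Name    string
	Type    string // e.g. "ClusterIP" or "LoadBalancer"
	Ingress []lbIngress
}

// externalAddresses returns the externally reachable addresses of a service.
// Anything that is not a LoadBalancer service yields an empty list.
func externalAddresses(s *svcInfo) []string {
	addrs := []string{}
	if s == nil || s.Type != "LoadBalancer" {
		return addrs
	}
	for _, in := range s.Ingress {
		switch {
		case in.IP != "":
			addrs = append(addrs, in.IP)
		case in.Hostname != "":
			addrs = append(addrs, in.Hostname)
		}
	}
	return addrs
}

func main() {
	// The internal service from this thread: no external address to publish.
	internal := &svcInfo{Name: "argo-server", Type: "ClusterIP"}
	fmt.Println("argo-server addresses:", len(externalAddresses(internal)))

	// A hypothetical LoadBalancer service in front of the routers.
	router := &svcInfo{
		Name:    "example-router-lb",
		Type:    "LoadBalancer",
		Ingress: []lbIngress{{Hostname: "lb.example.com"}},
	}
	fmt.Println("router addresses:", externalAddresses(router))
}
```

With two endpoints that both reference an internal service, a discovery step like this finds nothing to publish, which is consistent with the "At least one endpoint must be declared" error that Route 53 returned above.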

makaraju commented 3 years ago

Thanks for the information. I just have one question.

  1. Can GLB be used for internal services exposed through OCP routes, instead of creating a load balancer service?

raffaelespazzoli commented 3 years ago

No, that does not make sense: an external global load balancer has no visibility of services that are exposed only in the SDN.

raffaelespazzoli commented 3 years ago

may I close this?