litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

x509: certificate signed by unknown authority error when starting subscriber #3528

Open johnqa opened 2 years ago

johnqa commented 2 years ago

What happened:

I installed Litmus with the Helm chart and logged in to the Portal.

The self-agent is in Pending, and the litmusportal-subscriber pod fails with the error "failed to confirm cluster", "Post \"https://litmusdns/backend/query\": x509: certificate signed by unknown authority"

My ingress spec looks like this:

spec:
  rules:
  - host: litmusdns
    http:
      paths:
      - backend:
          service:
            name: litmus-frontend-service
            port:
              number: 9091
        path: /
        pathType: ImplementationSpecific
      - backend:
          service:
            name: litmus-server-service
            port:
              number: 9002
        path: /backend/(.*)
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - litmusdns

What could be causing this error?

Thank you, John

gdsoumya commented 2 years ago

@johnqa if you are using custom domains/hosts with self-signed certs, you need to either configure Litmus with the TLS cert or use the SSL skip feature to skip SSL/TLS verification. Alternatively, you can remove TLS if you don't have a certificate configured.
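
The two options mentioned above (trusting the certificate versus skipping verification) can be sketched as follows. This is a minimal Python illustration of general TLS client behaviour, not Litmus code; the function name build_tls_context and its parameters are hypothetical.

```python
import ssl
from typing import Optional


def build_tls_context(skip_verify: bool = False,
                      ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context (hypothetical helper for illustration).

    ca_file:     path to a PEM file holding the self-signed or private CA
                 certificate that should be trusted when verifying the server.
    skip_verify: when True, the server certificate is not checked at all,
                 which is the effect a "skip SSL verify" switch has.
    """
    context = ssl.create_default_context(cafile=ca_file)
    if skip_verify:
        # Order matters: hostname checking must be disabled before
        # certificate verification can be switched off.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

With the defaults, a server presenting a certificate from an unknown authority is rejected, which corresponds to the x509 error reported here. Skipping verification removes that protection, so it is better suited to test setups than to production.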

johnqa commented 2 years ago

So I added SKIP_SSL_VERIFY to the Subscriber deployment, but now I have another error:

required key ACCESS_KEY missing value

gdsoumya commented 2 years ago

Is there a secret resource named agent-secret present in the agent ns? That should have the access key

johnqa commented 2 years ago

Yes, the secret is there, but what should I do with it?

gdsoumya commented 2 years ago

Run kubectl get secret agent-secret -n <ns> -oyaml and share the output.
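
Values under the data field of a Kubernetes Secret are base64-encoded, so the YAML output is not directly readable. A minimal sketch of decoding them, assuming the data field has already been parsed into a dictionary; the key names and values below are placeholders and are not taken from a real agent-secret.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64 values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


# Placeholder content (assumed key names); a real secret carries its own values.
example = {
    "ACCESS_KEY": base64.b64encode(b"example-access-key").decode("ascii"),
    "CLUSTER_ID": base64.b64encode(b"example-cluster-id").decode("ascii"),
}
decoded = decode_secret_data(example)
```

Decoded values are credentials, so they should be redacted before being pasted into a public issue.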

johnqa commented 2 years ago

I have added ACCESS_KEY and CLUSTER_ID to the deployment config, but now I have another error:

level=fatal msg="failed to parse cluster confirm data" data="<html>\r\n<head><title>405 Not Allowed</title></head>\r\n<body>\r\n<center><h1>405 Not Allowed</h1></center>\r\n<hr><center>nginx/1.21.6</center>\r\n</body>\r\n</html>\r\n" error="invalid character '<' looking for beginning of value"

gdsoumya commented 2 years ago

Can you try to do a fresh install with the skip SSL env var set from the very beginning in the manifest? I think there might be some issues in the manual changes

johnqa commented 2 years ago

I am deploying using the Litmus Helm chart, and I don't see where in values.yaml I can put these values for the subscriber.

gdsoumya commented 2 years ago

Use this block to add any arbitrary envs for the server: https://github.com/litmuschaos/litmus-helm/blob/cdfc397e0e3795ad62266eaf12b6027f2a38759e/charts/litmus/values.yaml#L192

gdsoumya commented 2 years ago

Just add SKIP_SSL_VERIFY: "true" in the generic block

johnqa commented 2 years ago

I did it and the current error is:

level=fatal msg="failed to confirm cluster" data= error="Post \"http://litmus.dnsname.int/backend/query\": dial tcp 10.238.40.210:80: i/o timeout"

gdsoumya commented 2 years ago

Can you check whether you can curl/wget that URL from inside the cluster network? Maybe just start a bash pod in the cluster and try accessing that URL. If it doesn't work, then there's some networking or domain setup issue.
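
Besides curl/wget, a plain TCP connection test separates "nothing reachable on that port" from HTTP-level problems. A minimal sketch, assuming Python is available in the pod used for testing; can_connect is a hypothetical helper name.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts alike.
        return False


# Example use from inside the cluster, with the host from the error above
# and the default HTTP and HTTPS ports:
#   for port in (80, 443):
#       print(port, can_connect("litmus.dnsname.int", port))
```

A False result for port 80 together with True for port 443 would match the "dial tcp ...:80: i/o timeout" message, meaning only HTTPS is reachable at that address.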

johnqa commented 2 years ago

Using curl, I was not able to connect to http://litmus.dnsname.int/backend/query, but I was able to connect to https://litmus.dnsname.int/backend/query

I have changed the ingress settings to use https instead of http and redeployed, but now the subscriber again fails with the error:

level=fatal msg="failed to parse cluster confirm data" data="<html>\r\n<head><title>405 Not Allowed</title></head>\r\n<body>\r\n<center><h1>405 Not Allowed</h1></center>\r\n<hr><center>nginx/1.21.6</center>\r\n</body>\r\n</html>\r\n" error="invalid character '<' looking for beginning of value"

gdsoumya commented 2 years ago

@johnqa to unblock yourself for now, you can just update the URL to http://litmusportal-server-service:9002/query for the self-agent and continue. Also, can you check the logs of the GraphQL server when the subscriber throws that error?
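
The "invalid character '<' looking for beginning of value" part of the error quoted above is what a JSON decoder reports when it is handed an HTML page: the data field shows an nginx "405 Not Allowed" page where a JSON reply was expected. A minimal sketch of telling the two cases apart; this is an illustration in Python, not the subscriber's actual code, and describe_response is a hypothetical name.

```python
import json


def describe_response(body: str) -> str:
    """Classify a response body as "json", "html", or "other"."""
    try:
        json.loads(body)
        return "json"
    except json.JSONDecodeError:
        pass
    # An HTML error page from a proxy or web server starts with a tag.
    if body.lstrip().lower().startswith(("<html", "<!doctype")):
        return "html"
    return "other"


# Body taken from the log line in this thread.
error_page = (
    "<html>\r\n<head><title>405 Not Allowed</title></head>\r\n<body>\r\n"
    "<center><h1>405 Not Allowed</h1></center>\r\n"
    "<hr><center>nginx/1.21.6</center>\r\n</body>\r\n</html>\r\n"
)
kind = describe_response(error_page)
```

An "html" result suggests the request was answered by a web server in front of the API rather than by the API itself, which is why checking the routing and the server logs is the next step.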

johnqa commented 2 years ago

Using http://litmus-server-service:9002/query finally worked.

Now I am worried about when I have to add an external agent :)

Thank you, John

gdsoumya commented 2 years ago

Awesome, so it's confirmed that the problem is with the domain name/TLS cert settings. IMO, if it is possible for you to just disable TLS in the ingress and try with http, I think things should work fine.