skupperproject / skupper

Skupper is an implementation of a Virtual Application Network, enabling rich hybrid cloud communication.
http://skupper.io
Apache License 2.0

Setting up skupper with NodePort #806

Open phalox opened 2 years ago

phalox commented 2 years ago

Hi Skupper team,

As we're running on DigitalOcean, the regular skupper init doesn't work (load balancers need to be set up separately). But in any case, we don't want to use the DigitalOcean load balancer. So we've tried a couple of other things:

  1. We have an nginx-ingress in use, but we want it to handle SSL for other services, so we cannot just bypass SSL as someone recommended here: https://github.com/skupperproject/skupper/issues/633
  2. So we then switched to using NodePorts, assuming it would be simpler.

On the main cluster (we set up the DNS entry with external-dns in Cloudflare):

$ skupper init --ingress nodeport --ingress-host PUBLIC-DNS-ENTRY

On the 2nd cluster:

$ skupper init --ingress none

Doing a curl returns this:

$ curl -k -v https://PUBLIC-DNS-ENTRY:31602
* Rebuilt URL to: https://PUBLIC-DNS-ENTRY:31602/
*   Trying xyz.xyz.xyz.253...
* TCP_NODELAY set
* Connected to .... port 31602 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Unknown (8):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Client hello (1):
* TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=skupper
*  start date: Jun 29 13:03:33 2022 GMT
*  expire date: Jun 28 13:03:33 2027 GMT
*  issuer: CN=skupper-local-ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* Using Stream ID: 1 (easy handle 0x5610f5e4c580)
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
> GET / HTTP/2
> Host: ............:31602
> User-Agent: curl/7.58.0
> Accept: */*
>
* TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
* TLSv1.3 (OUT), TLS Unknown, Unknown (23):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 401
< content-type: text/plain; charset=utf-8
< www-authenticate: Basic realm=skupper
< x-content-type-options: nosniff
< content-length: 13
< date: Wed, 29 Jun 2022 13:20:22 GMT
<
* TLSv1.3 (IN), TLS Unknown, Unknown (23):
Unauthorized

Which I believe means communication can happen? When I generate the token, I do see that it references another port (30696, mapping to 8081 on the skupper service). That one doesn't seem to accept HTTP requests.

I then set up the link from the 2nd cluster

$ skupper link create skupper.token
Site configured to link to https://...........:30696/308b1d34-f7af-11ec-9015-a66cade4fad5 (name=link1)
Check the status of the link using 'skupper link status'

But the link does not seem to get created:

$ skupper link status
Link link1 not active (Failed to redeem claim: No such claim)

What are we missing here? How can I debug this? I can curl from the 2nd cluster to the main cluster on the skupper port (the 8080 one).

grs commented 2 years ago

We have an nginx-ingress in use, but we want it to handle SSL for other services, so we cannot just bypass SSL as someone recommended here: https://github.com/skupperproject/skupper/issues/633

Enabling passthrough on the controller does not mean that all other ingresses will then use passthrough. Passthrough needs to be explicitly requested for a given ingress through an annotation.
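
For example, with ingress-nginx the annotation is set per Ingress resource; a sketch (the ingress name is hypothetical, and the controller itself must also run with --enable-ssl-passthrough for the annotation to take effect):

$ kubectl annotate ingress my-skupper-ingress \
    nginx.ingress.kubernetes.io/ssl-passthrough=true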

grs commented 2 years ago

By default, when you create the token it creates a token of type 'claim'. This is exchanged at the time of linking for an actual TLS certificate. However, by default the claim expires after 15 minutes. The 'no such claim' error suggests that the record of the claim may have been deleted by the time it was used. Is that possible? If you create a new token and immediately try to use it, does that work?
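
A quick way to check, using the defaults:

$ skupper token create ~/skupper.token   # on the main cluster
$ skupper link create ~/skupper.token    # on the 2nd cluster, within 15 minutes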

phalox commented 2 years ago

Great, that 15 min timeout was indeed the issue. Can I somehow help improve the docs, as this was not clear to me? Also, I assume that the token can only be used once?

So, I now see the pods exposed in the 2 k8s clusters. On the cluster that generated the token (and allows for incoming connections - the AMS3 cluster), I see:

kubectl get pods
NAME                                          READY   STATUS    RESTARTS       AGE
cockroachdb-ams3-0                            1/1     Running   1 (36h ago)    2d19h
cockroachdb-ams3-1                            1/1     Running   1 (6h5m ago)   2d19h
cockroachdb-ams3-2                            1/1     Running   1 (10h ago)    2d19h
cockroachdb-ny3-0                             1/1     Running   0              4m17s
cockroachdb-ny3-1                             1/1     Running   0              4m14s
cockroachdb-ny3-2                             1/1     Running   0              4m8s
skupper-router-7c8b58d8fc-8pjzv               2/2     Running   0              6m8s
skupper-service-controller-64bbf9bf77-8r87k   1/1     Running   0              6m8s

skupper status
Skupper is enabled for namespace "cockroach" in interior mode. It is connected to 1 other site. It has 1 exposed service.
The site console url is:  https://.......:31602
The credentials for internal console-auth mode are held in secret: 'skupper-console-users'

On the cluster connecting to AMS (called NY3) I see:

kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cockroachdb-ny3-0                             1/1     Running   0          23m
cockroachdb-ny3-1                             1/1     Running   0          24m
cockroachdb-ny3-2                             1/1     Running   0          24m
cockroachdb-proxy-0                           1/1     Running   0          103m
cockroachdb-proxy-1                           1/1     Running   0          103m
cockroachdb-proxy-2                           1/1     Running   0          103m
skupper-router-7487cc7b65-8xpwj               2/2     Running   0          43h
skupper-service-controller-689cd95f8b-6f4l5   1/1     Running   0          43h

skupper status
Skupper is enabled for namespace "cockroach" in interior mode. It is connected to 1 other site. It has 1 exposed service.

But it seems I still have connection issues. For one, the typical hostnames that would be created by a statefulset don't exist for the proxy pods...

E.g. ping cockroachdb-ny3-1.cockroachdb from the NY3 cluster works, but ping cockroachdb-ny3-1.cockroachdb from the AMS3 cluster doesn't (the hostname doesn't resolve).

Any ideas what's up? I'm not sure grouping the 2 clusters into one cockroachdb cluster will fix all my issues, but at least I'll have a hostname that's usable. I'm currently following this guide: https://www.wrong.dev/post/cockroachdb-multi-cloud-using-skupper

And a side question: Why did the NY3 pods get renamed to -proxy, while the AMS3 pods kept the name from the other cluster?

Thanks for your help! I was happy to see the synchronization taking effect. At some point I might still experiment with nginx ingress as well.

grs commented 2 years ago

Great, that 15 min timeout was indeed the issue. Can I somehow help improve the docs, as this was not clear to me? Also, I assume that the token can only be used once?

By default, yes, the token can only be used once and within 15 minutes. You can alter that through options on the skupper token create command, however.
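
For example (values are illustrative; see skupper token create --help for the full set of options):

$ skupper token create ~/skupper.token --expiry 60m --uses 2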

Improvements to the docs are always very much welcome!

grs commented 2 years ago

the typical hostnames that would be created by a statefulset don't exist for the proxy pods...

E.g. ping cockroachdb-ny3-1.cockroachdb from the NY3 cluster works, but ping cockroachdb-ny3-1.cockroachdb from the AMS3 cluster doesn't (the hostname doesn't resolve).

Any ideas what's up?

What do you see for kubectl get svc?

FYI, we also have an example for this here: https://github.com/grs/skupper-example-cockroachdb

phalox commented 2 years ago

AMS3 (main skupper):

kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                           AGE
cockroachdb            ClusterIP   None             <none>        26257/TCP,8080/TCP                2d20h
cockroachdb-public     ClusterIP   10.245.241.196   <none>        26257/TCP,8080/TCP                2d20h
skupper                NodePort    10.245.115.108   <none>        8080:31602/TCP,8081:30696/TCP     46h
skupper-router         NodePort    10.245.151.63    <none>        55671:32183/TCP,45671:32538/TCP   46h
skupper-router-local   ClusterIP   10.245.123.22    <none>        5671/TCP                          46h

NY3:

kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
cockroachdb            ClusterIP   None             <none>        26257/TCP,8080/TCP    2d20h
cockroachdb-public     ClusterIP   10.245.133.161   <none>        26257/TCP,8080/TCP    2d20h
skupper                ClusterIP   10.245.97.248    <none>        8080/TCP,8081/TCP     44h
skupper-router         ClusterIP   10.245.16.228    <none>        55671/TCP,45671/TCP   44h
skupper-router-local   ClusterIP   10.245.185.131   <none>        5671/TCP              44h

I'll check out the example. I didn't set up skupper in the 2nd direction. I assume that wasn't a requirement?

grs commented 2 years ago

You only need to link skupper one way. I suspect the issue may be a collision over the headless service? If you prefix/postfix the headless service name to match the statefulset it is used with, that would avoid the collision.
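
For example, with a statefulset cockroachdb-ny3 whose headless service is named cockroachdb-internal-ny3 (the names used later in this thread), you could sanity-check the pairing like this; a ClusterIP of None confirms the service is headless, and the second command should print the headless service's name:

$ kubectl get svc cockroachdb-internal-ny3 -o jsonpath='{.spec.clusterIP}'
None
$ kubectl get statefulset cockroachdb-ny3 -o jsonpath='{.spec.serviceName}'
cockroachdb-internal-ny3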

phalox commented 2 years ago

Great, that seems to indeed have fixed the DNS resolution error. I can now ping from NY3 to AMS3 pods. Why do the AMS3 pods get a name with "proxy" in it?

For some reason, CockroachDB is still not connecting to the proxied nodes.

The pods (on NY3) start with this:

cockroach start --logtostderr \
  --certs-dir /cockroach/cockroach-certs \
  --advertise-host $(hostname -f) \
  --http-addr 0.0.0.0 \
  --join cockroachdb-ny3-0.cockroachdb-internal-ny3,cockroachdb-ny3-1.cockroachdb-internal-ny3,cockroachdb-ny3-2.cockroachdb-internal-ny3,cockroachdb-ams3-0.cockroachdb-internal-ams3,cockroachdb-ams3-1.cockroachdb-internal-ams3,cockroachdb-ams3-2.cockroachdb-internal-ams3 \
  --cache $(expr $MEMORY_LIMIT_MIB / 4)MiB \
  --max-sql-memory $(expr $MEMORY_LIMIT_MIB / 4)MiB

But I don't see any sign of an actual connection to ams3. I'm also using cockroach in secure mode, but all SSL certs should have been shared and configured for the right hostnames. I'll go debug some more...

grs commented 2 years ago

The -proxy pods are server-side proxies for requests from the other cluster. Did you expose the -ams3 statefulset?
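
If not, a sketch of the command (26257 is the CockroachDB port from the service listings above, and the address matches the headless service name; check skupper expose --help for the exact flags):

$ skupper expose statefulset cockroachdb-ams3 --headless --port 26257 \
    --address cockroachdb-internal-ams3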

phalox commented 2 years ago

Ohhhh, I think I should not have initialized the cockroach clusters twice (before joining them). I've removed the 2nd initialized one, started everything up again, and now have 6 cockroach nodes talking to one another!
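
(For reference: cockroach init is run exactly once for the whole cluster, against any single node. A sketch, with the certs dir and host taken from the start command above:)

$ cockroach init --certs-dir=/cockroach/cockroach-certs \
    --host=cockroachdb-ams3-0.cockroachdb-internal-ams3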


And indeed, I should have exposed the other statefulset as well. I've done so now.

ams:

kubectl get pods
NAME                                          READY   STATUS              RESTARTS   AGE
cockroachdb-ams3-0                            1/1     Running             0          37m
cockroachdb-ams3-1                            1/1     Running             0          37m
cockroachdb-ams3-2                            1/1     Running             0          37m
cockroachdb-internal-ams3-proxy-0             1/1     Running             0          36m
cockroachdb-internal-ams3-proxy-1             1/1     Running             0          36m
cockroachdb-internal-ams3-proxy-2             1/1     Running             0          36m
cockroachdb-ny3-0                             1/1     Running             0          5s
cockroachdb-ny3-1                             1/1     Running             0          3s
cockroachdb-ny3-2                             0/1     ContainerCreating   0          1s
skupper-router-7c8b58d8fc-8pjzv               2/2     Running             0          3h45m
skupper-service-controller-64bbf9bf77-8r87k   1/1     Running             0          3h45m

ny:

kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cockroachdb-ams3-0                            1/1     Running   0          36m
cockroachdb-ams3-1                            1/1     Running   0          36m
cockroachdb-ams3-2                            1/1     Running   0          36m
cockroachdb-internal-ny3-proxy-0              1/1     Running   0          16s
cockroachdb-internal-ny3-proxy-1              1/1     Running   0          14s
cockroachdb-internal-ny3-proxy-2              1/1     Running   0          13s
cockroachdb-ny3-0                             1/1     Running   0          6m3s
cockroachdb-ny3-1                             1/1     Running   0          6m3s
cockroachdb-ny3-2                             1/1     Running   0          6m3s
skupper-router-7487cc7b65-8xpwj               2/2     Running   0          47h
skupper-service-controller-689cd95f8b-6f4l5   1/1     Running   0          47h

Thanks a bunch for your help! Now I should have a look at whether I should be running cockroach in secure mode, or whether I can make it work in insecure mode as well. From their docs I understand that insecure operation doesn't really work...
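
(For what it's worth, the insecure variant would just swap the cert flags for --insecure; a sketch only, with the join list abbreviated:)

$ cockroach start --insecure --logtostderr \
    --advertise-host $(hostname -f) \
    --http-addr 0.0.0.0 \
    --join cockroachdb-ny3-0.cockroachdb-internal-ny3,cockroachdb-ams3-0.cockroachdb-internal-ams3   # same --join list as above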