microsoft / mindaro

Bridge to Kubernetes - for Visual Studio and Visual Studio Code

Please document all dependencies clearly and explain some oddities ... #192

Closed: sunshine69 closed this issue 3 years ago

sunshine69 commented 3 years ago

**Describe the bug**

**To Reproduce**

I will explain the situation here and attach a simple test project later on.

So I followed the docs and tried to run the TODO app and even the bike app. They all work - good news. Then we tried our company app, and it does not work.

So I first tried to look deeper into how things work and why. From the documentation, I assumed it would work with all setups as long as:

It is not clear which types of service / ingress the plugin supports, so I assumed it works for all.

So I tried a pretty standard, very simple k8s app: an nginx ingress, a service, and one pod with one container (roughly as sketched below). It does not even require any other external service at all.
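A minimal sketch of the shape of the app, with illustrative names, images, and ports only (not the actual manifests; the real app is a statefulset with a standard HTTP liveness probe, as mentioned further down):

```
# Illustrative only: a single-container statefulset behind a service and an nginx ingress
apiVersion: v1
kind: Service
metadata:
  name: my-webservice
spec:
  selector:
    app: my-webservice
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-webservice
spec:
  serviceName: my-webservice
  replicas: 1
  selector:
    matchLabels:
      app: my-webservice
  template:
    metadata:
      labels:
        app: my-webservice
    spec:
      containers:
        - name: my-webservice
          image: my-registry/my-webservice:latest   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /
              port: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-webservice
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: my-webservice.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-webservice
                port:
                  number: 80
```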

My goal is that, after running the plugin, a new endpoint is created and requests to that new endpoint are routed to my local instance, while requests to the existing endpoint are still routed to the pod in k8s. Something like the example below.
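Roughly this (hypothetical hostnames; the prefixed URL stands in for whatever isolated endpoint the plugin generates):

```
# Existing endpoint: still served by the pod running in the cluster
curl https://myapp.example.com/

# New prefixed endpoint created by the plugin: routed to the instance on my laptop
curl https://stevek-xxxx.myapp.example.com/
```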

However, it does not seem to work - the plugin choked at:

Routing successfully enabled for service through pod 'stevek-reportportal-tool-webservice-0' in namespace 'k8s-qa'.
Waiting for 'stevek-reportportal-tool-webservice-0' in namespace 'k8s-qa' to reach running state...
Pod 'stevek-reportportal-tool-webservice-0' created in namespace 'k8s-qa'.
Found container 'reportportal-tool-webservice' in pod 'stevek-reportportal-tool-webservice-0'.
Preparing to run Bridge To Kubernetes configured as pod k8s-qa/stevek-reportportal-tool-webservice-0 ...
Connect operation failed.
Stopping workload and cleaning up...
Restore: Pod 'stevek-reportportal-tool-webservice-0' deleted.
An unexpected error occurred: 'Failed to get routing manager deployment status'
To see our active issues or file a bug report, please visit https://aka.ms/bridge-to-k8s-report.
For diagnostic information, see logs at '/tmp/Bridge To Kubernetes'.
Failed to establish a connection. Error: Connect operation failed.
An unexpected error occurred: 'Failed to get routing manager deployment status'
To see our active issues or file a bug report, please visit https://aka.ms/bridge-to-k8s-report.
For diagnostic information, see logs at '/tmp/Bridge To Kubernetes'.

I watched the operations and found that it looks like it does different things for different projects. For the TODO app, it created the routing manager pod; for my custom app, it says it detected the service and tried to create a side-car pod alongside my pod.

Unfortunately the pod creation failed: it tried to copy the current pod manifest but used a different image. I am not sure why or what it is doing (no source code or trace is available to me to explain why it behaves differently than for the TODO app or the bike app).

k -n k8s-qa describe pod stevek-reportportal-tool-webservice-0
Name:         stevek-reportportal-tool-webservice-0
Namespace:    k8s-qa
Priority:     0
Node:         aks-d4ds-27754521-vmss00002e/10.14.12.54
Start Time:   Sat, 19 Jun 2021 13:41:21 +1000
Labels:       routing.visualstudio.io/route-from=reportportal-tool-webservice
Annotations:  mindaro.io/correlation-id: b8e4210b-df58-4e21-81bb-8f1f35d9a9cf1624004947506:4eaad6a01477:449eed9ca8ba
              routing.visualstudio.io/debugged-container-name: reportportal-tool-webservice
              routing.visualstudio.io/route-on-header: kubernetes-route-as=stevek-7244
Status:       Running
IP:           10.14.12.133
IPs:
  IP:  10.14.12.133
Containers:
  reportportal-tool-webservice:
    Container ID:   containerd://d32c0f774d2ead05871ce8466f04efcd2bd1f05327cc115602bf1284a483971a
    Image:          bridgetokubernetes.azurecr.io/lpkremoteagent:0.1.6
    Image ID:       bridgetokubernetes.azurecr.io/lpkremoteagent@sha256:39e771c1b228d944102d4dc619a6b0d260b7b57f17762906f00b1f3a71315348
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 19 Jun 2021 13:41:21 +1000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      20m
      memory:   32Mi
    Liveness:   http-get http://:http-port/ delay=0s timeout=30s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http-port/ delay=0s timeout=30s period=30s #success=1 #failure=2
    Startup:    http-get http://:http-port/ delay=0s timeout=1s period=10s #success=1 #failure=30
    Environment Variables from:

You can see above that this is the side-car pod the plugin tries to create. My pod has a liveness probe, so the clone got exactly the same probe. However, the image is not my image, so how can the liveness probe succeed? As expected, it dies and the plugin fails.

So I even tried to edit my statefulset spec and remove the liveness check (roughly as sketched below). It still failed; I have not looked into the reason yet though.
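For reference, the removal was roughly the following (placeholder statefulset name; the container index 0 is an assumption):

```
kubectl -n k8s-qa patch statefulset <my-statefulset> --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```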

Now so far I think:

It has only been tested on those sample apps, and we don't know the typical deployment scenarios it supports. It obviously does not support my statefulset deployment.

I only see the diagram for creating the ingress clone and the service clone, but not a pod clone. Why, and what is the purpose of the pod clone? (That would explain why it gets it all wrong by copying the current pod config except for the image and expecting it to work with the liveness probe, etc.)

I will create a prototype of the app above, push it to a repo, and link it here later on with all the steps I did, if requested.

I hope:

To be honest, it looks like less than beta quality at the moment. Please correct me if I am doing something seriously wrong here.

Thanks

**Expected behavior**

It works for my application.

**Logs**

Will provide later for the prototype case if requested.

**Environment Details**

Client used (VS Code/Visual Studio): Visual Studio Code on Linux
Client's version: 1.57.0

```
Version: 1.57.0
Commit: b4c1bd0a9b03c749ea011b06c6d2676c8091a70c
Date: 2021-06-09T17:18:42.895Z
Electron: 12.0.9
Chrome: 89.0.4389.128
Node.js: 14.16.0
V8: 8.9.255.25-electron.0
OS: Linux x64 5.12.11
```

Operating System: Ubuntu Linux 20.04
Plugin version: v1.0.120210615

**Additional context**
sunshine69 commented 3 years ago

OK, here is a proof-of-concept repo: https://github.com/sunshine69/bridge-k8s-test. It looks like it works this time with a simple k8s deployment like that, and I still have no idea why the original app, which is similar (albeit with many more annotations and conditions, etc.), does not.

But my observation is the same: the plugin creates a pod that copies everything the current one has, including the liveness probe. However, it uses a different image, so that pod always fails.

The one that works now does not have a liveness probe set up at all.

sunshine69 commented 3 years ago

OK, I tried adding the liveness probe back in, and, as I described before, it now fails. See the second commit in the repo.

One observation is that even after stopping the plugin, the resources deployment/routingmanager-deployment and service/routingmanager-service have not been cleaned up. Not sure if that is intentional.

I restarted the plugin while leaving these resources in place, and things stopped working (no re-routing to my local instance).

Then I deleted these resources (roughly as below) and tried again. It still does not work.
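For reference, the manual cleanup was just deleting the two routing manager resources mentioned above (namespace placeholder):

```
kubectl -n <my-test-namespace> delete deployment/routingmanager-deployment
kubectl -n <my-test-namespace> delete service/routingmanager-service
```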

So, to recap after these simple tests:

sunshine69 commented 3 years ago

Here is the resource status when it stops working:

stevek@macbook-work 22:43 ~/tmp> k get all
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/routingmanager-deployment-788bfd688d-9fpjd                        1/1     Running   0          32m
pod/stevek-test-bridge-k8s-webservice-0                               1/1     Running   0          32m
pod/stevek-test-bridge-k8s-webservice-0-restore-5b767-8dmnc           1/1     Running   0          32m
pod/test-bridge-k8s-webservice-0                                      1/1     Running   0          39m
pod/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864-j74p8   1/1     Running   0          32m
pod/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864-x24vx   1/1     Running   0          32m

NAME                                                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/routingmanager-service                          ClusterIP   10.15.23.28    <none>        8766/TCP   32m
service/test-bridge-k8s-webservice                      ClusterIP   10.15.188.68   <none>        80/TCP     94m
service/test-bridge-k8s-webservice-cloned-routing-svc   ClusterIP   10.15.99.154   <none>        80/TCP     32m

NAME                                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/routingmanager-deployment                         1/1     1            1           32m
deployment.apps/test-bridge-k8s-webservice-envoy-routing-deploy   2/2     2            2           32m

NAME                                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/routingmanager-deployment-788bfd688d                        1         1         1       32m
replicaset.apps/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864   2         2         2       32m

NAME                                          READY   AGE
statefulset.apps/test-bridge-k8s-webservice   1/1     94m

NAME                                                          COMPLETIONS   DURATION   AGE
job.batch/stevek-test-bridge-k8s-webservice-0-restore-5b767   0/1           32m        32m
stevek@macbook-work 22:43 ~/tmp> 

The side-car pod I described as always failing is pod/stevek-test-bridge-k8s-webservice-0:

k describe pod/stevek-test-bridge-k8s-webservice-0
Name:         stevek-test-bridge-k8s-webservice-0
Namespace:    stevek-play-k8sbridge
Priority:     0
Node:         aks-d4ds-27754521-vmss00002c/10.14.11.8
Start Time:   Sat, 19 Jun 2021 22:10:40 +1000
Labels:       routing.visualstudio.io/route-from=test-bridge-k8s-webservice
Annotations:  mindaro.io/correlation-id: e031eed7-6e23-40de-bea5-1d94b150babf1624094247780:db40b0db776f:76c1da3ef60d
              routing.visualstudio.io/debugged-container-name: test-bridge-k8s-webservice
              routing.visualstudio.io/route-on-header: kubernetes-route-as=stevek-357c
Status:       Running
IP:           10.14.11.66
IPs:
  IP:  10.14.11.66
Containers:
  test-bridge-k8s-webservice:
    Container ID:   containerd://226a029efb8198d2352e09a8f4bb9cf161b0ca154b000d181f0f701c7d812061
    Image:          bridgetokubernetes.azurecr.io/lpkremoteagent:0.1.6
    Image ID:       bridgetokubernetes.azurecr.io/lpkremoteagent@sha256:39e771c1b228d944102d4dc619a6b0d260b7b57f17762906f00b1f3a71315348
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 19 Jun 2021 22:10:41 +1000
    Ready:          True
    Restart Count:  0
    Requests:
      memory:   32Mi
    Liveness:   http-get http://:8080/ delay=0s timeout=30s period=30s #success=1 #failure=3
    Readiness:  http-get http://:8080/ delay=0s timeout=30s period=30s #success=1 #failure=2
    Startup:    http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=30
    Environment:
      BRIDGE_COLLECT_TELEMETRY:  True
      CONSOLE_VERBOSITY:         Verbose
      BRIDGE_CORRELATION_ID:     e031eed7-6e23-40de-bea5-1d94b150babf1624094247780:db40b0db776f:76c1da3ef60d
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fjz52 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-fjz52:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fjz52
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Normal   Pulled     33m                kubelet  Container image "bridgetokubernetes.azurecr.io/lpkremoteagent:0.1.6" already present on machine
  Normal   Created    33m                kubelet  Created container test-bridge-k8s-webservice
  Normal   Started    33m                kubelet  Started container test-bridge-k8s-webservice
  Warning  Unhealthy  33m (x2 over 33m)  kubelet  Startup probe failed: Get "http://10.14.11.66:8080/": dial tcp 10.14.11.66:8080: connect: connection refused

VS Code does not show an error this time (since the pod is marked as running, the probe is not compulsory), but a request to the URL (curl -k -X GET https://stevek-357c.test-bridge-k8s.go1.cloud/) returns nothing. I am not sure where it ends up, but it is not getting to the local instance.
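If I read the routing.visualstudio.io/route-on-header annotation above correctly, either of the following should end up at my local instance (the bare hostname for the non-isolated service is my assumption):

```
# Prefixed host generated by the plugin
curl -k https://stevek-357c.test-bridge-k8s.go1.cloud/

# Assumed equivalent: the non-prefixed host plus the header the envoy routes on
curl -k https://test-bridge-k8s.go1.cloud/ -H "kubernetes-route-as: stevek-357c"
```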

sunshine69 commented 3 years ago

Could anyone have a look at this bug, please? It is currently a blocker for us to adopt the extension.

Thanks

daniv-msft commented 3 years ago

Thanks @sunshine69 for logging this issue! I just wanted to let you know that we're looking into it, and that we need a bit more time to give you a more complete answer. We should have something to share tomorrow.

amsoedal commented 3 years ago

Hi @sunshine69, Thanks so much for taking the time to log this issue and give us your feedback. I'll try to address the biggest issues that you mention here, and please don't hesitate to let me know if I missed something.

"It is not clear which type of service / ingress that the plugin support thus I assume it works for all."

This is really good feedback, and we will be working to improve our documentation and add more informative error messages in the code when unsupported types are attempted. As for your specific scenario, we do support headless services and statefulsets, but the updates were only rolled out to everyone today. (You shouldn't need to upgrade the extension, but the next time you use VS Code you should see "Updating dependencies…" in the status bar next to the "Kubernetes" tab.)

Also, do you need the isolation mode (i.e., do you typically work on a cluster shared between multiple developers)? This is currently not supported for statefulsets + headless services, but we try to prioritize our work based on customer feedback, so please let us know if you need it.

"One observation is that even after stop the plugin the resource deployment/routingmanager-deployment and service/routingmanager-service has not yet clean up. Not sure if it is intentional."

We do leave the routing manager running on purpose, since it is time-intensive to set up. The extra pods that get set up during a routing session (the envoy pods, etc.) and that I see in your "k get all" snapshot should eventually get cleaned up a few minutes after disconnecting the session.

We replace the image in the pod with our own image, which works as a proxy between the code running on your machine and all the resources in your cluster. As for the liveness and readiness probes, we've done work over the last few months to support these for deployments, but I'll need to double-check that we support them correctly with statefulsets. For now, can you try debugging in non-isolated mode to see if that changes anything?

Thanks again for trying us out and reporting this issue!

sunshine69 commented 3 years ago

Hi there,

Thanks for your time. We do need isolation mode as per our requirements.

I am not sure if your statement is correct though ...

This is currently not supported for statefulsets + headless services

I used my repo and it works flawlessly (at the time I wrote this ticket), provided I remove my liveness probe check. You can easily replicate this yourself using the branch named working in my repo.

I was unable to test yesterday - there seems to be a regression in one of your container images that caused issues - but I will verify again today.

I also tested other normal deployment scenarios and hit the same problems with the liveness probe health check.

I imagine that if you first fix the liveness health check, then it should be okay. That is, when you create the envoy proxy pod, do not copy the existing liveness probe. It does not make sense to do so, because your envoy uses a totally different image from the one you copied it from.

You can define your own liveness health check, depending on how you design your envoy image (something like the sketch below, for example).
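Purely as an illustration (the port and probe type are assumptions; whatever the agent image actually serves should drive this), a probe defined for the injected agent container itself, rather than copied from my container, could look like:

```
# Hypothetical probe for the injected agent container, instead of reusing mine
livenessProbe:
  tcpSocket:
    port: 8080        # whichever port the agent actually listens on (assumption)
  initialDelaySeconds: 5
  periodSeconds: 30
readinessProbe:
  tcpSocket:
    port: 8080
  periodSeconds: 30
```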

If you can please ...

Thank you so much for your attention.

sunshine69 commented 3 years ago

Team, how is it going with regard to this issue? Is there anything more I can help you with?

amsoedal commented 3 years ago

Hi @sunshine69,

Thanks so much for your patience. I re-tried this scenario today with the sample you provided and can hopefully give a few more details. I can see that you are correct in that the connection gets set up correctly for the sample, even with isolation mode enabled. However, the actual "isolation" piece of it is not working.

I started by putting my breakpoint here: [screenshot]

When I hit the non-isolated URL, my breakpoint was not hit (expected), and the endpoint works correctly. [screenshot]

When I hit the isolated URL, my breakpoint was not hit (as it should have been if routing were working), and we failed to resolve the endpoint. [screenshot]

The reason for this is the difference between how DNS works for ClusterIP services vs. headless services -- it will require a new design compared to the classic routing scenario. If this is a blocker for your team, I can bring this up during triage and see if we can prioritize it. (Bridge also needs to get better at providing output when a scenario is not supported.)
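As a rough illustration with hypothetical service names: a ClusterIP service resolves to a single virtual IP that the routing layer can front transparently, while a headless service resolves directly to the individual pod IPs, so the same approach does not carry over.

```
# Hypothetical names, run from any pod inside the cluster
nslookup my-clusterip-svc   # returns the service's single virtual IP
nslookup my-headless-svc    # returns the backing pod IPs directly
```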

In the short-to-medium term, the answer to your question is that routing + statefulsets isn't supported correctly, and if you need liveness probes it would be best to debug the application in non-isolated mode. The long-term answer is that we need to support this, and learning more about your team's scenario will help us prioritize.

Thanks!

sunshine69 commented 3 years ago

Hi, thanks for the test.

I did not set a breakpoint in my test, but I can clearly see that the requests are hitting my local server: the VS Code debugging console prints output, and a counter increments, each time I send a request.

I will record a video session and post it here tomorrow (not sure if this ticket allows attaching a Loom session, but I will try tomorrow; it is a bit late now).

sunshine69 commented 3 years ago

I have done the tests today and everything works, with both a deployment and a statefulset. So well done to the team. I am not sure which change fixed it, but I found it much more stable than before.

This also fixes the issue with the liveness probe - that is, my deployment can now have its liveness probe as normal, and the envoy pod does not copy it over and fail.

amsoedal commented 3 years ago

@sunshine69 so pleased to hear this! Let us know if you encounter anything else.