Closed sunshine69 closed 3 years ago
OK, here is a proof-of-concept repo: https://github.com/sunshine69/bridge-k8s-test . It looks like it works this time with a simple k8s deployment like that, and I still have no idea why the original app with a similar setup fails (albeit it has many more annotations, conditions, etc.).
But my observation is the same. The plugin created a pod copying everything the current one has, including the liveness probe. However, it uses a different image, so the pod always fails.
The one that works now does not have a liveness probe set up at all.
OK, I tried to add the liveness probe back in and, as I described before, it failed. See the second commit in the repo.
One observation: even after stopping the plugin, the resources deployment/routingmanager-deployment and service/routingmanager-service have not been cleaned up. Not sure if this is intentional.
I restarted the plugin with these resources still present and things stopped working (no re-routing to my local instance).
Then I deleted these resources and tried again. Still not working.
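For anyone who ends up in the same state, this is roughly the manual cleanup I did. The resource names come from the `k get all` snapshot in this comment; adjust the namespace to your own:

```sh
# Remove the routing manager the plugin left behind after disconnecting
kubectl delete deployment/routingmanager-deployment service/routingmanager-service

# Remove the cloned routing resources if they are not garbage-collected
kubectl delete deployment/test-bridge-k8s-webservice-envoy-routing-deploy \
  service/test-bridge-k8s-webservice-cloned-routing-svc
```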
So to recap after these simple tests, here is the resource status when it stops working:
```
stevek@macbook-work 22:43 ~/tmp> k get all
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/routingmanager-deployment-788bfd688d-9fpjd                        1/1     Running   0          32m
pod/stevek-test-bridge-k8s-webservice-0                               1/1     Running   0          32m
pod/stevek-test-bridge-k8s-webservice-0-restore-5b767-8dmnc           1/1     Running   0          32m
pod/test-bridge-k8s-webservice-0                                      1/1     Running   0          39m
pod/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864-j74p8   1/1     Running   0          32m
pod/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864-x24vx   1/1     Running   0          32m

NAME                                                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/routingmanager-service                          ClusterIP   10.15.23.28    <none>        8766/TCP   32m
service/test-bridge-k8s-webservice                      ClusterIP   10.15.188.68   <none>        80/TCP     94m
service/test-bridge-k8s-webservice-cloned-routing-svc   ClusterIP   10.15.99.154   <none>        80/TCP     32m

NAME                                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/routingmanager-deployment                         1/1     1            1           32m
deployment.apps/test-bridge-k8s-webservice-envoy-routing-deploy   2/2     2            2           32m

NAME                                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/routingmanager-deployment-788bfd688d                        1         1         1       32m
replicaset.apps/test-bridge-k8s-webservice-envoy-routing-deploy-5c5649864   2         2         2       32m

NAME                                          READY   AGE
statefulset.apps/test-bridge-k8s-webservice   1/1     94m

NAME                                                          COMPLETIONS   DURATION   AGE
job.batch/stevek-test-bridge-k8s-webservice-0-restore-5b767   0/1           32m        32m

stevek@macbook-work 22:43 ~/tmp>
```
The sidecar pod I described as always failing is pod/stevek-test-bridge-k8s-webservice-0:
```
k describe pod/stevek-test-bridge-k8s-webservice-0

Name:         stevek-test-bridge-k8s-webservice-0
Namespace:    stevek-play-k8sbridge
Priority:     0
Node:         aks-d4ds-27754521-vmss00002c/10.14.11.8
Start Time:   Sat, 19 Jun 2021 22:10:40 +1000
Labels:       routing.visualstudio.io/route-from=test-bridge-k8s-webservice
Annotations:  mindaro.io/correlation-id: e031eed7-6e23-40de-bea5-1d94b150babf1624094247780:db40b0db776f:76c1da3ef60d
              routing.visualstudio.io/debugged-container-name: test-bridge-k8s-webservice
              routing.visualstudio.io/route-on-header: kubernetes-route-as=stevek-357c
Status:       Running
IP:           10.14.11.66
IPs:
  IP:  10.14.11.66
Containers:
  test-bridge-k8s-webservice:
    Container ID:   containerd://226a029efb8198d2352e09a8f4bb9cf161b0ca154b000d181f0f701c7d812061
    Image:          bridgetokubernetes.azurecr.io/lpkremoteagent:0.1.6
    Image ID:       bridgetokubernetes.azurecr.io/lpkremoteagent@sha256:39e771c1b228d944102d4dc619a6b0d260b7b57f17762906f00b1f3a71315348
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 19 Jun 2021 22:10:41 +1000
    Ready:          True
    Restart Count:  0
    Requests:
      memory:  32Mi
    Liveness:   http-get http://:8080/ delay=0s timeout=30s period=30s #success=1 #failure=3
    Readiness:  http-get http://:8080/ delay=0s timeout=30s period=30s #success=1 #failure=2
    Startup:    http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=30
    Environment:
      BRIDGE_COLLECT_TELEMETRY:  True
      CONSOLE_VERBOSITY:         Verbose
      BRIDGE_CORRELATION_ID:     e031eed7-6e23-40de-bea5-1d94b150babf1624094247780:db40b0db776f:76c1da3ef60d
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fjz52 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-fjz52:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fjz52
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Normal   Pulled     33m                kubelet  Container image "bridgetokubernetes.azurecr.io/lpkremoteagent:0.1.6" already present on machine
  Normal   Created    33m                kubelet  Created container test-bridge-k8s-webservice
  Normal   Started    33m                kubelet  Started container test-bridge-k8s-webservice
  Warning  Unhealthy  33m (x2 over 33m)  kubelet  Startup probe failed: Get "http://10.14.11.66:8080/": dial tcp 10.14.11.66:8080: connect: connection refused
```
vscode does not show an error this time (since the pod is marked as Running, the probe is not compulsory), but a request to the URL (`curl -k -X GET https://stevek-357c.test-bridge-k8s.go1.cloud/`) returns nothing. (Not sure where it ends up, but it is not reaching the local instance.)
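For reference, the isolated route should be reachable two ways: via the prefixed hostname, or by setting the routing header shown in the pod annotations (`kubernetes-route-as=stevek-357c`). Both should end up at my local instance if routing works. (The exact host used for the header variant is my assumption about how the ingress is set up.)

```sh
# Via the isolated subdomain
curl -k https://stevek-357c.test-bridge-k8s.go1.cloud/

# Via the routing header on the normal host (hostname assumed)
curl -k -H "kubernetes-route-as: stevek-357c" https://test-bridge-k8s.go1.cloud/
```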
Can anyone have a look at this bug, please? It is a blocker for us to use the extension at the moment.
Thanks
Thanks @sunshine69 for logging this issue! I just wanted to let you know that we're looking into it, and that we need a bit more time to give you a more complete answer. We should have something to share tomorrow.
Hi @sunshine69, Thanks so much for taking the time to log this issue and give us your feedback. I'll try to address the biggest issues that you mention here, and please don't hesitate to let me know if I missed something.
"It is not clear which type of service / ingress that the plugin support thus I assume it works for all."
This is really good feedback, and we will be working to improve our documentation and add more informative error messages in the code when unsupported types are attempted. As for your specific scenario, we do support headless services and statefulsets, but the updates were only rolled out to everyone today. (You shouldn't need to upgrade the extension, but the next time you use VS Code you should see "Updating dependencies…" in the status bar next to the "Kubernetes" tab.)
Also, do you need the isolation mode (i.e., do you typically work on a cluster shared between multiple developers)? This is currently not supported for statefulsets + headless services, but we try to prioritize our work based on customer feedback, so please let us know if you need it.
"One observation is that even after stop the plugin the resource deployment/routingmanager-deployment and service/routingmanager-service has not yet clean up. Not sure if it is intentional."
We do leave the routing manager running on purpose, since it is time-intensive to set up. The extra pods that get set up during a routing session (the envoy pods, etc.) and that I see in your `k get all` snapshot should get cleaned up a few minutes after disconnecting the session.
We replace the image in the pod with our own image that works as a proxy between the code running on your machine and all the resources in your cluster. For the liveness and readiness probes, we've done the work to support these for deployments for the last few months, but I'll need to double check that we support this correctly with statefulsets. For now, can you try debugging in non-isolated mode to see if that changes anything?
Thanks again for trying us out and reporting this issue!
Hi there,
Thanks for your time. We do need isolation mode as per our requirements.
I am not sure your statement is correct, though:
This is currently not supported for statefulsets + headless services
I use my repo and it works flawlessly (at the time I wrote this ticket); I just removed my liveness probe check. You can easily replicate it yourself using my repo on the `working` branch.
I was unable to test yesterday (there seemed to be a regression in one of your container images that caused issues), but I will verify it again today.
I also tested other normal deployment scenarios and hit the same problems with the liveness probe health check.
I imagine that if you first fix the liveness health check, it should be okay. That is, when you create the envoy proxy pod, do not copy the existing liveness probe. It does not make sense to do so, because your envoy uses a totally different image from the one you copy from.
You can define your own liveness health check, depending on how you design your envoy image.
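To illustrate the problem: a probe like the following on the original container (values taken from the `k describe` output above; the path is whatever the original spec declares) gets copied verbatim onto the proxy pod, whose image does not necessarily serve anything on that endpoint, so the copied probe can never succeed:

```yaml
# Liveness probe copied from the debugged container onto the proxy pod.
# The proxy runs a different image (lpkremoteagent) that may not answer
# HTTP on this path/port, so the probe fails and the pod is killed.
livenessProbe:
  httpGet:
    path: /          # path from the original container spec
    port: 8080       # port from the original container spec
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 3
```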
If you can please ...
Thank you so much for your attention.
Team, how is it going with regards to this issue? Is there anything more that I can help you with?
Hi @sunshine69,
Thanks so much for the patience. I re-tried this scenario today with the sample you provided and can hopefully provide a few more details. I can see that you are correct in that the connection gets set up correctly for the sample, even when we have enabled isolation mode. However, the actual "isolation" piece of it is not working.
I started by putting my breakpoint here:
When I hit the non-isolated URL, my breakpoint was not hit (expected), and the endpoint worked correctly.
When I hit the isolated URL, my breakpoint was not hit (it should have been, if routing were working), and we failed to resolve the endpoint.
The reason for this is the difference between how DNS works for ClusterIP services vs. headless services -- it will require a new design compared to the classic routing scenario. If this is a blocker for your team, I can bring this up during triage and see if we can prioritize it. (Bridge also needs to get better at providing output when a scenario is not supported.)
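A rough illustration of that DNS difference, run from a pod inside the cluster (the ClusterIP service name is from the snapshot earlier in this thread; the headless service name is hypothetical):

```sh
# ClusterIP service: resolves to one stable virtual IP (10.15.188.68 in my
# snapshot), a single hop where routing/interception can be inserted
nslookup test-bridge-k8s-webservice

# Headless service (clusterIP: None): DNS returns one A record per backing
# pod, so clients connect to pod IPs directly and there is no single
# service-level hop where the envoy routing can sit
nslookup my-headless-service   # hypothetical name
```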
In the short-to-medium term, the answer to your question is that routing + statefulsets aren't supported correctly, and if you need liveness probes it would be best to debug the application in non-isolated mode. The long term answer is that we need to support this and learning more about your team's scenario will help us to prioritize.
Thanks!
Hi, thanks for the test.
I did not set a breakpoint in my test, but I can clearly see that requests are hitting my local server: the VS Code debugging console prints and increments a counter each time I send a request.
I will record a video session and post it here tomorrow (not sure if this ticket allows attaching a Loom session, but I will try tomorrow; it is a bit late now).
I have done tests today and everything works, with deployments and with statefulsets as well. So well done to the team; I am not sure which changes fixed it, but I found it much more stable than before.
This also fixes the issue with the liveness probe: my deployment can now have its liveness probe as normal, and the envoy pod no longer copies it over and fails.
@sunshine69 so pleased to hear this! Let us know if you encounter anything else.
**Describe the bug**
**To Reproduce**
I will explain the situation and will make a simple test project to attach later on.
So I followed the doc and tried to run the TODO app and even the bike app. All worked. Good news. Then we tried our company app. It does not work.
So I tried to look deeper into how things work and why. From the documentation I assumed that it would work with all setups, as long as:
It is not clear which type of service / ingress the plugin supports, thus I assumed it works for all.
So I tried a pretty standard k8s app, a very simple one, which uses an nginx ingress, a service, and one pod with one container. It does not even require any other external service at all.
My goal is to be able to create a new endpoint after running the plugin, with requests to that endpoint routed to my local instance while the existing endpoint is still routed to k8s.
However, it does not seem to work; the plugin choked at
I watched the operations and found that it does different things in different projects. For the TODO app it created the route manager pod; for my custom app it says it detected the service and tried to create a sidecar pod alongside my pod.
Unfortunately the pod creation failed, as it tried to copy the current pod manifest but with a different image. I am not sure why or how (no source code or trace is available to me to explain why it behaves differently than in the TODO app or the bike app).
You can see the above: that is the sidecar pod the plugin tries to create. Mine has a liveness probe, so it created exactly the same one. However, the image is not my image, so how can the liveness probe succeed? As expected, it dies and the plugin fails.
So I even tried to edit my statefulset description and remove the liveness check. It still failed; I have not described the reason yet, though.
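The edit I made was essentially just deleting the probe block from the container spec in the statefulset, along these lines (container name and image are placeholders for illustration; the real manifest is in my repo):

```yaml
# Container spec with the liveness probe removed (the "working" state).
# Without livenessProbe, kubelet only checks that the process is running,
# which lets the plugin's proxy pod come up despite its different image.
containers:
  - name: webservice        # placeholder name
    image: nginx:stable     # placeholder image
    ports:
      - containerPort: 8080
    # livenessProbe: removed, since the plugin copies it onto its proxy pod
```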
Now, so far I think:
It has only been tested on these sample apps, and we do not know which typical deployment scenarios it supports. Obviously it does not support my statefulset deployment.
I only see the diagram of creating the ingress clone and the service clone, but not the pod clone. Why, and what is the purpose of the pod clone? (That would explain why it gets it all wrong by copying the current pod config except for the image, and expecting it to work together with liveness probes, etc.)
I will create a prototype of the app above, push it to a repo, and link it here later with all the steps I took, if requested.
I hope:
To be honest, it looks like less than beta quality at the moment; please correct me if I am doing something seriously wrong here.
Thanks
**Expected behavior**
It works for my application.

**Logs**
Will provide later for the prototype case if requested.

**Environment Details**
Client used (VS Code/Visual Studio): Visual Studio Code on Linux
Client's version:
```
Version: 1.57.0
Commit: b4c1bd0a9b03c749ea011b06c6d2676c8091a70c
Date: 2021-06-09T17:18:42.895Z
Electron: 12.0.9
Chrome: 89.0.4389.128
Node.js: 14.16.0
V8: 8.9.255.25-electron.0
OS: Linux x64 5.12.11
```
Operating System: Ubuntu Linux 20.04
Plugin version: v1.0.120210615

**Additional context**