lensapp / lens

Lens - The way the world runs Kubernetes
https://k8slens.dev/
MIT License

Internet connection stops working when using Lens #6063

Open cachila opened 2 years ago

cachila commented 2 years ago

Describe the bug From time to time when using Lens, the internet connection stops working on my computer. After closing Lens, it starts working again.

To Reproduce Steps to reproduce the behavior:

  1. Open Lens.
  2. Keep using it and after some hours, Internet will stop working.

Expected behavior Internet connection keeps active.

Environment (please complete the following information):

Additional context Usually it happens once in a work day (8 - 9 hours).

matti commented 2 years ago

hmm, I may have experienced this too. weird.

Nokel81 commented 2 years ago

We would like to better understand how you are using Lens to see if we can track down why Lens would be able to do this.

Are you connecting to a lot of clusters at once? Do you use other features such as Lens Spaces? Could you use a network analyzer to display the active connections that Lens is making when this next happens?
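For anyone who wants to capture that data the next time it happens, here is a minimal sketch (the "lens" process-name filter and reliance on `lsof` being present are assumptions) that tallies connection states from `lsof` output — a pile-up of CLOSE_WAIT sockets would point at leaked connections:

```shell
# Tally TCP connection states (ESTABLISHED, CLOSE_WAIT, ...) from the output
# of `lsof -i -n -P`. -n/-P skip DNS and port-name lookups so the snapshot
# itself adds no network traffic.
count_states() {
  awk 'match($0, /\(([A-Z_]+)\)$/) {
    print substr($0, RSTART + 1, RLENGTH - 2)
  }' | sort | uniq -c | sort -rn
}
# Usage (run while the issue is happening):
#   lsof -i -n -P | grep -i lens | count_states
```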

cachila commented 2 years ago
BenNF commented 2 years ago

I've had this happen as well maybe 3-5 times over 2 weeks, but haven't been able to figure out any specific behavior that triggers it.

Lens version: 6.0.1-latest.20220810.2 OS: OSX 12.5.1 Installation: Install from .dmg, arm64 version

I'm mostly only looking at the logs of pods, and will have several tabs open at once for an extended period, so that's my best guess for a cause.

OuFinx commented 2 years ago

Same issue here.

For two days I tried to work out where the problem was, and I figured out that when Lens is open, my internet connection stops working after ~10-15 minutes.

Also, yesterday Lens froze and I couldn't close it, even through Activity Monitor; after that my laptop froze and I had to turn it off by holding the power button.

Lens version: 6.0.1-latest.20220810.2 OS: OSX 12.5.1 Macbook: Air 2020 M1

I ran the ping command; this is what happened during one of these weird situations:

(screenshot attached)

Also, there were a lot of situations where ping showed the same thing, as if my Wi-Fi didn't work (ping requests failed).

I closed Lens and ping started working again. I opened Lens, worked for ~10 minutes, and everything repeated.
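A small helper for these ping experiments: timestamping each line of ping output records the exact moment the connection dies, which can then be lined up against Lens's own logs. A sketch in POSIX shell (the log file name is arbitrary):

```shell
# Prefix every line read on stdin with a wall-clock timestamp, so the
# "Request timeout" lines in ping output can be correlated with Lens activity.
stamp() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
  done
}
# Usage (leave running in a spare terminal while Lens is open):
#   ping 1.1.1.1 | stamp | tee ping.log
```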

OuFinx commented 2 years ago

Following up on my previous message, I just caught the same thing again:

(screenshot: CleanShot 2022-08-25 at 12 38 58@2x)
cachila commented 2 years ago

Also, yesterday Lens froze and I couldn't close it, even through Activity Monitor; after that my laptop froze and I had to turn it off by holding the power button.

The same happened to me a few weeks ago. At the time I didn't realize it was Lens.

nicobistolfi commented 2 years ago

Same issue here, from time to time the internet connection drops, and after closing Lens, everything goes back to normal.

version: Lens: 6.0.1-latest.20220810.2

@Nokel81 I'm connecting to multiple clusters at once, up to 3-4 at the same time and I use Lens Space for 1 cluster.

jakolehm commented 1 year ago

I have also seen this once; I didn't realize it was Lens's fault. Is everyone else also using M1 (arm64)?

cachila commented 1 year ago

@jakolehm Indeed using M1

Kulagin-G commented 1 year ago

I've also seen this issue twice this week but unfortunately, I don't have specific details at the moment.

MacOS Monterey 12.1 (21C52)
Apple M1 Pro
monoxane commented 1 year ago

I'm hitting this several times a day; the more clusters I have connected, the worse it is, and the more resources in those clusters, the worse it is.

Having 1 cluster open that has several hundred pods on 10 nodes will cause a complete network failure in under 40 minutes every time.

It seems to be related to the total amount of resources in currently connected clusters.

M1 MBP with macOS 12.5.1 (21G83)

monoxane commented 1 year ago

Just had it happen again while monitoring ping and netstat. There was no unusual number of sockets open; once I disconnected the clusters and closed Lens, everything worked again. I'll add more diagnostics and try to catch more data.

Kulagin-G commented 1 year ago

I was connected to an EKS cluster when, at one moment, the connection session expired; right after that the Lens UI got stuck (I think it's the Electron part) and my VPN connection was dropped. Maybe it's just a coincidence. In the Lens logs I found only these messages, where xxxxxxxx is a redacted EKS resource. Hope it's helpful, because I don't have time to investigate this issue further.

info: [STORE]: SAVING /Users/gkulagin/Library/Application Support/Lens/lens-cluster-store.json
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:40:05 PM User has logged out or didn't agree to connect to Lens Cloud, do nothing on resume. .
error: [CLUSTER]: Failed to connect to "xxxxxxxx": StatusCodeError: 500 - "read tcp xxxxxx: read: connection reset by peer\n"
info: [STORE]: SAVING /Users/gkulagin/Library/Application Support/Lens/lens-cluster-store.json
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:40:05 PM User has logged out or didn't agree to connect to Lens Cloud, do nothing on resume. .
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:40:05 PM Broadcasted spaces change to SpacesListener on renderer
warn: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:40:05 PM No spaces in SpaceSyncer disposeSpacesReaction, not changing selected space.
info: [LENS-SPACES-EXTENSION]: (from:TokenRefresher) 9/12/2022, 1:40:05 PM Detect token expires in 26597s (26597285.00008583ms), set refresh interval.
info: [LENS-SPACES-EXTENSION]: (from:TokenRefresher) 9/12/2022, 1:40:05 PM Start token refresh iterator... scheduled next refresh 9/12/2022, 8:54:31 PM
info: [CLUSTER]: refresh {"accessible":false,"disconnected":false,"id":"3ad34ed6e316c60c111d99f54ad4134a","name":"xxxxxxxx","online":false,"ready":true}
info: [CLUSTER]: refresh {"accessible":true,"disconnected":false,"id":"3ad34ed6e316c60c111d99f54ad4134a","name":"xxxxxxxx","online":true,"ready":true}
info: [CLUSTER]: refreshMetadata {"accessible":true,"disconnected":false,"id":"3ad34ed6e316c60c111d99f54ad4134a","name":"xxxxxxxx","online":true,"ready":true}
error: [UPDATE-APP/CHECK-FOR-UPDATES] net::ERR_NETWORK_CHANGED {"stack":"Error: net::ERR_NETWORK_CHANGED\n    at SimpleURLLoaderWrapper.<anonymous> (node:electron/js2c/browser_init:105:7068)\n    at SimpleURLLoaderWrapper.emit (node:events:394:28)\n    at SimpleURLLoaderWrapper.emit (node:domain:470:12)"}
info: [CLUSTER-MANAGER]: network is offline
warn: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM isLensCloudStatusOk returns false
info: [LENS-SPACES-EXTENSION]: (from:TokenRefresher) 9/12/2022, 1:55:46 PM Token refresh iterator stopped by ok from false => true
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM Offline space Added
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM Broadcasted spaces change to SpacesListener on renderer
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM Selected the offline space...
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM setting resumeState to SUSPENDING
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:55:46 PM setting resumeState to undefined
error: [CLUSTER]: Failed to connect to "xxxxxxxx": StatusCodeError: 500 - "dial tcp: lookup xxxxxxxx on [::1]:53: read udp [::1]:57701->[::1]:53: read: connection refused\n"
error: [CLUSTER]: Failed to connect to "xxxxxxxx": StatusCodeError: 500 - "dial tcp: lookup xxxxxxxx on [::1]:53: read udp [::1]:57701->[::1]:53: read: connection refused\n"
info: [CLUSTER-MANAGER]: network is online
error: [CLUSTER]: Failed to connect to "xxxxxxxx": StatusCodeError: 500 - "dial tcp: lookup xxxxxxxx on [::1]:53: read udp [::1]:57715->[::1]:53: read: connection refused\n"
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:56:06 PM User has logged out or didn't agree to connect to Lens Cloud, do nothing on resume. .
info: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:56:06 PM Broadcasted spaces change to SpacesListener on renderer
warn: [LENS-SPACES-EXTENSION]: 9/12/2022, 1:56:06 PM No spaces in SpaceSyncer disposeSpacesReaction, not changing selected space.
info: [LENS-SPACES-EXTENSION]: (from:TokenRefresher) 9/12/2022, 1:56:06 PM Detect token expires in 25636s (25636170.000076294ms), set refresh interval.
info: [LENS-SPACES-EXTENSION]: (from:TokenRefresher) 9/12/2022, 1:56:06 PM Start token refresh iterator... scheduled next refresh 9/12/2022, 8:54:50 P
monoxane commented 1 year ago

Opening Lens again after it's happened, but before a reboot, causes it to happen again immediately.

monoxane commented 1 year ago

Seems to be related to the number of Kubernetes updates being processed; potentially something is broken in a watch loop. I left Lens open in the background for a few hours in the state that usually causes issues and it was fine; then I started working with the cluster in a terminal with kubectl, and after applying a bunch of manifests it immediately killed my internet.

cachila commented 1 year ago

Any news regarding this issue?

wallacepf commented 1 year ago

Same thing here. I lost the connection many times during the day.

Latest version of Lens, M1 MBP with macOS 12.6.1

matti commented 1 year ago

@wallacepf and others - are you using Lens with multiple clusters?

wallacepf commented 1 year ago

In my case, nope. I have a single AKS cluster for demo purposes. Brand new, BTW, so I don't believe the number of objects influences the issue.

gaspo53 commented 1 year ago

Hi! Same happening here, it's like the app does something with the host network and overflows something

cachila commented 1 year ago

Any updates on this issue? It's making the use of Lens a bit frustrating.

monoxane commented 1 year ago

I've switched to using OpenLens directly instead of the proprietary binary and have not had a problem since. I think it's something in their telemetry/data harvesting addons that's broken.

jakolehm commented 1 year ago

I haven't experienced this anymore with the latest Lens Desktop builds (been running Lens all day long on M1). Still unclear to me what could cause it.

cachila commented 1 year ago

Still having issues with the latest build, even with OpenLens.

cachila commented 1 year ago

It seems to happen more frequently if I use the app "heavily", i.e. access a cluster, restart a few deployments, change clusters, and repeat on 2 or 3 clusters.

monoxane commented 1 year ago

So it does happen in OpenLens, and it's even more intriguing than I thought. I just happened to have the macOS network settings open for something else I was looking at, and when it occurred it cleared the IPv4 settings from my primary interface! IPv6 remained, which completely explains why some sites and endpoints still worked after the issue happened, but not all.

Immediately after this happened I went back to my work and pressed up in zsh, only to find this unexpected scrollback: '/Applications/OpenLens.app/Contents/MacOS/OpenLens' -p '"856046b874f54de788d62ef8cc0b2478" + JSON.stringify(process.env) + "856046b874f54de788d62ef8cc0b2478"'

Nokel81 commented 1 year ago

@monoxane What version were you running? #6551 is only part of 6.1.19.

That line is there because we sync shell env variables. It is a similar method to what VSCode does.

monoxane commented 1 year ago

@Nokel81, that makes sense. It might be indicative of something dying and being called again, then, because it happened at exactly the same time.

(screenshot attached)

Nokel81 commented 1 year ago

We do have a timeout of 30s on that shell sync but I don't understand how sending a SIGTERM to a process would clear the IPv4 settings.

monoxane commented 1 year ago

I don't think that command is the problem, more so that it's indicative of something else breaking somewhere and causing it to get re-executed. I'm not great with TypeScript, nor familiar with the architecture in use here, so my brief attempt at looking didn't find anything that obviously stood out to me.

Nokel81 commented 1 year ago

Does the issue coincide with opening a terminal within Lens?

monoxane commented 1 year ago

No, it does not; it happened while Lens was in the background, not being used.

cachila commented 1 year ago

It seems to happen more frequently if I use the app "heavily", i.e. access a cluster, restart a few deployments, change clusters, and repeat on 2 or 3 clusters.

@monoxane @Nokel81 Are you able to replicate doing this?

Nokel81 commented 1 year ago

@cachila I have not been able to reproduce it at all. One team member has but only once weeks ago.

monoxane commented 1 year ago

I will try to catch it again tomorrow and dump logs. While it's more likely to happen with more usage, it's very hard to reproduce consistently: I've had it happen within a minute of Lens being open and connected to a single cluster, and I've also gone a week straight without it while connected to 13 clusters with a lot of changing data.

What I just realised is interesting: there seem to be 2 different forms of crash, one that just kills the internet for a few seconds, and another that causes a full networking lockup, including but not limited to ZeroTier dropping offline, all open Firefox tabs and extensions crashing, pings failing, and network preferences showing no IP. The first usually resolves itself, but the second generally needs a full reboot to get everything back up and running.

cachila commented 1 year ago

@cachila I have not been able to reproduce it at all. One team member has but only once weeks ago.

Are you using M1? It seems to happen only on this kind of processor.

What I just realised is interesting: there seem to be 2 different forms of crash, one that just kills the internet for a few seconds, and another that causes a full networking lockup, including but not limited to ZeroTier dropping offline, all open Firefox tabs and extensions crashing, pings failing, and network preferences showing no IP. The first usually resolves itself, but the second generally needs a full reboot to get everything back up and running.

The second type of crash is what usually happens to me. However, I've just found that the Lens process keeps running even after I close the app; force quitting it from Activity Monitor solved the connectivity issue. This has only happened once today, but I will update if the workaround works every time.

Nokel81 commented 1 year ago

Are you using M1? It seems to happen only on this kind of processor.

No, but the team member that did reproduce it once was, yes.

BenNF commented 1 year ago

Force quitting from Activity Monitor solved the connectivity issue. This is something that happened only once today but will update if the workaround works everytime.

This is pretty much my standard workflow with Lens now. I experience the second kind of crash pretty consistently, about 3-5 times during a work day, but force quitting from Activity Monitor always solves it.

It does seem correlated with how many changes I'm making to the clusters I'm connected to, but not in any super consistent way; more updates just means it's more likely to break the network stack.

Using: CPU: M1, OS: 12.6

OpenLens: 6.1.16-latest.1667926818306
Extension API: 6.1.16
Electron: 19.1.4
Chrome: 102.0.5005.167
Node: 16.14.2
monoxane commented 1 year ago

Just had it happen again while editing a bunch of stuff across 3 GKE clusters; it was the second, major crash I mentioned. The internet settings stayed configured fine, so I have a feeling that one might be unrelated. I cannot find any relevant logs of the crash, but the dev tools of the UI showed it happened when initing the watch connection for pods.

OpenLens: 6.1.19-latest.1668106139289 Extension API: 6.1.19 Electron: 19.1.5 Chrome: 102.0.5005.167 Node: 16.14.2

Darwin zeta.local 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:35 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101 arm64 (MacOS 12.5.1 on M1)

monoxane commented 1 year ago

Okay, something else I've noticed: it's significantly more likely to crash if I'm using remote clusters with somewhat high latency (GKE in the Netherlands and the US) than if I'm using the cluster in the next room on the same local subnet. Does this track with anyone else's experience?

cachila commented 1 year ago

@monoxane I am usually connected to 3 or 4 cloud clusters on DO and AWS

monoxane commented 1 year ago

If that's the case, I wonder if what's actually going on is lower in the OS: some sort of limit on TCP socket activity in the M1 version of macOS that's exacerbated by high-latency, long-lasting connections. All the kube watch requests are WebSocket connections, so they could potentially be timing out with higher latency to the cloud, getting reconnected, but leaving the old ones hanging somewhere in the network stack, starving everything else of available sockets until the network stack itself crashes. With this idea in mind, I'll do some more debugging next time it happens.

Nokel81 commented 1 year ago

Actually, the kube watch requests are long-polling HTTP requests. If they were WebSockets, that would be much better for us IMO.

So yes that could be an issue. I will see if I can find any information about that.

matti commented 1 year ago

@Nokel81 what if you added some debug information to Lens that people could check when this bug happens?

matti commented 1 year ago

I saw "no buffer space available" when this happened
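"No buffer space available" is the message for the ENOBUFS error, which on macOS typically points at the kernel's mbuf (network buffer) pool being exhausted; `netstat -m` prints the mbuf statistics. A hedged helper for logging that alongside the reproduction attempts above (the `netstat -m` output format assumed here is the BSD/macOS "used/total mbufs in use" line and may differ between releases):

```shell
# Extract the "<used>/<total> mbufs in use" line from macOS `netstat -m`
# output and print pool usage as a percentage. The input format is an
# assumption; adjust the pattern if your netstat prints it differently.
mbuf_pct() {
  awk -F'[ /]' '/mbufs in use/ { printf "%.0f%%\n", 100 * $1 / $2; exit }'
}
# Usage (log every 30s while trying to reproduce):
#   while true; do date; netstat -m | mbuf_pct; sleep 30; done >> mbuf.log
```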

matti commented 1 year ago

I saw this when it happened

(screenshot attached)
matti commented 1 year ago

happened again after 3 hours of working with two clusters

matti commented 1 year ago

again with one cluster after ~3h of uptime.

matti commented 1 year ago

again with ERR_NO_BUFFER_SPACE