Remote VSCode over SSH crashes EC2 instance

Zenexer commented 4 years ago

Issue Type: Bug

I've been attempting to use the new remote VSCode feature to work with a project stored on an AWS EC2 instance. Each time I use it, it works fine for a few hours. Eventually, the whole instance stops responding. AWS indicates that the instance is unresponsive in the control panel, and I have to force-stop it. The screenshot/log feature on AWS doesn't show anything. Once I boot the instance back up, there's nothing in the logs--they just cut off at the time the instance stopped responding. I wish I had more information to give you, but I'm at a loss as to how to troubleshoot this.

Other notes:

If I leave htop or top open, when the instance finally crashes, there's no indication of anything unusual. Plenty of memory, etc.
VSCode complained about fs.inotify.max_user_watches being too low when I first started using it remotely. I increased it per VSCode's instructions and confirmed that it took effect. The warning went away, but the crashes still happen.
Even if I disconnect from the remote session, the instance will still crash.
Instance type: t3a.micro
Region: us-east-1
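
For anyone who wants to double-check that the raised fs.inotify.max_user_watches value really took effect, here is a minimal Python sketch. It assumes a Linux host that exposes the setting at /proc/sys/fs/inotify/max_user_watches; the helper names are hypothetical and only for illustration.

```python
from pathlib import Path

# Assumed location of the kernel setting on Linux hosts.
WATCH_LIMIT_PATH = Path("/proc/sys/fs/inotify/max_user_watches")


def parse_watch_limit(text):
    """Parse the contents of max_user_watches into an integer."""
    return int(text.strip())


def read_watch_limit(path=WATCH_LIMIT_PATH):
    """Return the current limit, or None if it cannot be read."""
    try:
        return parse_watch_limit(Path(path).read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    limit = read_watch_limit()
    if limit is None:
        print("inotify watch limit is not available on this system")
    else:
        print(f"fs.inotify.max_user_watches = {limit}")
```

This only reports the configured limit; it says nothing about how many watches are actually in use.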

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan

cat /proc/cpuinfo

% cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7571
stepping        : 2
microcode       : 0x8001250
cpu MHz         : 2199.958
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4399.91
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
...

VS Code version: Code 1.43.2 (0ba0ca52957102ca3527cf479571617f0de6ed50, 2020-03-24T07:38:38.248Z)
OS version: Windows_NT x64 10.0.18363

System Info

|Item|Value|
|---|---|
|CPUs|Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz (12 x 4008)|
|GPU Status|2d_canvas: enabled; flash_3d: enabled; flash_stage3d: enabled; flash_stage3d_baseline: enabled; gpu_compositing: enabled; multiple_raster_threads: enabled_on; oop_rasterization: disabled_off; protected_video_decode: unavailable_off; rasterization: enabled; skia_renderer: disabled_off_ok; video_decode: enabled; viz_display_compositor: enabled_on; viz_hit_test_surface_layer: disabled_off_ok; webgl: enabled; webgl2: enabled|
|Load (avg)|undefined|
|Memory (System)|31.95GB (18.31GB free)|
|Process Argv||
|Screen Reader|no|
|VM|0%|

Extensions (4)

|Extension|Author (truncated)|Version|
|---|---|---|
|remote-ssh|ms-|0.51.0|
|remote-ssh-edit|ms-|0.51.0|
|remote-wsl|ms-|0.42.4|
|cpptools|ms-|0.27.0|

roblourens commented 4 years ago

Probably related to https://github.com/microsoft/vscode-remote-release/issues/2349. Are you opening a large folder and what extensions are you using?

Zenexer commented 4 years ago

Yes, it's a relatively large folder. I'm not using any special extensions as far as I'm aware.

Zenexer commented 4 years ago

I don't think it's related to that:

Memory usage is reasonable
I'm not seeing those reports.*.json files

clshortfuse commented 4 years ago

@roblourens According to https://github.com/microsoft/vscode-remote-release/issues/1110 1GB of RAM isn't enough.

I don't even try on t2.micro since it will without a doubt lock up the instance. This issue mentions t3a.micro which has the same size 1GB of RAM.

Has VSCode reduced the memory requirements, or is this still likely the same issue?

Zenexer commented 4 years ago

> I don't even try on t2.micro since it will without a doubt lock up the instance. This issue mentions t3a.micro which has the same size 1GB of RAM.

Memory was my first guess, but it's definitely not running out. There's still plenty available when it eventually crashes.

What's even more confusing is that the instance will still crash even after I've exited VSCode. Once it's been launched, the countdown starts--within a few hours, it will crash whether or not VSCode is still open.

roblourens commented 4 years ago

Hm, I would guess that memory is somehow the issue here even though you say it doesn't seem to be using a lot.

steelkorbin commented 4 years ago

I am concerned that this issue is far bigger than has been assumed. Using vscode remote I can crash any EC2 instance of any size (e.g., m5.xlarge) or distro (e.g., ubuntu/centos) by using it for only a very short period of time; even idling will kill it off. I attempted to contact the maintainers of the vscode remote plugin directly, but their MS Teams meeting failed, and after that their return email address was rejected as nonexistent. I am going to need to direct all of my company to avoid the vscode remote services until this is fixed. It was a very exciting idea that showed great promise, and I was going to endorse it as the staple ssh access method for all our devs and DevOps, but I cannot do that now. I cannot knowingly use vscode remote against my production ec2 instances just to watch it kill them. FYI, the effect is that no ssh is possible from anything following the crash. The ec2 reboot command does not correct it; only a complete stop and then start clears up the crash. After that, do not use vscode remote and you will be fine. So, I am back to the old ssh terminal methods until this new service is production-grade.

Zenexer commented 4 years ago

I can confirm what @steelkorbin has observed; it doesn't appear to be related to resource usage. No matter the instance type, using VSCode Remote arms a kernel time bomb. Even if VSCode Remote is exited, eventually the instance will crash--hard. No logs, no way to debug the issue. It could happen hours after VSCode Remote has been closed.

ssprakhar commented 3 years ago

Has there been any development on this one? I open a very small directory, yet the VM instance crashes pretty hard. I have no other option but to reload the VM.

steelkorbin commented 3 years ago

I do not see the vs code team acknowledging or engaging this.

roblourens commented 3 years ago

Your report is concerning but I can't reproduce it. I work with vscode remote frequently on an ubuntu cloud VM with an uptime of almost a year. I would need more info but I'm not even sure what to ask for. The easy possibility is that you have some remote extension installed which is causing issues.

pyg commented 3 years ago

I'd be happy to do some testing if there's some steps to run some kind of diagnostic log. It's happening for me, and I've tried turning off all autosuggestions (which seemed to be a trigger) and disabling TS/JS extensions, but it's still happening randomly. I've resorted to using liximomo's sftp extension for now.

For now, I've been watching htop and it jumps to 100% right before it locks up the VM, and there doesn't seem to be any rhyme or reason to it.

ssprakhar commented 3 years ago

@roblourens, it's crashing on me with a fresh install of vscode server on EC2, so all the extensions on the server are basically the ones that are there by default. No other extension added. Funny thing is, if I go to extensions, or try to play around, the whole EC2 instance crashes, and there is nothing I can do to restore or access it. I have to restart the server. One more piece of information: although just going to the extensions page makes things slow and brings them to the verge of crashing, I managed to get the TS/JS plugin disabled, and that has improved things a bit.

Disabling the above was suggested in the following post https://medium.com/good-robot/use-visual-studio-code-remote-ssh-sftp-without-crashing-your-server-a1dc2ef0936d

roblourens commented 3 years ago

Would be helpful if you can figure out which process is the one using lots of CPU/memory. It may be the generic extension host process or it may be another process associated with some extension.
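
One way to capture which process is using the most memory, without keeping htop open, is to read /proc directly. The following is only a sketch under the assumption of a Linux host with a readable /proc; the function names are hypothetical, and it ranks by resident memory (VmRSS) only, not CPU.

```python
import os


def parse_rss_kib(status_text):
    """Extract VmRSS (resident memory, in KiB) from /proc/PID/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def top_processes_by_rss(limit=10, proc_root="/proc"):
    """Return up to `limit` tuples of (rss_kib, pid, cmdline), largest first."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows  # no procfs available (for example, not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                status = fh.read()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or access was denied
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        rows.append((parse_rss_kib(status), int(entry), cmdline))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for rss_kib, pid, cmdline in top_processes_by_rss():
        print(f"{rss_kib:>10} KiB  pid={pid}  {cmdline[:80]}")
```

Running it periodically and appending the output to a file on disk may leave a trace of the last known state if the instance later becomes unreachable.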

pyg commented 3 years ago

OK I did a small test by re-typing an existing function under three conditions for five minutes each:

vscode.typescript-language-features disabled. No freezing within 5 minutes.
vscode.typescript-language-features enabled. Freezes within 5 minutes.
vscode.typescript-language-features enabled with typescript.disableAutomaticTypeAcquisition disabled (per https://stackoverflow.com/questions/52935211/disable-tsserver-for-visual-studio-code/52936301). Freezes within 5 minutes.

To be clear, this is the extension disabled or enabled:

Name: TypeScript and JavaScript Language Features
Id: vscode.typescript-language-features
Description: Provides rich language support for JavaScript and TypeScript.
Version: 1.0.0
Publisher: vscode

FWIW, I think there is another bug report similar to this one that is specifically related to this extension. Not sure if it's still open.

roblourens commented 3 years ago

Can you share the project/repo you are working on? Is it very large?

ssprakhar commented 3 years ago

> Can you share the project/repo you are working on? Is it very large?

It is boilerplate of a gatsby project

gatsby new MyProject

Not large. Super small. Tomorrow I will try on a repo with just one small text file, and update you.

clshortfuse commented 3 years ago

Is anybody having these issues with Azure? I only use EC2 and it's pretty much a guaranteed way to crash the VM. I use the Amazon Linux v1 and v2 and both have had issues.

I haven't tried the Ubuntu kernel, but if that doesn't have the issue, then we can possibly narrow it down to Amazon's kernel.

Edit: Looks pretty solid on Ubuntu v20.04 kernel.

Edit2: Died after installing eslint extension remotely.

clshortfuse commented 3 years ago

Usage just balloons up. I'm wondering if it's Amazon that kills the server for using all the CPU credits. SSH will connect after this happens, but nothing else:

OpenSSH_8.1p1, LibreSSL 2.7.3
debug1: Reading configuration data /Users/carlos/.ssh/config
debug1: /Users/carlos/.ssh/config line 1: Applying options for 18.206.XXX.XXX
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 47: Applying options for *
debug1: Connecting to 18.206.XXX.XXX [18.206.XXX.XXX] port 22.
debug1: Connection established.
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem-cert type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem type -1
debug1: identity file /Users/carlos/.ssh/XXXX-key-pair.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.1

Edit: Even stopping and starting the instance won't allow it to keep working. It seems VSCode just uses up all the CPU credits and Amazon doesn't like that. The server won't even be allowed to start up since there's no credits left. I can't even open a regular SSH anymore, even after reboot.

pyg commented 3 years ago

My issues are on the EC2 Ubuntu image. I could start an instance and give Rob access to reproduce if he wants, some time later this week.

Keen

clshortfuse commented 3 years ago

I'm pretty sure it's RAM. Opening up a project, it balloons up to 750M/979M of RAM. Closing VSCode drops it down to 164M/979M.

I opened up VSCode again and it's ballooned to 754M again. The biggest culprit is extensions/node_modules/lib/tsserver.js. I don't think the issue is so much the TS server itself as it is the lack of any memory limit whatsoever. It'll consume memory as it sees fit until it halts the system. At this point, just opening a project leaves me ~170M to work with.

What's interesting is that the node process is run with --max-old-space-size=3072 which makes little sense on a 1GB machine. I'd wager it's actually worse since I'd imagine V8 would detect the system memory available and impose a more rational limit. We're pretty much instructing it to use more RAM than allowed and I guess a crash shouldn't be unexpected.

I've modified ~/.vscode-server/data/Machine/settings.json to use:

{
  "typescript.tsserver.maxTsServerMemory": 256,
  "files.maxMemoryForLargeFilesMB": 384
}

The defaults are 3072 and 4096. Let's see if this helps now.

Edit: That still caused a crash, because it ballooned to over 90%. I tried using 64M for TsServer, but it seems 128 is a forced minimum. It's running with --max-old-space-size=128 and it looks more stable now. Memory drops to 400M when approaching 900M.

Edit2: Yep, TsServer is crashing because it runs out of RAM. Now I got an error saying TsServer has crashed 5 times in a row. The output is:

Remote	SSH: dev-test
OS	Linux x64 5.4.0-1021-aws
CPUs	Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz (1 x 2400)
Memory (System)	0.96GB (0.09GB free)
VM	0%

Better TsServer crashes because it's out of RAM than crashing the whole system though.
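
As a rough illustration of choosing a typescript.tsserver.maxTsServerMemory value relative to the machine's RAM instead of keeping the 3072 default, here is a minimal Python sketch. The 25% fraction is an arbitrary assumption for illustration, not a recommendation from VS Code; the 128 floor and 3072 ceiling come from the values reported in this comment, and the helper names are hypothetical.

```python
import json


def parse_mem_total_mib(meminfo_text):
    """Extract MemTotal from /proc/meminfo text and convert KiB to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def suggest_tsserver_cap(total_mib, fraction=0.25):
    """Suggest a cap in MiB; `fraction` is an illustrative assumption."""
    cap = int(total_mib * fraction)
    # 128 appeared to be the enforced minimum and 3072 the default in the
    # report above, so the suggestion is clamped to that range.
    return max(128, min(cap, 3072))


def settings_fragment(total_mib):
    """Render a settings.json fragment with the suggested cap."""
    cap = suggest_tsserver_cap(total_mib)
    return json.dumps({"typescript.tsserver.maxTsServerMemory": cap}, indent=2)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            total = parse_mem_total_mib(fh.read())
    except (OSError, ValueError):
        total = 979  # fall back to the 979M machine described above
    print(settings_fragment(total))
```

On the 979M instance described here this suggests a cap of 244, which sits between the 128 and 256 values that were tried.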

steelkorbin commented 3 years ago

The amount of RAM has nothing to do with it. This issue has taken down m5.large (8GB) , m5.xlarge (16GB) and m5.2xlarge (32GB) just as fast as any of the T series. I wish it was just a RAM capacity issue.

clshortfuse commented 3 years ago

@steelkorbin The amount of RAM is 100% the reason why mine crashes. I'm not sure why your servers are having issues though.

How much memory left do you have up until it locks up?

Zenexer commented 3 years ago

> Edit: Even stopping and starting the instance won't allow it to keep working. It seems VSCode just uses up all the CPU credits and Amazon doesn't like that. The server won't even be allowed to start up since there's no credits left. I can't even open a regular SSH anymore, even after reboot.

You should still be able to boot a t#-series instance after using up all the credits, and the instance should continue to run if it's already running; it'll just run slower. If it's not booting at all or has stopped responding, that's a sign that something else might be wrong.

> The amount of RAM is 100% the reason why mine crashes.

It seems to be pretty well established at this point that 1 GiB RAM is not enough and will cause a crash. That doesn't explain the "ticking time bomb" kernel panic that I was initially reporting, though--that happens regardless of available RAM, and it happens even if all VSCode processes have been killed. It could happen hours after the processes have exited, with no observed resource usage constraints.

@clshortfuse, note that this issue is very specifically for the whole machine crashing, without any logs. If it's just the VSCode processes that are crashing, especially due to resource constraints, that's probably a separate issue.

roblourens commented 3 years ago

I don't know what other resource would be consumed until the remote is unreachable. Does the number of open file handles increase constantly? lsof | wc -l (I'm just grasping at straws here)
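
For a per-process view of open file descriptors, the sketch below counts entries under /proc/PID/fd. It assumes a Linux host; the function name is hypothetical. Note that lsof also lists entries that are not file descriptors (such as memory-mapped files and working directories), so its line count will generally be higher than these numbers.

```python
import os


def count_open_fds(proc_root="/proc"):
    """Return a dict mapping pid to the number of entries in /proc/PID/fd."""
    counts = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs available
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            fds = os.listdir(os.path.join(proc_root, entry, "fd"))
        except OSError:
            continue  # process exited or access was denied
        counts[int(entry)] = len(fds)
    return counts


if __name__ == "__main__":
    counts = count_open_fds()
    print("total descriptors visible:", sum(counts.values()))
    busiest = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for pid, n in busiest[:10]:
        print(f"pid={pid} fds={n}")
```

Without root privileges only the current user's processes are readable, so the total is a lower bound.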

hylowaker commented 3 years ago

With a t2.medium instance, I experience this issue quite often: the CPU usage goes to 99%.

This happens even with TypeScript plugin disabled.

patrickmau commented 3 years ago

I am experiencing the same issue with SSH Remote plugin when connecting to my AWS Lightsail instances.

My main instance (3GB RAM) stalls and becomes unresponsive within a few minutes after connection has been established. Command “htop” shows about a dozen or so identical vs-code-server processes that gradually fill up the RAM usage bar until all 3GB are used up and the instance becomes unresponsive.

I have disabled all plugins. My feeling is that there is a compatibility issue between VS Code and AWS as a whole, which causes the SSH plugin to repeatedly start more and more service threads. It's a real pity since I'd love to have SSH integrated, but as it stands I had to instruct the team to ban VS Code for the time being.

This behaviour appears on Ubuntu 18.04 running on AWS Lightsail with 160GB HDD and 3GB RAM, as well as other instances with 2GB and 1GB RAM.

ScripterSugar commented 3 years ago

I can confirm that this is still an existing issue. At first I suspected that my own product was crashing the server, so I traced everything down, watching and logging every process and its resource usage. Even on a clean EC2 instance with only my dev files (WITHOUT running them) and a freshly installed VSCode Remote SSH plugin, the instance gets killed. It locks up the whole system, and the only way to recover is to completely stop the instance and start it again, which kills all PM2 instances and life-time configs; that is very disappointing for dev instances. I'm going to stop using this plugin from now on, and if anyone wants to use it: NEVER USE IT ON A PRODUCTION AWS INSTANCE.

Edit: It wasn't a RAM usage problem in my case either. There was always plenty of RAM left when the instance crashed, with CPU load reaching 100% right before the crash.

steelkorbin commented 3 years ago

This issue is far more insidious than your average issue, but I do have this to contribute as an observable fact. AWS might be happy about this, for you conspiracy theorists out there, meh. Given AWS was a "NO-GO" for VS Code remote, I did what any common person would do: pivot away from AWS. I selected DigitalOcean for a comparison test, spun up a droplet with nothing on it, and attached VS Code. 5 mins later, "POW", dead, just like at AWS. At this moment: tears, frustration, sigh, ugh, and I am looking for my hammer to fix this.

Then, another few mins later, while I am lamenting this horrible reproducible event, my remote session was disconnected by the remote, and VS Code began trying to reconnect automatically. Hmm, this never happens at AWS. My eyes widened: interesting, a slightly different failure result. If you have been in the industry long enough, you know that new failure effects can sometimes add to the diag story and the ultimate resolution, so I saw this failure as a positive event. I stepped back and killed VS Code's attempts to reconnect to the dead droplet. I waited, then flipped over to the stats page for that droplet. The CPU had been taken to 100% and pegged during my VS Code session; RAM usage was unchanged, with plenty more to give. Then, following the death of my VS Code session, the CPU dropped back down. Interesting: is the droplet still operable at this point? In AWS it would not be, or would it? So I did what any of us normally do while expecting a different result: I attached VS Code again. It worked, the droplet was running, then after about 5 mins, "POW", dead, and shortly after, "disconnected". I checked the stats: same event in the logs. No errors, just maxed out at the CPU limit. So this has given me some insight to color this a little bit more.

If you have any experience with AWS you know how slow it is to change state on an instance. We have all become accustomed to it and it is mostly a moot point; we all understand boot cycles, but AWS takes the longest for stop events, and DigitalOcean is just faster at this. Because of my need to get things up and running with a failing VS Code tool, I never waited long enough to see if my AWS instance would recover. I might try it later, but it has no value going forward. When any instance/droplet maxes out its allocated CPU, the various cloud providers let us run hot for a short period to allow for spikes, but when such pressure could impact neighboring customers, they clamp down and throttle your CPU within their usage policy. That is fair. So the instance/droplet drops away due to resource constraints, but will come back after it can sort out all the CPU overhead. Nothing is broken and nothing is wrong with AWS or DigitalOcean; it is just common service-provider policy dealing with VS Code activating something out of control. I don't know what is wrong; it is above my pay grade. I would place money on some background file watcher process getting hung up on files/dirs that Linux will not let VS Code watch, or on files/dirs that it should not be watching. Just a guess. I am hoping the experts can see more than I can and sort this out.

markrosoftuk commented 3 years ago

Add me to the list: I can reliably crash an EC2 instance of any spec (tried Lightsail too, which is basically an EC2 instance anyway) in a matter of minutes just by using VS Code remote-ssh. Interestingly though, I can happily code PHP-based projects for hours with no problem using remote-ssh; it's when I'm doing React-based projects that the server will crash. I cannot do anything other than forcefully shut it down on AWS and restart. Could it be an issue with node.js?

ssprakhar commented 3 years ago

> Add me to the list: I can reliably crash an EC2 instance of any spec (tried Lightsail too, which is basically an EC2 instance anyway) in a matter of minutes just by using VS Code remote-ssh. Interestingly though, I can happily code PHP-based projects for hours with no problem using remote-ssh; it's when I'm doing React-based projects that the server will crash. I cannot do anything other than forcefully shut it down on AWS and restart. Could it be an issue with node.js?

yes, disabling TS/JS plugin greatly improves the performance.

JakeTompkins commented 3 years ago

Unfortunately that's not much of a solution for those of us working in typescript/javascript

That said, the typescript server has been a scourge on my existence for a while. My old macbook air with 8gb of ram couldn't really handle a large react project due to the typescript engine hogging all the resources. I'm still pretty green in the industry so forgive me if this is a silly question, but is it possible that vscode's implementation of the server is bugged, watching directories outside the project root, or something similar?

bhakthil commented 3 years ago

Disclaimer: I am not a server expert and have very limited knowledge of *nix based servers.

I can confirm this issue is not specific to AWS, or to cloud instances at all for that matter. We have an on-prem Ubuntu 20.04 server with 1 TB of memory and 16 cores. The VS Remote debugger still causes a kernel panic and the server shuts down abruptly. Unfortunately, this is a physical server with no self-healing, so it does not come back up until we notify the SAs. Disabling the TS/JS plugin has no effect on the crash. As some posters have mentioned, sometimes the crash is immediate and other times it happens after a while.

My project is mainly Python and only a very small number of files in the project. So, I don't believe this has anything to do with file size or the project size.

EDIT: We used to have Ubuntu 18.04 until it was upgraded to 20.04 a month ago. I have not had any issues prior to the upgrade.

yan3321 commented 3 years ago

Experiencing this issue with AWS Lightsail, on an instance with 1 vCPU and 1GB RAM. I can work on Node.JS projects for a little while, until the server stops responding after a random period of time. I have to go into the Lightsail console and stop/start it. I've gone back to using SFTP in the meantime, but I'm eager to see a resolution to this issue. The Remote - SSH extension feels a lot nicer and more intuitive to work with.

roblourens commented 3 years ago

> VS Remote debugger

@bhakthil Which one? Python? Please file an issue on their extension.

clshortfuse commented 3 years ago

@roblourens So, I checked with lsof | wc -l

The highest I got with my project is around 5000 handles when opening stuff. Idle is about 4500 handles. Closing the folder (project) brings it down to ~2100 and closing the VSCode connection drops it to ~1500.

Right now it's "stable" on my 1GB Ubuntu server, but that's because the JS/TS services constantly run out of RAM and crash their own process. I have to unset these memory values and try again, but what I do find interesting is that the node processes do not die when I disconnect VSCode. They linger on and I can see them in htop. I have to kill them manually and then they go away.

After I kill those processes manually then it drops to ~700 handles (after disconnecting).

Edit: Every time I connect I get a few more handles open, and every time I disconnect more stay open. I can understand leaving a process running for a quick reconnection, but it seems there's a handle leakage somewhere.

Initial: 872
Connect: 2076
Disconnect: 1370
Connect: 2192
Disconnect: 1540
Connect: 2386
Disconnect: 1715

This is all without opening a folder/project (which would cause processes to crash from out-of-memory errors).
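
To make the trend in those numbers explicit, here is a small Python sketch that computes how much the handle count grows from one cycle to the next, using only the figures listed above. The variable and function names are illustrative.

```python
# Handle counts copied from the list above.
samples = [
    ("Initial", 872),
    ("Connect", 2076),
    ("Disconnect", 1370),
    ("Connect", 2192),
    ("Disconnect", 1540),
    ("Connect", 2386),
    ("Disconnect", 1715),
]


def growth_between_cycles(label):
    """Differences between consecutive samples that share the same label."""
    values = [count for name, count in samples if name == label]
    return [later - earlier for earlier, later in zip(values, values[1:])]


connected_growth = growth_between_cycles("Connect")
disconnected_growth = growth_between_cycles("Disconnect")

# Handles still open after the first disconnect, relative to the start.
left_after_first_disconnect = samples[2][1] - samples[0][1]

print("growth while connected:   ", connected_growth)
print("growth while disconnected:", disconnected_growth)
print("left after first disconnect:", left_after_first_disconnect)
```

In these samples the disconnected baseline rises by 170 and then 175 handles per cycle, on top of the 498 handles that remain after the first disconnect.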

clshortfuse commented 3 years ago

Even resetting the memory values to default it seems okay now. Yeah, I know it's crashing all the time, but it doesn't kill the server (good!).

But the moment I enable the ESLint plugin (dbaeumer.vscode-eslint): dead. Like, immediately. Kills VSCode and my two terminal sessions. I don't remember if my previous issues had extensions running, but now, on VSCode 1.52.1, ESLint is a death knell. Maybe it's a file locking issue that shows up when multiple processes are trying to set up a file watch at the same time. I remember seeing webpack watcher errors, though I didn't look too much into it.

Edit: VSCode with no extensions can cause a slew of ENOSPC errors on webpack. Assume one for every file webpack interacts with. I'd imagine ESLint goes through something similar (but doesn't have a good error strategy):

Error from chokidar (/home/ubuntu/projects/materialdesignweb): Error: ENOSPC: System limit for number of file watchers reached, watch '/home/ubuntu/projects/materialdesignweb/README.md'
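
To see which processes are holding the inotify watches when ENOSPC shows up, one option is to count the "inotify wd:" lines in /proc/PID/fdinfo. This is a sketch assuming a Linux host with procfs; the function names are hypothetical, and without root it only sees the current user's processes.

```python
import os


def count_inotify_watches(fdinfo_text):
    """Count watch entries in the text of one /proc/PID/fdinfo/FD file."""
    return sum(
        1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:")
    )


def watches_by_pid(proc_root="/proc"):
    """Return a dict mapping pid to its total number of inotify watches."""
    totals = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return totals  # no procfs available
    for entry in entries:
        if not entry.isdigit():
            continue
        fdinfo_dir = os.path.join(proc_root, entry, "fdinfo")
        try:
            names = os.listdir(fdinfo_dir)
        except OSError:
            continue  # process exited or access was denied
        total = 0
        for name in names:
            try:
                with open(os.path.join(fdinfo_dir, name)) as fh:
                    total += count_inotify_watches(fh.read())
            except OSError:
                continue  # descriptor closed while scanning
        if total:
            totals[int(entry)] = total
    return totals


if __name__ == "__main__":
    ranked = sorted(watches_by_pid().items(), key=lambda kv: kv[1], reverse=True)
    for pid, total in ranked:
        print(f"pid={pid} watches={total}")
```

Comparing the sum of these counts against fs.inotify.max_user_watches should indicate how close the user is to the limit.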

clshortfuse commented 3 years ago

So, the test suite for chokidar@3.4.3 on NodeJS 12.14.1 (the same version VSCode 1.52.1 uses) fails on my SSH server.

215 passing (1m)
12 pending
2 failing

1) chokidar fs.watch (non-polling) watch symlinks should not recurse indefinitely on circular symlinks:
   Error: Circular symlink detected: "/home/ubuntu/projects/chokidar/test-fixtures/44/subdir/circular" points to "/home/ubuntu/projects/chokidar/test-fixtures/44"
   at ReaddirpStream._getEntryType (node_modules/readdirp/index.js:217:34)
   at async ReaddirpStream._read (node_modules/readdirp/index.js:128:31)

2) chokidar fs.watchFile (polling) watch symlinks should not recurse indefinitely on circular symlinks:
   Error: Circular symlink detected: "/home/ubuntu/projects/chokidar/test-fixtures/152/subdir/circular" points to "/home/ubuntu/projects/chokidar/test-fixtures/152"
   at ReaddirpStream._getEntryType (node_modules/readdirp/index.js:217:34)
   at async ReaddirpStream._read (node_modules/readdirp/index.js:128:31)

Rebooting the server, connecting only over terminal, and running chokidar tests shows 3 errors:

214 passing (1m) 12 pending 3 failing

1) chokidar fs.watch (non-polling) watch symlinks should not recurse indefinitely on circular symlinks: Error: Circular symlink detected: "/home/ubuntu/projects/chokidar/test-fixtures/44/subdir/circular" points to "/home/ubuntu/projects/chokidar/test-fixtures/44" at ReaddirpStream._getEntryType (node_modules/readdirp/index.js:217:34) at async ReaddirpStream._read (node_modules/readdirp/index.js:128:31)

2) chokidar fs.watchFile (polling) watch symlinks should not recurse indefinitely on circular symlinks: Error: Circular symlink detected: "/home/ubuntu/projects/chokidar/test-fixtures/152/subdir/circular" points to "/home/ubuntu/projects/chokidar/test-fixtures/152" at ReaddirpStream._getEntryType (node_modules/readdirp/index.js:217:34) at async ReaddirpStream._read (node_modules/readdirp/index.js:128:31)

3) chokidar fs.watchFile (polling) watch symlinks should properly match glob patterns that include a symlinked dir: AssertionError: expected addSpy to have been called with arguments test-fixtures/158-link/add.txt Call 1: "test-fixtures/158-link/change.txt" "test-fixtures/158-link/add.txt" [Stats] { atime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), atimeMs: 1609793899794.7793, birthtime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), birthtimeMs: 1609793899794.7793, blksize: 4096, blocks: 8, ctime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), ctimeMs: 1609793899794.7793, dev: 51713, gid: 1000, ino: 310347, mode: 33204, mtime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), mtimeMs: 1609793899794.7793, nlink: 1, rdev: 0, size: 1, uid: 1000 } Call 2: "test-fixtures/158-link/unlink.txt" "test-fixtures/158-link/add.txt" [Stats] { atime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), atimeMs: 1609793899794.7793, birthtime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), birthtimeMs: 1609793899794.7793, blksize: 4096, blocks: 8, ctime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), ctimeMs: 1609793899794.7793, dev: 51713, gid: 1000, ino: 310348, mode: 33204, mtime: Mon Jan 04 2021 20:58:19 GMT+0000 (Coordinated Universal Time), mtimeMs: 1609793899794.7793, nlink: 1, rdev: 0, size: 1, uid: 1000 } Call 3: "test-fixtures/158-link/subdir/add.txt" "test-fixtures/158-link/add.txt" [Stats] { atime: Mon Jan 04 2021 20:59:06 GMT+0000 (Coordinated Universal Time), atimeMs: 1609793946006.7793, birthtime: Mon Jan 04 2021 20:59:06 GMT+0000 (Coordinated Universal Time), birthtimeMs: 1609793946006.7793, blksize: 4096, blocks: 8, ctime: Mon Jan 04 2021 20:59:06 GMT+0000 (Coordinated Universal Time), ctimeMs: 1609793946006.7793, dev: 51713, gid: 1000, ino: 560704, mode: 33204, mtime: Mon Jan 04 2021 20:59:06 GMT+0000 (Coordinated Universal Time), mtimeMs: 1609793946006.7793, nlink: 1, rdev: 0, size: 1, uid: 1000 } at Context. (test.js:1225:31)

If chokidar fails here, I can see VSCode and its extensions failing the same way. Since we're talking about symlinks, this could be an infinite-loop issue.
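To check whether a project tree contains the kind of circular symlink named in the test errors above, something like the following could be run on the server. This is only a sketch: `find_circular_symlinks` and `make_fixture` are hypothetical helper names, not part of chokidar, readdirp, or VSCode, and the fixture simply mirrors the `subdir/circular` layout from the error message.

```python
import os
import tempfile


def find_circular_symlinks(root):
    """Return symlinked directories under root that point at one of their own ancestors."""
    circular = []
    # followlinks=False means os.walk lists symlinked directories in dirnames
    # but never descends into them, so the walk itself cannot loop.
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        real_parent = os.path.realpath(dirpath)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # Circular if the link target is the containing directory or an ancestor of it.
            if os.path.commonpath([real_parent, target]) == target:
                circular.append(path)
    return circular


def make_fixture():
    """Build base/subdir/circular -> base (hypothetical fixture for illustration)."""
    base = tempfile.mkdtemp()
    subdir = os.path.join(base, "subdir")
    os.mkdir(subdir)
    os.symlink(base, os.path.join(subdir, "circular"))
    return base


if __name__ == "__main__":
    print(find_circular_symlinks(make_fixture()))
```

Pointing `find_circular_symlinks` at a real project directory instead of the fixture would list any links that a watcher following symlinks could recurse through.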

bhakthil commented 3 years ago

VS Remote debugger

@bhakthil Which one? Python? Please file an issue on their extension.

The problem occurs regardless of the extension(s). The server crashes without any extensions installed.

LiandriCorp commented 3 years ago

I hate to "me too" but I must add that this exact same behavior is hitting us in GCP. It mirrors everything described above.

jordansavant commented 3 years ago

Can confirm: this has been happening to my team for months on a variety of EC2 instances running Ubuntu (at different versions and on different instance sizes).

bhakthil commented 3 years ago

I was able to fix the issue by manually installing the VS Code server in my Ubuntu environment. It looks like the log file corresponding to the commit got corrupted and was causing the broken SSH connections. The .log file in the .vscode-server folder can be deleted safely without breaking the server. I haven't had the issue since.

steelkorbin commented 3 years ago

Alright! bhakthil, I think you're onto something here! I just logged in directly over SSH and deleted the old .log and .pid in the .vscode-server/ folder of my root user login. My droplet has now been running longer than it did before, and it is still going. This really makes sense to me: I figured it was a file that vscode could not access or should not be accessing, but I did not think it would be one of its own files. I am going to dig into this today and stress test it. I might even get more detail on the issue with some brand new droplets. Looks very encouraging, bhakthil. Thanks for the mention; I will let you know how this goes. Kind of excited if this works out.

steelkorbin commented 3 years ago

Full CPU is not an issue here. I have a service running and putting things under pressure, and vscode is still excellent: fast, nimble, and responsive. No lockup is happening; even OOM events are not taking the droplet down. Getting really excited now.

clshortfuse commented 3 years ago

On v1.53 the CPU goes up to 100% and kills the Amazon EC2.

On Insiders (~v1.54), with the new chokidar (3.5.1), the CPU no longer goes up to 100%. But it's still killing the server, just not because of CPU anymore.

The first death is v1.53 at 18:29. The second is ~v1.54 at 18:43.

My testing is 1GB RAM, ESLint plugin, and running a task that spawns webpack v4.

Edit: Worth noting, on v1.53, I need to force stop the EC2. With v1.54 the instance could safely be stopped. Seems like it's responding enough for Amazon metrics, but not enough to SSH into it?

clshortfuse commented 3 years ago

Tried it again. Hit 4400 open handles and that's all she wrote. Just locked up exactly like that^

CPU hit 100% on v1.54, and Amazon was unable to get it to respond. Had to force stop.

Edit: I was only opening 3 JS files with type-checking/intellisense.

steelkorbin commented 3 years ago

clshortfuse, I feel your pain, brother, but I think your issues are another topic. I just got done pounding my droplet out of existence until vscode could not get a word in edgewise. I dropped the remote session and reopened it and, presto, it was back in business; the droplet was not locked up and dead as it was in prior tests/use. I am leaning towards the bad access of the .log and/or .pid files that bhakthil mentioned as the fatal issue we have all been fighting, before we could even start to look at other bad code that blows out the CPU and memory.

clshortfuse commented 3 years ago

@steelkorbin Right, but the files you are describing don't exist on my side. I don't have that exact issue.

Everything points to a file system issue related to the JS/TS services. How the issue presents differs based on the operating system. On my EC2 instance with Webpack, and on the original commenter's, it shows up as file watcher issues. As I continue to open more handles, the chance of halting the system increases. I can run VSCode fine if I never open ESLint and don't execute Webpack scripts.

The fact that the CPU is sitting at 100% suggests some sort of infinite loop. My guess is that it's recursive symbols. Just because SSH keeps responding on whatever machine you're using doesn't mean the problem is really solved. It's up to the OS to decide whether SSH can continue to work at 100% CPU. With Amazon, it doesn't. Still, 100% isn't normal, and if you say that you're using external services that are causing the 100% and that it's unrelated, I suggest you modify the setup to build a better control group for testing. We can't just ignore 100% CPU use.

steelkorbin commented 3 years ago

You are absolutely correct and I agree. I had just put my droplet under pressure to see if I could kill it and whether vscode would operate as expected after the log file deletion mentioned above, and that worked. But you're right: the actual cause may extend to more files than just these, or lie in the way the JS/TS services build files, which might need some serious critical review. I am encouraged that I might now have a chance to even look at these other issues; they might all be part of the same problem, as you may well know. I agree with your insight here.

clshortfuse commented 3 years ago

@roblourens Is there some sort of log file we can read, or a flag we can set so we can see what's causing the 100% CPU? Perhaps a debug flag?

I'd imagine that file would spit a bunch of lines repeating the same thing. Also, if a VSCode process crashes, is there a backoff strategy? I can see file watching failing or out of memory causing a crash, and VSCode continually spawning up the process ad infinitum with 0 delay, bringing the server to its knees.
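For illustration, the kind of backoff strategy being asked about could look like the sketch below. This is not how VSCode restarts its processes (the thread does not establish that); `backoff_delays`, `restart_with_backoff`, and the `start` callable are hypothetical names, and the base, factor, and cap values are arbitrary assumptions.

```python
import itertools
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield restart delays in seconds: base, base*factor, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def restart_with_backoff(start, max_restarts, sleep=time.sleep):
    """Call start() until it returns True or max_restarts retries are used up.

    start is a hypothetical callable that returns True when the child process
    ran and exited cleanly, and False when it crashed.
    Returns the list of delays that were waited between attempts.
    """
    waited = []
    delays = backoff_delays()
    for attempt in range(max_restarts + 1):
        if start():
            break
        if attempt == max_restarts:
            break  # give up instead of respawning forever
        delay = next(delays)
        sleep(delay)
        waited.append(delay)
    return waited


if __name__ == "__main__":
    print(list(itertools.islice(backoff_delays(), 8)))
```

With a growing delay and a retry limit, a process that crashes on every start (for example on a watcher error) costs a few spawns per minute rather than a tight respawn loop.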

Edit: Checked ~/.vscode-server-insiders/data/logs/20210212T184922/remoteagent.log and the last entry is:

[2021-02-12 18:55:53.969] [remoteagent] [error] [File Watcher (node.js)] Failed to watch /home/ubuntu/.vscode-server-insiders/bin/6ac9a3ecb3698e82bf901f11bbb5940f6bc3c197/extensions/node_modules/typescript/lib/lib.dom.d.ts for changes using fs.watch() (Error: ENOSPC: System limit for number of file watchers reached, watch '/home/ubuntu/.vscode-server-insiders/bin/6ac9a3ecb3698e82bf901f11bbb5940f6bc3c197/extensions/node_modules/typescript/lib/lib.dom.d.ts')

IIRC, the last thing I did was open that file and the server stopped responding (CPU stayed 100%).

Worth noting, I guess: this code path isn't using chokidar at all, but the Node.js watcher (fs.watch()). I wonder if I can force chokidar...

Edit2: I was able to easily replicate it again. I open up a project with a DOM reference and open lib.dom.d.ts. I get the same error as pointed out above in remoteagent.log:

[2021-02-13 14:28:57.752] [remoteagent] [error] [File Watcher (node.js)] Failed to watch /home/ubuntu/.vscode-server-insiders/bin/6ac9a3ecb3698e82bf901f11bbb5940f6bc3c197/extensions/node_modules/typescript/lib/lib.dom.d.ts for changes using fs.watch() (Error: ENOSPC: System limit for number of file watchers reached, watch '/home/ubuntu/.vscode-server-insiders/bin/6ac9a3ecb3698e82bf901f11bbb5940f6bc3c197/extensions/node_modules/typescript/lib/lib.dom.d.ts')
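To see whether remoteagent.log keeps repeating the same failure, the entries can be tallied by watched path. The sketch below assumes every line follows the `[timestamp] [source] [level] message` shape of the two entries quoted above; `tally_watch_failures` is a hypothetical helper, and the sample lines are copied from this thread.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the two log entries quoted in this thread.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<source>[^\]]+)\] \[(?P<level>[^\]]+)\] (?P<msg>.*)$")
WATCH_RE = re.compile(r"Failed to watch (?P<path>\S+) for changes")

WATCHED = ("/home/ubuntu/.vscode-server-insiders/bin/6ac9a3ecb3698e82bf901f11bbb5940f6bc3c197"
           "/extensions/node_modules/typescript/lib/lib.dom.d.ts")
SUFFIX = (" for changes using fs.watch() (Error: ENOSPC: System limit for number of "
          "file watchers reached, watch '" + WATCHED + "')")
SAMPLE = [
    "[2021-02-12 18:55:53.969] [remoteagent] [error] [File Watcher (node.js)] Failed to watch " + WATCHED + SUFFIX,
    "[2021-02-13 14:28:57.752] [remoteagent] [error] [File Watcher (node.js)] Failed to watch " + WATCHED + SUFFIX,
]


def tally_watch_failures(lines):
    """Count ENOSPC 'Failed to watch' errors per watched path."""
    counts = Counter()
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if not entry or entry.group("level") != "error":
            continue
        msg = entry.group("msg")
        failure = WATCH_RE.search(msg)
        if failure and "ENOSPC" in msg:
            counts[failure.group("path")] += 1
    return counts


if __name__ == "__main__":
    for path, n in tally_watch_failures(SAMPLE).most_common(5):
        print(n, path)
```

Passing an open log file instead of `SAMPLE` works the same way, since the function only iterates over lines.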

max_user_watches is set to 8192, but I've only once seen it hit 6000 open handles with lsof | wc -l. Usually it's at 4000 when it hangs.

Edit3: Just hung again after setting max_user_watches to 524288. Last open handle count was 4268
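One caveat on the numbers above: `lsof | wc -l` counts open file handles, while `max_user_watches` limits inotify watches, and many watches can sit behind a single inotify file descriptor. On Linux, each watch appears as an `inotify wd:` line in `/proc/<pid>/fdinfo/<fd>`. A rough sketch for counting them per process follows; the function names are hypothetical, and it only sees processes the current user is allowed to inspect.

```python
import os


def count_inotify_watches(fdinfo_text):
    """Count watches in the contents of one /proc/<pid>/fdinfo/<fd> file."""
    return sum(1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:"))


def inotify_watches_by_pid(proc_root="/proc"):
    """Return {pid: watch count} for every readable process holding inotify fds."""
    totals = {}
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                # inotify instances show up as links to "anon_inode:inotify".
                if os.readlink(os.path.join(fd_dir, fd)) != "anon_inode:inotify":
                    continue
                with open(os.path.join(proc_root, pid, "fdinfo", fd)) as handle:
                    watches = count_inotify_watches(handle.read())
            except OSError:
                continue
            totals[int(pid)] = totals.get(int(pid), 0) + watches
    return totals


if __name__ == "__main__":
    # Linux only: print the ten processes holding the most watches.
    ranked = sorted(inotify_watches_by_pid().items(), key=lambda item: -item[1])
    for pid, watches in ranked[:10]:
        print(pid, watches)
```

Comparing the per-user total against `/proc/sys/fs/inotify/max_user_watches` would show whether the watch limit is actually being reached at the moment of the hang.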

microsoft / vscode-remote-release

Remote VSCode over SSH crashes EC2 instance #2692