liftoff / GateOne

Gate One is an HTML5-powered terminal emulator and SSH client
http://liftoffsoftware.com/Products/GateOne
Other
6.28k stars 925 forks source link

GateOne memory issue #296

Open wcypierre opened 11 years ago

wcypierre commented 11 years ago

Hi,

In my previous issue, I've asked on how to apply a limit for the number of terminals that can be created. However, I've noticed that even though you're only allowing one terminal to be spawned, but if the user repeatedly closes the terminal(the X button inside the GateOne interface), but still, the old python process won't be killed and yet the new python process will be spawned, effectively increasing the memory usage.

Steps to reproduce:

  1. Open a process monitor application and observe the gateone process
  2. Access the gateone terminal and close the terminal via the inline close button multiple times
  3. Observe the process monitor application

Testing Environment: OS: xubuntu 13.04 amd64 and ubuntu 13.04 amd64 Python version: 2.7.4 GateOne version: 1.1 and 1.2(tested by forking the latest revision and the same thing happens) Plugins: dtach, pillow, stdeb, slimit, cssmin, mutagen Settings: default settings with the number of terminals that can be spawned limited to 1

Any solutions for this problem?

liftoff commented 11 years ago

I just tried this and I can see memory consumption growing... But it might just be due to the fact that there's a delay between when you close a terminal and when Python calls the __del__() method on the Multiplex instance. In other words, the Python interpreter should free up that memory eventually.

I am doing a detailed analysis of it now though. I'll definitely keep this open until I've found out for certain that there isn't a real memory leak at work.

liftoff commented 11 years ago

OK, after some testing I don't think that this is a memory leak. There's definitely an upper bound as to how much more memory will be used when terminals are closed repeatedly.

For me (using the same exact version of Python as you) the gateone.py process starts out at around ~17,200k (single user connected) and grows to around ~22,900k at which point it stops growing no matter how many terminals I close.

If i reload the page a bunch of times and keep closing terminals I can get it up to around ~28,000k which is to be expected from cached templates, memoized policy information (yes, it does this), and whatnot.

Now that I've got it up that high I'm going to leave it overnight. If I'm correct the memory utilization should drop considerably when I check it tomorrow morning. If I'm wrong it will still be hovering above the ~22,000k mark.

For reference, if you run gateone.py with --logging=debug you can see the __del__() calls where the Python interpreter is removing those leftover Multiplex instances. I also recommend just holding down Ctrl-Alt-W when testing (closes terminals pretty quick that way).

I haven't done serious process profiling in a while... I'll definitely put gateone.py through its paces tomorrow once I've got the results from the overnight tests.

wcypierre commented 11 years ago

I'm not particularly sure myself for whether if this is a memory leak as I've just found it out recently myself(have not gone through much testing yet). And actually, I was talking about the ssh_connect.py instead of gateone.py. The gateone.py doesn't consume that much of memory(even though you kept creating the instances over and over again) as compared to ssh_connect.py in total, as for every terminal, one ssh_connect.py that uses 8.8mb will be spawned, 1 dtach which is around 640kb and 1 ssh for 648kb, which is almost 10mb for 1 ssh instance). My initial concern when I reported this issue would be on what if the user keeps closing the terminal, which will spawn another terminal and keeps doing so, reducing the amount of usable RAM and in the end overloading the node.

wcypierre commented 11 years ago

I've been running it for a few hours(around 6-7 hours now), and it seems like the ssh_connect.py is still not freed yet

wcypierre commented 11 years ago

and I've tried to remove dtach and retried it and it seems that there's a memory leak issue here(although I may be wrong here).

liftoff commented 11 years ago

Huzzah! Overnight my gateone.py process went from ~22,000k down to ~12,600k. That was with a single connection sitting idle overnight. Note that I actually have several plugins and applications running other than just the Terminal (secret stuff; shh! :wink: ). So your gateone.py process will probably be a bit smaller.

I also double-checked the garbage collection situation by executing:

GateOne.ws.send(JSON.stringify({'terminal:debug_terminal': 127})); // 127 is the terminal in question

...in my JavaScript console. Whenever Gate One gets the "terminal:debug_terminal" WebSocket action it calls gc.collect() and prints out the number of "unreachable objects" (among other things). The line in question looks like this:

gc.collect(): 9495

So in that time there were 9495 unreachable objects which isn't too bad considering I closed 126 terminals in a row before I left it (when you have a serious memory leak that number will go up into the millions--sometimes =). Subsequent calls to "terminal:debug_terminal" will reveal that the number of unreachable objects drops to 0. So that means the garbage collection worked.

Something else to note: When you make that debug call it imports the gc module which will increase memory utilization. Also, the very act of collecting garbage increases memory utilization slightly. Furthermore, calling gc.collect() has the tenancy to pull memory out of swap and back into main memory... Taking your memory utilization higher (even if only temporarily).

I also want to mention an interesting (soon-to-be-documented-I-swear) feature of Gate One that's relevant to this issue: When the last user session times out (controlled with the session_timeout setting) gateone.py will restart itself. When that happens it will free up a lot of memory and be extremely idle. What I mean by that: inside of Gate One there's all sorts of self-maintenance and checking operations going on at regular intervals. Those things don't actually get kicked off until the first user authenticates (via the WebSocket). So a gateone.py that hasn't been used in a while (default: 5 days) will automatically reduce its memory footprint and CPU utilization to as minimal as possible.

I haven't done serious profiling in quite some time so I figure now's as good a time as any. I have to fix a bug related to dependency injection on the JS side of things but once that's done I'll whip out my mad profiling skillz :laughing:

wcypierre commented 11 years ago

Don't worry, I'll keep it between us ;)

For your single connection, did you connect to another server/localhost before you start idling it? As I've noticed that only the ssh_connect.py is eating a lot of memory and it is not freed(it is freed only after you've provided the login details, having entered hostname, port and username, leaving the password to be entered. The ssh connection will be initiated and the ssh_connect.py will be terminated and at there, the ram usage will be much lesser compared to when ssh_connect.py is on(which is the condition where the user has not done anything to the terminal)

and, I've thought that the session_timeout has been implemented lol as I've saw that in the 10server.conf and I've even set the time to 1 day(which I haven't tried it out yet as I found this issue).

liftoff commented 11 years ago

I really should do profiling more often... I just pushed a commit that is awesome! It reduces CPU consumption by half for the biggest CPU-hog in all of Gate One (the terminal emulator's spanify_screen() function). Not only that but the way I pulled it off is wicked cool (if I do say so myself =).

I was also able to significantly reduce the speed at which memory of the python interpreter grows when you close terminal after terminal over and over again. Having said that, it still seems to grow to a certain level and then just stay there until an extended period of idling. I don't think I can do anything about that--it's just Python's memory management at work. Fortunately the place it stops growing is really quite small (~22-28MB depending on any number of factors); about 0.007% of the memory on a 64-bit 4GB system. On 32-bit systems with less RAM it will be a lot smaller.

Now, in regards to how much memory ssh_connect.py uses: Yeah, it's kind of high for what it does. However, it does free up all that memory once the connection is established. I have taken your concerns under advisement though and I'm going to see if I can implement a timeout that just has ssh_connect.py kill itself after a few minutes of inactivity. What do you think of that?

wcypierre commented 11 years ago

Cool, will do a fresh git clone then :)

Yup. It does free the memory once the connection is established, but my concern was on assuming that you have a server with 16GB RAM, and what if 1600 users spawn one terminal each? it will overload the server and apparently it is even more effective than floods like tcp or udp flood(1600 requests can't do much for a server with 16GB RAM lol)

yeah, that workaround would be good as it would fend off those users who just access gateone for fun and yet not doing anything. btw, just suggesting, but is it possible to add an init.d argument to kill these inactive terminals(ssh_connect.py)? like service gateone killinactiveterms so that I can setup a monitoring script and if the RAM goes high then I'll execute this command to perform a cleanup.

liftoff commented 11 years ago

Well, the session_timeout is supposed to take care of that. 1600 terminals--when you have each user limited to 1 each--means 1600 browsers out there in the world sitting idle. Sooner or later they'll navigate away from that tab and the session will die after your timeout.

I like the idea of having a memory watcher though that kills idle terminals if a threshold is reached. Not sure how I'd implement that but I'll do some research.

wcypierre commented 11 years ago

That is just an example though. There can always be an abuse user that spawns a lot of terminals within the same computer itself(although it doesn't apply to my case).

Just saying but I'm planning to use GateOne to promote the usage of command line(educational purpose) and there's a security course at my place here so I gotta be more careful on handling the vps to prevent these students from actually dos~ing the vps as my vps's ram is not really much(1gb, so 100 terminals will kill my setup but the amount of users are not much as well).

That would be really cool though, as I would not need to setup the script myself(which is pretty hackish most of the times) :)

liftoff commented 11 years ago

I am still investigating the memory consumption of ssh_connect.py but I just pushed a commit that reduces the initial memory consumption considerably. I basically switched from importing everything at the top of the script (which every guide to Python says you're supposed to do; whatever!) to dynamic imports as-needed in each individual function. I even replaced a single compiled regular expression with a function so that I could move the import re line inside of it (to be called when needed).

On my system these simple changes reduced the initial memory footprint from 5,508k to 3,396k. There's got to be a way I can reduce that some more though (maybe?). I may just have to re-write the damned thing in shell (ugh). It does a lot of stuff. Way more than you'd think something as simple as a wrapper around SSH would do.

wcypierre commented 11 years ago

First and foremost, thanks for the effort though. Really appreciated it.

However, I am unable to test it as there is an issue of cpu ramping whilst generating the ssl keys using openssl(the cpu usage will ramp to 99% and I had to shutdown my vps to prevent it from being banned for abuse)

It was done on a clean setup btw.

liftoff commented 11 years ago

That's a really strange problem. Did you install the pyOpenSSL module? That might fix it for you. Gate One only calls the openssl command directly if pyOpenSSL can't be found. The pyOpenSSL way is quite a bit faster (no idea why--they generate the same exact key/certificate).

wcypierre commented 11 years ago

Apparently, it works fine when pyopenssl is installed. But for the previous versions, I've been using openssl and it is working fine(the cert gets generated without any issues), but in the latest commit, the openssl keeps running and it uses 99% of cpu in top's process column and unfortunately, I will not be able to test it anymore as I'm running it on a vps and such usage is considered an abuse so I don't want to get my account suspended. Sorry about that it though.

liftoff commented 11 years ago

What VPS provider are you using? Can you give me more details about the environment? I'll see if I can replicate it. I mean, it's a really interesting issue as it could be any number of things.

Gate One's utils.py includes a function, timeout_func() that will call any function you want and if that function doesn't complete within a specified timeout it will kill the function. The function that calls the openssl command, gen_self_signed_openssl() uses timeout_func() with a 30-second timeout (calls openssl multiple times for each step of the process). So if the openssl command is running in the background gobbling up CPU for more than 30 seconds something isn't working properly. Gate One should automatically kill it and generate an error message.

Did you wait at least 30 seconds before killing gateone.py? I'm guessing that either your VPS is just really, really slow (slower than my router, even; probably under heavy load at the time) or their copy of openssl is broken.

wcypierre commented 11 years ago

Host: http://liquid-solutions.biz/. Plan: Level 2 1GB openvz at Seattle Template: ubuntu 13.04 x86_64

However, it doesn't seems to be like the host's issue, as I've spawned a DO instance and the same thing happens as well.

Here's the script that I used to setup gateone: https://gist.github.com/wcypierre/6567867/raw/315696238e49c7363f34682b038ff093cf6fccd2/gateone_setup.sh

wcypierre commented 11 years ago

It works well when pyopenssl is installed, though.

liftoff commented 11 years ago

Well, I did some testing and speed appears to be the issue. They're just slow VPSes--at least when it comes to the type of computation required by openssl genrsa. VPSes--in general--are often optimized for delivering content (IO) and not for any sort of complex computation. If you let the openssl command run for a while it will probably spit out a proper key--eventually.

I suspect that the VPS provider in question is probably placing limits on processes spawned by fork() which would explain why it runs slower than usual. Fork protection is quite a useful feature for shared resources because it will help stop fork bombs and the like. This would explain why it runs fast inside of gateone.py but slow when executed inside of a fork'd shell.

Also, I have a suggestion for your gateone_setup.sh script: Add slimit and cssmin to your pip install line. That'll have Gate One minifying all JS and CSS before it gets sent to clients (saving bandwidth, mostly).

wcypierre commented 11 years ago

It seems to be a reasonable reason. I've already kept it running for 5 minutes and it seems to be still generating the cert so I gave up and used pyopenssl instead.

Have already replaced it in accordance to your suggestion.

Anyway, really appreciated the help and I will be awaiting for the inactive-terminal-killer :)

liftoff commented 11 years ago

Hey there... Just wanted to let you know that I just pushed a commit that should address one of the concerns you posted previously: How to handle 1600 users with one terminal open each? I added an idle timeout to ssh_connect.py!

So if a user connects to Gate One, opens a terminal, then never makes an outbound ssh/telnet connection that ssh_connect.py process will kill itself after 120 seconds. Well, it executes a shell script that replaces it's process resulting in only a teeny amount of memory being used (as opposed to the multiple megabytes that ssh_connect.py normally uses).

Hopefully this new feature will help out your use case.

wcypierre commented 10 years ago

Hi, first and foremost, sorry for the late reply as I was a bit busy recently. Back to the main topic, considering that the process has already timed out, could you kill the related processes like dtach and sh so that it wouldn't clog the RAM? I think that it would be possible as the sh and dtach is a child of the main gateone process so getting the pid from ppid shouldn't be a problem.

wcypierre commented 10 years ago

it would be better than using CTRL+C to kill the processes as processes that has already timed out should be killed anyway(but I believe that the CTRL+C is still needed though, but more of a cosmetic to refresh the terminal rather than to refresh the terminal + killing the remaining processes)

wcypierre commented 10 years ago

btw, I have realised that you've changed the behaviour of the max term. You've changed from beeping and closing the terminal when it reaches the threshold to not displaying anything when it reaches the threshold. However, probably due to the need to count the number of terminals being spawned or such, the memory usage increases if you leave it overnight.

Can you please take a look at it(I've changed the execution of program from ssh_connect.py to /bin/login, if that matters)? :)

liftoff commented 10 years ago

Confirmed: There's definitely a bug with the max_terms policy that is preventing the notification from being sent. I'll fix that. It is supposed to be sending a notification to the user indicating why the terminal didn't open and also clean up any loose ends at the client (so they don't get an empty container)

As for the Ctrl-C/process control thing: It's just a compromise. If I actually kill the terminal Gate One will automatically open a new one (if no other terminal commands are configured). Also, the user could return to a terminal they left open and find, well, nothing in its place. No message indicating why it was closed since it would have faded away by then.

I'm pondering disabling the feature that automatically spawns a new terminal when the last one closes (if no other commands are configured) but that would be a kind of break with backwards compatibility.

wcypierre commented 10 years ago

I was wondering of suspending the spawn of a terminal until the user presses CTRL+C(so CTRL+C would be the trigger to spawn the terminal, but that could lead to some new issues as well), how's your opinion bout it?

offtopic: can you take a look at Order #108? ;)

liftoff commented 10 years ago

I'm not sure what you mean... You want a user to have to press Ctrl-C to open new terminals? That would be wacky but you could implement this yourself in a simple shell script wrapper (around whatever command you like).

wcypierre commented 10 years ago

no, what I meant is to kill the forked python processes when the ssh_connect.py times out and respawn the terminals when CTRL+C is pressed - which is what gateone is doing now(I've changed it to use /bin/login and it refreshes the thing every now and then).

tl;dr: ignore what I've said earlier

liftoff commented 10 years ago

Does that mean I can close out this issue?

wcypierre commented 10 years ago

yeah.

offtopic: before I forget, can you take a look at Order #108 for your paywhatyouplease donation?

liftoff commented 10 years ago

I did look at that order... I replied to it already. Did you not get my direct reply?

wcypierre commented 10 years ago

nope, did you sent it via email or you've put it at the order page? As I have not been able to access the order page, it says "Access Denied"