kframework / kweb

Online extensible IDE for the K Framework and other formal verification projects. Example deployment at http://kframework.org/kweb/

[critical] K or kweb or something opens up too many files on fslweb #19

Open pdaian opened 10 years ago

pdaian commented 10 years ago

From Joel in IT:

Grigore,

Fslweb is back up and serving the website.  We are, unfortunately, not in control of the network update schedule, but we do apologize for the outage.  That said, here is some technical information regarding what happened.

The logs were recording many messages of this form:

Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error parsing timestamps file '/var/lib/NetworkManager/timestamps': Too many open files
Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error saving timestamp: Failed to create file '/var/lib/NetworkManager/timestamps.F7ZEOX': Too many open files

These messages appeared for days, and perhaps months, prior to the outage.  This leads me to believe there was an issue already present that the network outage merely exposed.

Nov  1 05:40:40 fslweb NetworkManager[1795]: <warn> sysctl: failed to open '/proc/sys/net/ipv6/conf/eth0/accept_ra': (24) Too many open files
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4364] [nm-device.c:3486] nm_device_update_ip4_address(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4476] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <info> (eth0): bringing up device.

This is the time of the actual network outage.  You can see that the interface fails to come back up due to a lack of available file handles.  It then spams the same line repeatedly until I restarted the machine this morning:

Nov  2 03:45:52 fslweb NetworkManager[1795]: <error> [1414921552.22183] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.

Checking the max open file handles, 1,620,366 is the number of files the system will open concurrently.  That's a million and a half open files.  From checking the backup stats on the machine it looks like the machine itself has almost 7 million files in just 115GB of space.  This leads me to believe that the issue that caused the machine to not come back up after the networking outage was the open files, not something directly related to the network.  I need to run to a meeting, but I'll provide additional information this afternoon.
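For comparison with the system-wide ceiling Joel measured, the limit a single process (such as the kweb server) runs under can be read from the standard library. A minimal Unix-only sketch, separate from the ~1.6M system-wide figure above (which on Linux is the `fs.file-max` sysctl):

```python
import resource

# Per-process limit on open file descriptors. The soft limit is what
# actually triggers "Too many open files" (EMFILE) for one process; a
# process may raise it up to the hard limit without root.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process fd limit: soft=%s hard=%s" % (soft, hard))
```

Note that NetworkManager's `(24) Too many open files` errors above report errno 24 (EMFILE), i.e. a per-process limit being hit, which is exactly this knob rather than the system-wide one.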

Joel
kheradmand commented 10 years ago

I guess the anon folder is for saving unregistered users' files. If that's the case, we should remove those files once an anonymous user's session ends (or, for example, after 24 hours).

Right now we have ~1M files (925,975) in the /srv/kweb/kfiles/anon folder.
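A minimal sketch of the proposed cleanup, assuming each anonymous session gets its own directory directly under the anon folder and that the directory's mtime is a good enough proxy for last activity (both assumptions; the real kweb layout may differ):

```python
import os
import shutil
import time

MAX_AGE = 24 * 60 * 60  # the 24-hour window suggested above, in seconds

def clean_anon(root, max_age=MAX_AGE, now=None):
    """Delete session directories under `root` untouched for > max_age."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(root):
        session_dir = os.path.join(root, name)
        if not os.path.isdir(session_dir):
            continue
        # Directory mtime approximates "last activity" (an assumption).
        if now - os.path.getmtime(session_dir) > max_age:
            shutil.rmtree(session_dir)
            removed.append(name)
    return removed
```

Run daily from cron against /srv/kweb/kfiles/anon, this would keep the file count bounded instead of letting it approach the system-wide handle limit.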

kheradmand commented 10 years ago

Also we have this function:

```python
def get_file_meta():
    ...
    return open(collection.get_collection_path() + path + file + '.meta').read()
    ...
```

I'm not sure whether Python automatically closes these files or not.
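For reference, a hedged sketch of how that helper could close its handle deterministically with a context manager instead of relying on garbage collection (the parameters and the use of `os.path.join` are assumptions; the original concatenates the path pieces directly):

```python
import os

def get_file_meta(collection, path, file):
    # Assumed layout: join the pieces the original snippet concatenates.
    meta_path = os.path.join(collection.get_collection_path(), path, file + '.meta')
    # 'with' closes the descriptor on exit, even if read() raises,
    # rather than leaving it to the garbage collector.
    with open(meta_path) as f:
        return f.read()
```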

pdaian commented 10 years ago

Python's garbage collector is supposed to close file handles. I'm not sure why that wouldn't be happening, if this is a Python issue. The garbage collector is definitely working, because our memory isn't growing without bound.
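That intuition is CPython-specific: reference counting closes a file the moment its last reference dies, so only a handle kept alive somewhere (a cache, a long-lived list) actually leaks. A Linux-only sketch demonstrating this by counting entries in the process's descriptor table (an implementation detail; other Pythons such as PyPy don't close promptly):

```python
import os
import tempfile

def open_fd_count():
    # Linux-specific: one entry per open descriptor in this process.
    return len(os.listdir('/proc/self/fd'))

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'meta')
tmp.close()

before = open_fd_count()
f = open(tmp.name)          # no close(): the descriptor stays open
leaked = open_fd_count()    # one more than the baseline
del f                       # CPython's refcounting closes it immediately
after = open_fd_count()     # back to the baseline
os.unlink(tmp.name)
```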

kheradmand commented 10 years ago

Then that is most probably not the problem :)

pdaian commented 10 years ago

It still may be the issue :+1: ... it could be a bug in Python, or maybe we're running a really old version. I agree with you that we need to do file cleanup, though; it's been on my list for a while, but right now it's just a manual thing. Worth noting the anon folder has reached over 10GB before (~1 year of use) with no problems. I did clean up all the files a few days before the crash, so maybe this was somehow a consequence of that. Going to be hard to say without some more investigation.

kheradmand commented 10 years ago

One thing I've noticed in the file list that Joel sent is that all files in, for example, ./kweb/kfiles/anon/8a9dfe95-fe26-4883-b974-c239b2db4064/ were open on the server. The only call I've found so far in your code that touches all those files is shutil.copytree, but that doesn't explain it.

kheradmand commented 10 years ago

Oh, I just realized that Joel said those files were 'created', not 'open'. Do you have access to the list of currently open files on the server? I do not.

pdaian commented 10 years ago

You can run `sudo lsof` if you have root access. Nothing out of the ordinary there right now; well under 7k files open.

kheradmand commented 10 years ago

Hmmm Thanks :)

kheradmand commented 10 years ago

As far as I've investigated, no new files are being opened over time. But just in case, I created a cron job that looks for newly opened files every day.
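On Linux the kernel already exposes the numbers such a daily check would want to record. A sketch of reading them (the three fields of `/proc/sys/fs/file-nr` are allocated handles, free handles, and the system-wide maximum, i.e. the ~1.6M ceiling from Joel's mail; what the cron job actually logs is an assumption):

```python
def file_handle_stats():
    # /proc/sys/fs/file-nr: allocated, free, system-wide max (fs.file-max)
    with open('/proc/sys/fs/file-nr') as f:
        allocated, free, maximum = (int(x) for x in f.read().split())
    return allocated, free, maximum

print(file_handle_stats())
```

Logging these three numbers once a day would make a slow handle leak visible long before the machine hits the wall again.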

grosu commented 10 years ago

Guys, if we should get a new fslweb machine/server, please let me know.

Grigore



pdaian commented 10 years ago

@grosu I don't think a new server is required. We will definitely need to move kweb to the cloud eventually, though, because if several people use K at once, it's already more CPU than any single machine can handle well.