@lukecampbell > Pulling you in to see if you have any ideas on ways to either whitelist the portal hits out of the logs or purge them programmatically on a regular basis. I'm also going through the GN system manual to see if I can glean something there.
[In the interim, Greg and I will be keeping a closer eye on the disk utilization.]
There's a service called logrotate on most, if not all, Linux distributions that handles rotating and cleaning up log files.
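For this case, a minimal logrotate sketch might look something like the following (the Tomcat log path is an assumption and would need to match wherever catalina-daemon.out actually lives on GN25):

```
# /etc/logrotate.d/geonetwork-tomcat  (hypothetical drop-in file)
/usr/local/tomcat/logs/catalina-daemon.out {
    daily            # rotate once per day
    rotate 7         # keep a week of rotated logs
    compress         # gzip rotated logs to save space
    missingok        # don't error if the file is absent
    notifempty       # skip rotation when the file is empty
    copytruncate     # copy then truncate in place, since Tomcat keeps the file open
}
```

copytruncate matters here because Tomcat holds the file open; with the default rotate-and-create behavior, Tomcat would keep writing to the old inode and the space would not be freed until a restart.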
Do we know when the disk space for this server became a problem? Is Greg working on it? Has this been fixed yet?
Just want an update. Once it is fixed, we should document this in a brief postmortem report.
It is fixed. Greg just needs to document his end of things. (And I wanted Luke's feedback on logrotate, so this may remain open until we get something like that in place.) But GN itself is back, and we just need to put something in place to keep it from occurring again.
Good to hear it is fixed. How much downtime was there for that server? Again, a brief report would be great, or it could be integrated into the quarterly report on GLOS infrastructure that we are starting to do.
I don't believe it needs to be reported, as the main person affected was me. I was unable to add data starting at about 3pm on Wednesday (and was not working on Thursday, so figured it could wait until today). However, GN should have still been available to the general user and the portal until this morning, when it was down for less than an hour while Greg and I figured out what needed to be done to clear the disk space and reboot the server. And it only took that long because it was a bit of a learning curve for both of us. I believe it would have been transparent to the user, because they would only have noticed if they were trying to click on the metadata link from a Portal record during that small window of complete downtime this morning. Plus, if it occurs again (which it should not), it would be a matter of a few minutes to clear now that we know what needs to be done.
The VM that hosts GeoNetwork (GN25.glos.us; 64.9.208.34, a VM on the server 192.168.79.111) ran out of disk space.
The GN log files have been growing unchecked, to the point that the catalina-daemon.out file was over 151 GB in size. Looking at the logs (some of which were also significant in size), it appears that the massive logging might be related to the Portal hitting GN.
Example log hit:

```
73.168.47.143, 192.168.76.104 - - [26/Apr/2017:00:37:26 +0000] GET /metadata/srv/eng/resources.get?uuid=b303e50f-4615-4d13-bafa-dae868f9f65f&fname=sondes_sm_s.png HTTP/1.1 200 66297 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0 http://portal.glos.us/?gclid=CMnVs9XwwNMCFda4wAodIwgBjw
```
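If it would help to quantify how much of that logging is Portal traffic, something like the following could be run against whichever log file holds these access entries (the path below is a placeholder, not the confirmed location on GN25):

```
# count entries whose referrer is the Portal
grep -c 'portal.glos.us' /usr/local/tomcat/logs/catalina-daemon.out

# top requesting IPs, to confirm where the volume is coming from
awk '{print $1}' /usr/local/tomcat/logs/catalina-daemon.out | sort | uniq -c | sort -rn | head
```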
Munin snapshot:
@gcutrell was able to clear out the catalina logs and get back to 20% disk utilization. GN then did not come back up (even after a restart). It then appeared that a Lucene indexing error was occurring (which seems to have started on 4/26 and may have been related to the disk issue), preventing the restart. Greg was able to restart the entire server, which allowed GN to come back up. There does not appear to be any loss of data.
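One general note for next time: simply deleting catalina-daemon.out while Tomcat is running does not free the space, because the process keeps the file handle open. Truncating it in place frees the space immediately (log path assumed, as above):

```
# zero out the runaway log without restarting Tomcat;
# deleting it instead leaves the space held by the open file handle
truncate -s 0 /usr/local/tomcat/logs/catalina-daemon.out
# or, equivalently:
: > /usr/local/tomcat/logs/catalina-daemon.out
```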