@lukecampbell > Pulling you in to see if you have any ideas on ways to either whitelist the portal hits out of the logs or purge them programmatically on a regular basis. I'm also going through the GN system manual to see if I can glean something there.
[In the interim, Greg and I will be keeping a closer eye on the disk utilization.]
There's a service called logrotate on most, if not all, Linux distributions that handles rotating and cleaning up log files.
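For this case, a minimal logrotate sketch might look something like the following (the Tomcat log path is an assumption and would need to match wherever catalina-daemon.out actually lives on GN25):

```
# /etc/logrotate.d/geonetwork-tomcat  (hypothetical drop-in file)
/usr/local/tomcat/logs/catalina-daemon.out {
    daily            # rotate once per day
    rotate 7         # keep a week of rotated logs
    compress         # gzip rotated logs to save space
    missingok        # don't error if the file is absent
    notifempty       # skip rotation when the file is empty
    copytruncate     # copy then truncate in place, since Tomcat keeps the file open
}
```

copytruncate matters here because Tomcat holds the file open; with the default rotate-and-create behavior, Tomcat would keep writing to the old inode and the space would not be freed until a restart.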
Do we know when the disk space for this server became a problem? Is Greg working on it? Has this been fixed yet?
Just want an update. Once it is fixed, we should document this in a brief postmortem report.
It is fixed. Greg just needs to document his end of things. (And I wanted Luke's feedback on logrotate, so this may remain open until we get something like that in place.) But GN itself is back, and we just need to put something in place to keep it from occurring again.
Good to hear it is fixed. How much downtime was there for that server? Again, a brief report would be great, or it could be integrated into the quarterly report on GLOS infrastructure that we are starting to do.
I don't believe it needs to be reported, as the main person affected was me. I was unable to add data starting at about 3pm on Wednesday (and was not working on Thursday, so figured it could wait until today). However, GN should have still been available to the general user and the portal until this morning, when it was down for less than an hour while Greg and I figured out what needed to be done to clear the disk space and reboot the server. And it only took that long because it was a bit of a learning curve for both of us. I believe it would have been transparent to the user, because they would only have noticed if they were trying to click on the metadata link from a Portal record during that small window of complete downtime this morning. Plus, if it occurs again (which it should not), it would be a matter of a few minutes to clear now that we know what needs to be done.
The VM that hosts GeoNetwork (GN25.glos.us; 64.9.208.34, a VM on the server 192.168.79.111) ran out of disk space.
The GN log files have been growing unchecked, to the point that the catalina-daemon.out file was over 151 GB in size. Looking at the logs (some of which were also significant in size), it appears that the massive logging might be related to the Portal hitting GN.
Example log hit:

```
73.168.47.143, 192.168.76.104 - - [26/Apr/2017:00:37:26 +0000] GET /metadata/srv/eng/resources.get?uuid=b303e50f-4615-4d13-bafa-dae868f9f65f&fname=sondes_sm_s.png HTTP/1.1 200 66297 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0 http://portal.glos.us/?gclid=CMnVs9XwwNMCFda4wAodIwgBjw
```
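If it would help to quantify how much of that logging is Portal traffic, something like the following could be run against whichever log file holds these access entries (the path below is a placeholder, not the confirmed location on GN25):

```
# count entries whose referrer is the Portal
grep -c 'portal.glos.us' /usr/local/tomcat/logs/catalina-daemon.out

# top requesting IPs, to confirm where the volume is coming from
awk '{print $1}' /usr/local/tomcat/logs/catalina-daemon.out | sort | uniq -c | sort -rn | head
```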
Munin snapshot:
@gcutrell was able to clear out the catalina logs and get back to 20% disk utilization. GN then did not come back up (even after a restart). It then appeared that a Lucene indexing error was occurring (which seems to have started on 4/26 and may have been related to the disk issue), preventing the restart. Greg was able to restart the entire server, which allowed GN to come back up. There does not appear to be any loss of data.
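One general note for next time: simply deleting catalina-daemon.out while Tomcat is running does not free the space, because the process keeps the file handle open. Truncating it in place frees the space immediately (log path assumed, as above):

```
# zero out the runaway log without restarting Tomcat;
# deleting it instead leaves the space held by the open file handle
truncate -s 0 /usr/local/tomcat/logs/catalina-daemon.out
# or, equivalently:
: > /usr/local/tomcat/logs/catalina-daemon.out
```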