We're also seeing downtimes ...
A suggestion from someone checking this out: we may have a misconfigured MaxClients directive in Apache, which results in loads of workers. I think the best way to fix this permanently is to run CKAN under gunicorn and put it behind the already-running nginx.
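For reference, a minimal sketch of what that could look like, assuming a stock CKAN paste config; the path, port and worker count below are illustrative, not the live setup:

```bash
# Run CKAN's WSGI app straight from its paste config under gunicorn
# (path, port and worker count are assumptions, not the live config).
pip install gunicorn
gunicorn --paste /etc/ckan/default/production.ini \
         --bind 127.0.0.1:8080 --workers 4 --timeout 30

# Then point the already-running nginx at it instead of Apache,
# e.g. inside the existing server block:
#   location / {
#       proxy_pass http://127.0.0.1:8080;
#       proxy_set_header Host $host;
#       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
#   }
```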
@rossjones you've also mentioned adding a few indexes on the activity table (this would be better going into core ckan than us doing the upgrade ...)
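For the record, the sort of thing meant is sketched below; the database name is a placeholder and the column names are assumed from CKAN's activity model rather than checked against the live schema:

```bash
# Hypothetical indexes on the activity table (db name and columns are assumptions).
# Ideally these land in a core CKAN migration rather than being applied by hand here.
sudo -u postgres psql datahub -c "CREATE INDEX idx_activity_user_id ON activity (user_id);"
sudo -u postgres psql datahub -c "CREATE INDEX idx_activity_object_id ON activity (object_id);"
```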
Not sure MaxClients (this is the number of concurrent requests, not processes) is relevant, as the number of processes/threads we use is defined for the WSGI app (we are using daemon mode). We're certainly not seeing the 136 processes we'd expect given the current setting.
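For clarity, a sketch of where the pool size actually comes from in daemon mode; the directive values are illustrative, not our real vhost:

```bash
# In daemon mode the WSGI pool is sized by WSGIDaemonProcess, e.g. (illustrative only):
#   WSGIDaemonProcess ckan display-name=ckan processes=4 threads=34
#   WSGIProcessGroup ckan
# MaxClients then only caps how many connections Apache will accept at once.
apachectl -t -D DUMP_MODULES | grep -i wsgi   # confirm mod_wsgi is loaded
ps -C apache2 --no-headers | wc -l            # count Apache worker processes actually running
```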
I've turned off the nginx cache and put varnish back again (it's a much better cache than nginx), and performance is back to reasonable levels (though not really great ones). Next time I do a big deploy I'm going to move over to nginx->gunicorn, as Apache's a bit unnecessary.
I've also temporarily blocked BaiduSpider because:
I totally failed to install iftop to check how, and how much, bandwidth is being used. apt won't let me install it.
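For reference, the check I was after looks roughly like this (the interface name is an assumption):

```bash
# Once apt is behaving again:
sudo apt-get update
sudo apt-get install iftop
sudo iftop -i eth0 -P   # live per-connection bandwidth, with port numbers
```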
The indices are already added in core I believe, and only really affect the painfully slow login.
@rossjones great work - i disabled varnish a month or so ago just because i was trying to debug and it seemed we had 2 layers (i.e. nginx and varnish were doing caching - tho' may have misunderstood).
Site is certainly a lot zippier so great work :-)
@rossjones I'm now seeing this from the varnish cache server:
Error 503 Service Unavailable
Service Unavailable
Guru Meditation:
XID: 1493797728
Which, now I recall, is why I switched varnish off last time. I rebooted varnish and it's back, but it's not a great sign (and the site had also got progressively slower and slower over the last week or so ...).
It's Apache; we need to replace it with a decent WSGI server. It isn't varnish that is the problem.
@rossjones I'm not sure there. I'd rebooted Apache and, locally on the relevant port, it was fine, but varnish was down. Would that still be Apache?
All that said I'm a big +1 on move to gunicorn ...
@rossjones I'm seeing the guru meditation quite a bit. I'm disabling varnish for now and reverting to nginx caching, as that did not seem to generate these errors (as I mentioned, that was why I reverted originally). Let's catch up on IRC and hash out a plan.
I still see 503s with nginx but if it appears to be worse with varnish....
@rossjones hmmm interesting. I think we need to start getting this monitored at the very least. I can request the sysadmin team set this up with datahub as the notification address. Shall I do this?
Sure.
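In the meantime, a possible stop-gap until proper monitoring is in place (not what the sysadmin team would set up; the URL is the live site, the notification address is a placeholder):

```bash
# Manual check first:
curl -sf -o /dev/null --max-time 30 http://datahub.io/ && echo up || echo DOWN
# Rough cron idea, mails on failure (placeholder address):
#   */5 * * * * curl -sf -o /dev/null --max-time 30 http://datahub.io/ \
#     || echo "datahub.io front page failed" | mail -s "datahub.io down" datahub@example.com
```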
@rossjones and shall we expedite the switch to gunicorn etc? Also, are you around right now to chat quickly?
Happy to move over to gunicorn at the end of the week, perhaps we should also move to hydrogen (unless that is 100% ansible managed atm)?
agree on both points. hydrogen is ansible-managed but getting this stuff into ansible should be fine.
Turns out it was that the OS had run out of inodes because /tmp was full.
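For future reference, a quick sketch of how to spot and clear this if it recurs (the seven-day cutoff is just an example policy):

```bash
df -i                                            # inode usage per filesystem
sudo find /tmp -xdev -type f | wc -l             # how many files are eating inodes in /tmp
sudo find /tmp -xdev -type f -mtime +7 -delete   # example cleanup: temp files older than a week
```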
I've just added some related specific tickets to the description for this issue. Given the lack of performance issues in the last week since we fixed the inodes, I wonder if we should close this?
This is covered by:
At times today I've seen page load times for the front page, for a logged-in user, of > 15s to get the HTML and > 30s to get the HTML plus all assets.
We probably want to assess this systematically and then address it.
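One way to make that measurement repeatable rather than by eye (the logged-in case would additionally need a session cookie passed with -b):

```bash
# Time the front-page HTML fetch; asset timings would need a real browser or similar.
curl -s -o /dev/null \
     -w "dns:%{time_namelookup}s connect:%{time_connect}s ttfb:%{time_starttransfer}s total:%{time_total}s\n" \
     http://datahub.io/
```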
Possible related actions:
- #16 (serve static assets directly)
- #46 migrate datahub to new machine
- #47 run datahub under gunicorn
Some things I did today
I spent 15m today doing some tweaks
Based on some very crude by-eye testing this may have improved things - e.g. the front page now loads in <1s (for non-logged-in users).