okfn / ckanext-datahub

Plugin for datahub
datahub.io

Improve performance (specifically page load times) #30

Closed by rufuspollock 10 years ago

rufuspollock commented 11 years ago

At times today I've seen front-page load times for a logged-in user of > 15s to get the HTML and > 30s to get the HTML plus all assets.

We probably want to assess this systematically and then address it.
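For instance, a minimal sketch of getting repeatable numbers with curl (the URL is assumed, and a session cookie would be needed to reproduce the logged-in case):

```sh
# Time just the front-page HTML; repeat a few times to smooth out noise.
# Add e.g. -b 'auth_tkt=...' to test as a logged-in user (cookie name is an assumption).
for i in 1 2 3 4 5; do
  curl -o /dev/null -s -w 'ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
    http://datahub.io/
done
```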

Possible related actions:

I spent 15 minutes today doing some tweaks.

Based on some very crude by-eye testing this may have improved things, e.g. the front page now loads in < 1s (for non-logged-in users).

rufuspollock commented 10 years ago

We're also seeing downtime ...

rufuspollock commented 10 years ago

A suggestion from someone checking this out: we may have a misconfigured MaxClients directive in Apache, which results in loads of workers. I think the best way to fix this permanently is to run CKAN on gunicorn and put it behind the already-running nginx.
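Roughly what that setup would look like, as a sketch only (paths, port, worker count and the ini filename are assumptions, not the live config):

```sh
# Run the CKAN app under gunicorn; gunicorn can load a PasteDeploy .ini via --paste.
pip install gunicorn
gunicorn --paste /etc/ckan/default/production.ini \
  --workers 4 --bind 127.0.0.1:8080 --timeout 60

# Then have the existing nginx proxy to it (relevant config shown as comments):
#   location / {
#       proxy_pass       http://127.0.0.1:8080;
#       proxy_set_header Host $host;
#       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
#   }
```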

@rossjones you've also mentioned adding a few indexes on the activity table (this would be better going into core CKAN than us doing the upgrade ...).
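For reference, the indexes in question are roughly of this shape (a sketch only: the database name is an assumption, and core CKAN's own migrations should be preferred over hand-applied DDL):

```sh
# Activity-stream queries filter on user_id / object_id, so those are the usual columns.
sudo -u postgres psql datahub -c "CREATE INDEX idx_activity_user_id ON activity (user_id);"
sudo -u postgres psql datahub -c "CREATE INDEX idx_activity_object_id ON activity (object_id);"
```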

rossjones commented 10 years ago

I'm not sure MaxClients (which is the number of concurrent requests, not processes) is relevant, as the number of processes/threads we use is defined for the WSGI app (we are using daemon mode). We're certainly not seeing the 136 processes we'd expect given the current setting.
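To illustrate the point (the directive values below are examples, not the live config): in mod_wsgi daemon mode the worker pool is set by WSGIDaemonProcess, so MaxClients only limits Apache's own connection handling.

```sh
# The daemon pool is declared with something like (example values only):
#   WSGIDaemonProcess ckan_default processes=8 threads=15 display-name=%{GROUP}
# Check what is actually configured and how many wsgi workers are running:
grep -R "WSGIDaemonProcess" /etc/apache2/
ps -ef | grep -c "[w]sgi"
```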

I've turned off the nginx cache and put varnish back in (it's a much better cache than nginx), and performance is back to reasonable levels (though not really great ones). Next time I do a big deploy I'm going to move over to nginx -> gunicorn, as Apache is a bit unnecessary.

I've also temporarily blocked BaiduSpider (one common way to do such a block is sketched after this list) because:

I totally failed to install iftop to check out how, and how much, bandwidth is being used; apt won't let me install it.
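A sketch of how such a crawler block is commonly done (whether it was actually done here via nginx, varnish or robots.txt isn't recorded in the thread):

```sh
# robots.txt route (polite crawlers respect it):
#   User-agent: Baiduspider
#   Disallow: /
# nginx route (hard block by user agent), inside the server {} block:
#   if ($http_user_agent ~* "Baiduspider") {
#       return 403;
#   }
# Verify by faking the user agent:
curl -s -o /dev/null -w '%{http_code}\n' -A "Baiduspider" http://datahub.io/
```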

The indices are already added in core, I believe, and they only really affect the painfully slow login.

rufuspollock commented 10 years ago

@rossjones great work - I disabled varnish a month or so ago just because I was trying to debug and it seemed we had two caching layers (i.e. both nginx and varnish were caching, though I may have misunderstood).

The site is certainly a lot zippier, so great work :-)

rufuspollock commented 10 years ago

@rossjones I'm now seeing this from the varnish cache server:

Error 503 Service Unavailable
Service Unavailable
Guru Meditation:
XID: 1493797728

Which, I now recall, is why I switched varnish off last time. I rebooted varnish and it's back, but it's not a great sign (and the site had also got progressively slower and slower over the last week or so ...).
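For what it's worth, a varnish 503 / Guru Meditation usually means the backend fetch failed or timed out; a quick way to check is sketched below (Varnish 3.x-era command output is assumed):

```sh
# Backend failure counters; a climbing backend_fail points at the backend rather than varnish.
varnishstat -1 | grep -i backend
# Per-request detail on why individual fetches failed.
varnishlog | grep -i FetchError
```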

rossjones commented 10 years ago

It's Apache; we need to replace it with a decent WSGI server. It isn't varnish that is the problem.

rufuspollock commented 10 years ago

@rossjones I'm not sure about that. I'd rebooted Apache, and locally on the relevant port it was fine, but varnish was down. Would that still be Apache?

All that said, I'm a big +1 on moving to gunicorn ...

rufuspollock commented 10 years ago

@rossjones I'm seeing the guru meditation quite a bit. I'm disabling varnish for now and reverting to nginx caching, as that did not seem to generate these errors (as I mentioned, that was why I reverted originally). Let's catch up on IRC and hash out a plan.
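For reference, the nginx caching being reverted to is along these lines (a sketch only: the zone name, cache path, sizes and TTLs are illustrative, not the live values):

```sh
# Directives go in the existing nginx config (shown here as comments):
#   proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=ckan:30m max_size=256m;
#   location / {
#       proxy_pass        http://127.0.0.1:8080;
#       proxy_cache       ckan;
#       proxy_cache_valid 200 302 10m;
#   }
# Validate and reload after editing:
sudo nginx -t && sudo service nginx reload
```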

rossjones commented 10 years ago

I still see 503s with nginx, but if it appears to be worse with varnish ...

rufuspollock commented 10 years ago

@rossjones hmmm, interesting. I think we need to start getting this monitored at the very least. I can ask the sysadmin team to set this up with datahub as the notification address. Shall I do this?
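Until proper monitoring is in place, even a cron'd check along these lines would catch the 503s (a sketch; the alert address is a placeholder, not the real notification address):

```sh
# Alert if the front page doesn't return a success code; run from cron every few minutes.
curl -sf -o /dev/null http://datahub.io/ || \
  echo "datahub.io health check failed at $(date)" | mail -s "datahub.io DOWN" "$ALERT_ADDRESS"
```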

rossjones commented 10 years ago

Sure.

rufuspollock commented 10 years ago

@rossjones and shall we expedite the switch to gunicorn etc.? Also, are you around right now for a quick chat?

rossjones commented 10 years ago

Happy to move over to gunicorn at the end of the week; perhaps we should also move to hydrogen (unless that is 100% Ansible-managed at the moment)?

rufuspollock commented 10 years ago

Agree on both points. hydrogen is Ansible-managed, but getting this stuff into Ansible should be fine.

rossjones commented 10 years ago

It turns out the OS had run out of inodes because /tmp was full.
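For future reference, this is the kind of check that exposes it, since disk space can look fine while the inode table is exhausted (a sketch):

```sh
# IUse% at 100% means no inodes left even though df -h may show free space.
df -i /
# Count the files under /tmp that were eating the inodes.
sudo find /tmp -xdev -type f | wc -l
```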

rufuspollock commented 10 years ago

I've just added some related, specific tickets to the description of this issue. Given the lack of performance issues in the last week since we fixed the inode problem, I wonder if we should close this?

rossjones commented 10 years ago

This is covered by: