ros-infrastructure / answers.ros.org

Tickets for answers.ros.org

Site time-outs, (very) long loading times #196

Closed gavanderhoorn closed 5 years ago

gavanderhoorn commented 5 years ago

I realise this will perhaps be hard to diagnose as it may be local, but for the past couple of weeks (2?) the loading times of ROS Answers have gone up significantly. Logging in can take up to 30 seconds. Loading a question sometimes times out; other times it takes a similar amount of time (30 to 50 seconds).

It's most noticeable at the beginning of the day (CEST). Right now (2 pm CEST) it's OK, but not great either.

status.ros.org seems to show problems with the site around those times as well:

answers_status

evgenyfadeev commented 5 years ago

I've seen the CPU load oscillate between ~40% and very high. It could be an effect of a cold cache on the new server combined with coincidentally high robot traffic. If this persists, a solution would be to add CPUs (currently we have two).

The changes I've made compared to the old site:

  • dockerized deployment - could have added some impact

gavanderhoorn commented 5 years ago

> dockerized deployment - could have added some impact

In my experience there is very little overhead (if any) incurred by deploying with Docker, unless NAT-based networking is used.
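For context, the NAT overhead in question comes from Docker's default bridge network, where published ports go through iptables/`docker-proxy` on every request; host networking sidesteps that. A sketch of the difference (the `askbot-app` image name is a hypothetical placeholder, not the actual deployment):

```shell
# Default: bridge network with a published port -- each request traverses
# NAT (iptables DNAT rules and/or the userspace docker-proxy).
docker run -d -p 80:8080 askbot-app

# Alternative: host networking -- the container shares the host's network
# stack directly, so there is no per-request NAT hop.
docker run -d --network host askbot-app
```

Whether this matters depends on request volume; for a busy Q&A site the bridge overhead is usually small but measurable.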

gavanderhoorn commented 5 years ago

It's been pretty bad again:

Screenshot_2019-06-26 ROS Status

@evgenyfadeev: does Askbot have any way to trace performance problems like this?


Edit: it still is pretty bad.

gavanderhoorn commented 5 years ago

I cannot be the only one having these problems, but apparently I am the only one complaining about it:

Screenshot_2019-07-03 ROS Status

evgenyfadeev commented 5 years ago

I've sent an email requesting an increase in the CPUs on this machine.

gavanderhoorn commented 5 years ago

Hm, did we get 0.5 CPUs now? :0

Screenshot_2019-07-04 ROS Status

evgenyfadeev commented 5 years ago

We did double the CPUs to 4, but that did not have an effect! I've now doubled the cache RAM as well; let's see.

Thank you for your feedback.

gavanderhoorn commented 5 years ago

Seems slightly better, but it's still not what it used to be.

status.ros.org seems to agree:

Screenshot_2019-07-09 ROS Status

gavanderhoorn commented 5 years ago

I don't know how it is for others (so perhaps this is a networking issue on my side), but I'm frequently waiting tens of seconds for pages to load, edit boxes to appear, etc.

status.ros.org also suggests Answers is having some issues:

Screenshot_2019-07-10 ROS Status

evgenyfadeev commented 5 years ago

Increased the cache RAM by another 50%.

gavanderhoorn commented 5 years ago

Seemed better at first, but not sure any more:

Screenshot_2019-07-14 ROS Status

gavanderhoorn commented 5 years ago

Situation today:

Screenshot_2019-07-16 ROS Status

Could there be a time-of-day aspect to this? Right now it's really slow (a 40-second wait for the last page I tried to access). This morning it was OK-ish. Sometimes it's instantaneous.

This is on different machines, different internet connections and different OS.

All times and references CEST.

It's getting rather annoying tbh.

evgenyfadeev commented 5 years ago

Doubled the number of server worker processes; I might repeat this if the load average permits.

If the situation does not improve soon, I will move the job queue from Redis to RabbitMQ to eliminate the possibility of the queue consuming the app's cache space.
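The broker move described above could be as small as swapping the broker URL in the Django/Celery settings, so the job queue no longer shares Redis memory with the page cache. A hedged sketch; the setting names and URLs are illustrative assumptions, not the actual answers.ros.org configuration:

```python
# Hypothetical Django settings excerpt for an Askbot-style Celery setup.

# Before: Redis doubles as both the cache backend and the Celery broker,
# so a growing job queue competes with the page cache for the same RAM.
CELERY_BROKER_URL = "redis://localhost:6379/0"

# After: point the job queue at RabbitMQ, reserving Redis memory for the
# application cache alone.
CELERY_BROKER_URL = "amqp://guest:guest@localhost:5672//"
```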


gavanderhoorn commented 5 years ago

@evgenyfadeev: is there no way to trace and see where the bottlenecks are?

evgenyfadeev commented 5 years ago

Yes, I'm looking into this. I'll set up uwsgitop to monitor the server processes. I did see the cache filling up, which is why I've now quadrupled it.
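For reference, uwsgitop attaches to a uWSGI stats socket and shows live per-worker request counts, average response times, and memory usage. A minimal setup sketch, assuming uWSGI is the app server (the socket path is illustrative):

```shell
# uwsgitop is a small console client for uWSGI's stats server
pip install uwsgitop

# uWSGI must be started with a stats socket enabled, e.g. in uwsgi.ini:
#   [uwsgi]
#   stats = /tmp/uwsgi-stats.sock
#   memory-report = true

# Then attach to the running server for a top-like per-worker view
uwsgitop /tmp/uwsgi-stats.sock
```

This makes it straightforward to see whether slow requests correlate with a particular worker being saturated or with memory pressure.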


gavanderhoorn commented 5 years ago

So far the site has been rather responsive. Almost back to how it was before the upgrade.

Screenshot_2019-07-17 ROS Status

It's only been a day though, so let's see how it holds up.

gavanderhoorn commented 5 years ago

I'm not sure what changed, but yesterday and the day before it was buttery smooth. Today I'm waiting on pages to load again.

gavanderhoorn commented 5 years ago

I've not seen any more service disruptions or site time-outs so far.

@evgenyfadeev: it would seem the changes you've made to the server config have helped.

As such, closing for now. Will re-open if/when we run into any more problems.