omega8cc / boa

Barracuda Octopus Aegir 5.4.0
https://omega8.cc/compare

502 Bad Gateway #33

Closed zzack closed 10 years ago

zzack commented 14 years ago

This may or may not be a Barracuda/Octopus issue. I have been running both for about a week with no problems. Today, seemingly at random, I started receiving a 502 Bad Gateway error from both the Barracuda and Octopus sites, and from all sites managed by the Aegir installs.

Rebooting the server fixed the issue, but a Google search for "aegir 502 bad gateway" shows that a number of Barracuda users are getting this error (although not with exactly the same symptoms).

Anyone else experiencing this issue?

omega8cc commented 14 years ago

A 502 Bad Gateway is a sign of problems with php-fpm, or more precisely with php-fpm timeouts, followed by MySQL timeouts. To find out what causes them, please monitor the server logs:

$ tail -f /var/log/php/php-fpm-slow.log

$ tail -f /var/log/php/error_log

$ tail -f /var/log/syslog

Also check available memory and swap to see whether you need to tune your php-fpm and MySQL configuration, especially when running on a server with less than 1 GB of RAM.
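
A minimal sketch of the memory and swap check, assuming a Linux server with a readable /proc/meminfo. The function name mem_swap_summary is made up for this illustration and is not part of Barracuda; it reads meminfo-style text on stdin, so it also works on a saved copy.

```shell
# Hypothetical helper (not part of Barracuda): summarize free memory,
# page cache and used swap from /proc/meminfo-style input on stdin.
# /proc/meminfo reports kB; the output is truncated to whole MB.
mem_swap_summary() {
  awk '
    $1 == "MemFree:"   { free = $2 }
    $1 == "Cached:"    { cached = $2 }
    $1 == "SwapTotal:" { swap_total = $2 }
    $1 == "SwapFree:"  { swap_free = $2 }
    END {
      printf "free_mb=%d cached_mb=%d swap_used_mb=%d\n",
        int(free / 1024), int(cached / 1024), int((swap_total - swap_free) / 1024)
    }
  '
}

# Usage on a live server:
#   mem_swap_summary < /proc/meminfo
```

A swap_used_mb figure that keeps growing while free_mb stays low would suggest the php-fpm and MySQL settings are too large for the machine.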

zzack commented 13 years ago

Thanks! If it happens again I will check those logs, and report back if I find anything that might be generally useful to people.

omega8cc commented 13 years ago

Please leave issues open so others can find them more easily. We will have a better issue tracker, but for now we are using GitHub's native system.

boztek commented 13 years ago

A visit to the Drupal 7 beta's modules page results in a 502 and the following error in syslog:

mysqld: 101013 17:20:55 [Warning] Aborted connection 71 to db: 'd7b1facepalmin' user: 'd7b1facepalmin' host: '69.164.198.158' (Got an error reading communication packets)

I am running on a 768MB linode.

boztek commented 13 years ago

Restarting php-fpm allows the page to display once before failing again so perhaps I should be adjusting php-fpm? Any tips?

boztek commented 13 years ago

For completeness, here is the php-fpm-error.log:

Oct 13 17:44:22.624080 [WARNING] fpm_stdio_child_said(), line 167: child 20473 (pool default) said into stderr: "[Wed Oct 13 17:44:22 2010] [apc-error] Cannot redeclare class databasetasks in /var/aegir/platforms/drupal-7.0-beta1/modules/system/system.admin.inc on line 790."
Oct 13 17:44:22.630600 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Oct 13 17:44:22.630706 [WARNING] fpm_children_bury(), line 215: child 20473 (pool default) exited with code 2 after 662.760568 seconds from start
Oct 13 17:44:22.632835 [NOTICE] fpm_children_make(), line 352: child 21543 (pool default) started

boztek commented 13 years ago

So removing APC from php.ini fixes the problem, but I don't know where to go from here.

omega8cc commented 13 years ago

This is an already known, though new, issue with beta1. Previous alpha versions worked without any problems. Well, only alpha7 didn't allow deleting the site, but alpha6 worked fine. Thanks for reporting it; I hope it will help to debug it for the next beta.

boztek commented 13 years ago

Sorry - should I be reporting this in an already established d.o issue? I looked there but couldn't find one - or is it a Barracuda bug?

boztek commented 13 years ago

Oh and I can't delete the site with drupal 7 beta 1 either.

Dropping database d7arighthandface
Drush command could not be completed.
Output from failed command : Fatal error: Call to undefined function cache_get() in /data/disk/octopus/platforms/drupal-7.0-beta1/includes/module.inc on line 622
An error occurred at function : drush_hosting_task

linuxgeneral commented 13 years ago

In addition to memory and swap use, you should also check for excessive Wait-IO. I have also encountered this problem on a VPS that is on a host with a lot of disk activity.

see also: http://github.com/omega8cc/nginx-for-drupal/issues#issue/84

boztek commented 13 years ago

Thanks for the tip - what do you consider excessive Wait-IO? vmstat is definitely reporting some non-zero numbers, but not constantly:

root@lefthand:~# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 161520  17564 358852    0    0    51    23  161  109  2  1 94  3
 0  0      0 160892  17572 358852    0    0     0     2  931  677  2  0 98  0
 0  0      0 161148  17572 358852    0    0     0   118 1014  735  1  0 98  0
 0  0      0 161512  17580 358852    0    0     0    31  901  683  2  0 98  0
 0  0      0 159288  17588 358856    0    0     1   138 1105  775  2  0 98  0
 0  0      0 159404  17596 358856    0    0     0     3  874  683  1  0 98  0
 0  0      0 159536  17620 358860    0    0     1   191  995  722  1  0 97  1
 0  0      0 158660  17632 358864    0    0     0     9  962  700  2  1 98  0
 0  0      0 159272  17652 358900    0    0     1   214 2014 1084  6  2 91  1
 0  0      0 159264  17660 358904    0    0     0     6  957  692  2  0 98  0
 0  0      0 155304  17700 359580    0    0    63   269 1733 1125  3  1 96  0
 0  0      0 153560  17740 359728    0    0    62    23 1227  864  2  0 95  2
 0  0      0 125660  18152 372512    0    0   531   488 1480  892  3  0 91  6
 0  0      0 125924  18176 372512    0    0     0    42  983  705  2  0 98  0
 0  0      0 125676  18200 372512    0    0    26   100 1051  742  2  0 98  0
 0  0      0 123916  18284 372984    0    0    72    82 1600  919  3  1 95  1
 0  0      0 123676  18320 373016    0    0    15   218 1267  900  2  0 98  0
 0  0      0 123932  18336 373020    0    0     0   138 1123  784  2  0 98  0
 1  0      0 123932  18376 373020    0    0     0   212 1146  807  2  0 92  6
 0  0      0 124056  18392 373032    0    0     1    13 1148  778  2  1 96  2
 0  0      0 107532  18448 373076    0    0    12   389 2293 1214  8  2 89  1
 0  0      0 124024  18464 373080    0    0     1    49 1045  737  2  0 98  0
 0  0      0 123552  18496 373092    0    0     1   298 1194  791  2  1 95  2

Is this likely something that could be tuned a bit, or is it a sign that Barracuda really can't run on 768 MB of memory?

linuxgeneral commented 13 years ago

Your Wait-IO is consistently under 10 and it does not look like you're using any swap. Barracuda runs fine on 512 MB. It's the resource-intensive Drupal installs that may have a problem when the server is under heavy disk load.

Try another install now and see if you can reproduce the issue with the above load.
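
A rough sketch of how the Wait-IO reading above could be automated, assuming the standard vmstat layout shown in this thread. The function name max_wait_io is hypothetical; it reads vmstat output on stdin and prints the highest value seen in the wa column.

```shell
# Hypothetical helper: print the highest value in the "wa" (Wait-IO) column
# of vmstat output read from stdin. The column is located from the header
# row, so extra trailing columns (such as "st") do not matter.
max_wait_io() {
  awk '
    # Header row: remember which field holds "wa".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
    # Data rows start with a number; the "procs ---" banner line is skipped.
    col && $1 ~ /^[0-9]+$/ { if ($col + 0 > max) max = $col + 0 }
    END { print max + 0 }
  '
}

# Usage on a live server (20 samples, one per second):
#   vmstat 1 20 | max_wait_io
```

Note that the first data row printed by vmstat is an average since boot, so a short run can be skewed by it. What counts as excessive is a judgment call; the figure of 10 used above is only the informal yardstick from this thread.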

boztek commented 13 years ago

You mean do a clean reinstall on the same VPS ?

omega8cc commented 13 years ago

  1. The d7-beta1 issue can be resolved by disabling this option: apc.include_once_override = 0. It is not a Barracuda bug, however, but rather something introduced in the beta that breaks when this option is enabled. There is no need to disable APC.
  2. The inability to delete a d7 site is a known issue; see http://drupal.org/node/931136
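
To confirm what the option is currently set to before editing, something like the following could be used. This is only a sketch: ini_value is a made-up helper name, and the /opt/etc/php.ini path in the usage comment is the one mentioned elsewhere in this thread, which may differ on other installs.

```shell
# Hypothetical helper: print the value of one php.ini directive from ini
# text on stdin. Commented-out lines are ignored; if the directive appears
# more than once, the last uncommented assignment is printed.
ini_value() {
  awk -F= -v key="$1" '
    /^[ \t]*;/ { next }   # skip commented-out lines
    {
      name = $1
      gsub(/[ \t]/, "", name)
      if (name == key) {
        val = $2
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = val
      }
    }
    END { print found }
  '
}

# Usage, assuming the php.ini location used by this install:
#   ini_value apc.include_once_override < /opt/etc/php.ini
```

An output of 0 means the override is already disabled; an empty line means the directive is not set in that file.
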
linuxgeneral commented 13 years ago

There should be no need for a new install or upgrade of Barracuda.

Simply create another site. Sorry for the confusion. Continue to watch your vmstat output. The issue I linked above shows how this is related to tasks that never show as completed, and how to delete them manually.

The problem is most likely intermittent, and you may need vmstat results showing high Wait-IO to open a support request with your hosting provider.

Here is an example of vmstat output for a VPS with 4 GB of RAM on an overloaded server:

vmstat 1 20

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  34072  51556 321452 522660    0    0     1    15    6    9  1  0 98  1
 0  1  34072  51424 321520 523472    0    0     0  1520 1153 1057  4  1 80 15
 1  1  34072  50804 321560 523868    0    0     0   440 1691  325 10  1 83  6
 0  1  34072  50060 321628 524408    0    0     0  1004  691  619  2  0 81 16
 1  0  34072  49564 321684 524812    0    0     0  1420  956  895  3  1 78 18
 0  2  34072  48200 321768 525248    0    0     0  1192  744  727  2  1 79 18
 1  0  34072  47580 321840 525828    0    0     0  1100  808  695  3  1 80 17
 0  1  34072  47704 321936 526456    0    0     0  1732 1689 1640  3  1 77 19
 0  2  34072  47580 321952 526492    0    0     0  6808  297  311  0  0 76 24
 1  2  34072  47332 321960 526884    0    0     0  9792  536  271  1  0 71 28
 0  2  34072  47332 321980 526868    0    0     0  3756  335  263  0  0 74 26
 0  5  34072  47456 322000 526876    0    0     0  2060  420  454  1  0 69 30
 2  0  34072  46836 322060 527464    0    0     0  1452 1017  985  3  1 79 16
 0  1  34072  44728 322148 528248    0    0     0  1552  870  767  3  1 80 15
 2  0  34072  45348 322200 528672    0    0     0   992  650  624  2  0 77 20
 0  1  34072  43744 322300 529472    0    0     0  1640 1063  980  4  1 81 14
 2  0  34072  43868 322356 529896    0    0     0  1092  791  746  3  1 78 19
 0  1  34072  43620 322404 530484    0    0     0   948 1320 1202  1  1 75 22
 1  1  34072  20964 322452 530532    0    0     0   840 1578 2109  6  4 69 21
 0  1  34072  19896 322488 530548    0    0     0   476  382  330  0  0 76 24

boztek commented 13 years ago

Thanks for the help - it doesn't look like swapping is the issue - php is just dying for some reason.

It's gone down again: no sites are accessible and hosting-dispatch jobs are piling up. Eventually that will cause it to swap, but at the moment it's holding.

Back to the drawing board.

shai-weinstein commented 13 years ago

I think it has something to do with the backlog. Check the kernel parameters net.ipv4.tcp_max_syn_backlog and net.core.somaxconn.

Try increasing them, and set the backlog value in php-fpm.conf to 1024.

I'm still not sure which values are enough or too high. I saw some documents that recommend a value in the thousands (3000), but try experimenting with it.

After setting the kernel parameters you will probably need to restart php-fpm and nginx.
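
A minimal sketch of the kernel side of this suggestion, assuming a Linux server with the standard sysctl tool. The helper name backlog_sysctl_lines and the file name 60-backlog.conf are assumptions made for illustration. The php-fpm side is not shown because the name of the backlog setting depends on the php-fpm version in use.

```shell
# Hypothetical helper: emit sysctl.conf-style lines raising both backlog
# limits to the given value (default 1024, the figure suggested above).
backlog_sysctl_lines() {
  value="${1:-1024}"
  printf 'net.core.somaxconn = %s\n' "$value"
  printf 'net.ipv4.tcp_max_syn_backlog = %s\n' "$value"
}

# Usage on a live server, as root:
#   sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog     # show current values
#   backlog_sysctl_lines 1024 > /etc/sysctl.d/60-backlog.conf  # file name is an assumption
#   sysctl -p /etc/sysctl.d/60-backlog.conf                    # apply without rebooting
```

Checking the current values first makes it easy to revert if the change has no effect.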

By the way, if it's already solved, please say so, and how :)

ghost commented 13 years ago

Hi,

I'm running a Linode 512 VPS.

I just ran an upgrade on Barracuda, as I was having trouble cloning, and I'm now getting 502 Bad Gateway on all sites, including Aegir.

I ran these commands as suggested:

$ tail -f /var/log/php/php-fpm-slow.log

just sits there with 'rotate'

$ tail -f /var/log/php/error_log

[11-Apr-2011 09:57:38] PHP Fatal error: PHP Startup: apc_mmap: mmap failed: in Unknown on line 0

$ tail -f /var/log/syslog

Apr 11 10:00:53 aegir kernel: php-cgi[15777]: segfault at f ip 00007f919db9ac10 sp 00007fff43ce83a0 error 6 in libpthread-2.11.1.so[7f919db92000+18000]
Apr 11 10:01:07 aegir kernel: php-cgi[15963]: segfault at f ip 00007f9080833c10 sp 00007fffbefb4f70 error 6 in libpthread-2.11.1.so[7f908082b000+18000]

Restarting the server gets it back online, but it quickly dies again.

Any ideas?

omega8cc commented 13 years ago

The current configuration is tuned for a minimum of 1 GB of RAM (see the README.txt). To use it on a VPS with less memory, please change the default APC value in /opt/etc/php.ini from apc.shm_size = 512M to apc.shm_size = 128M and restart php-fpm with service php-fpm restart.
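
One possible way to script that change, as a sketch only. The helper name set_apc_shm_size is hypothetical; it rewrites ini text from stdin to stdout, so the original file is left alone until the result is written back deliberately.

```shell
# Hypothetical helper: rewrite the apc.shm_size line in ini text read from
# stdin and print the result to stdout. The argument is the new value,
# e.g. 128M. Other lines pass through unchanged.
set_apc_shm_size() {
  sed -E "s/^([[:space:]]*apc\.shm_size[[:space:]]*=[[:space:]]*).*/\1$1/"
}

# Usage, as root, keeping a backup first:
#   cp /opt/etc/php.ini /opt/etc/php.ini.bak
#   set_apc_shm_size 128M < /opt/etc/php.ini.bak > /opt/etc/php.ini
#   service php-fpm restart
```

Working from the backup copy means the previous 512M setting can be restored by copying php.ini.bak back into place.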

ghost commented 13 years ago

That seems to have done the trick - thanks

ghost commented 13 years ago

OK, it seems to fall over eventually. On the Linode graphs I see a big I/O spike, then the CPU maxes out and the server stops responding. I'll upgrade to 1 GB of RAM. I also noticed a couple of errors in the console:

"Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start S20cron
start: Unknown job: S20cron

and also

Starting redis-server: failed "