mailcow / mailcow-dockerized

mailcow: dockerized - 🐮 + 🐋 = 💕
https://mailcow.email
GNU General Public License v3.0

Slow performance from host when mailcow running #3153

Closed entr0p1 closed 4 years ago

entr0p1 commented 4 years ago



Description of the bug: Hi guys, I'm having some interesting performance issues with the VM I run my Mailcow instance on. It may not necessarily be a bug with Mailcow, but it seems to have occurred after an update of either Mailcow or the system (I usually do them at the same time) and I'm wondering if maybe you've seen anything like this.

The web interface is really slow to click through, clients take a very long time to refresh mailboxes (using ActiveSync), and even the VM itself seems to be performing badly (e.g. if I run a "yum update" it will sit at the blinking cursor for a good couple of minutes before it even starts reaching out to servers). The CLI itself is fairly responsive; only commands that take any "thought" are slow.

The VM runs under Hyper-V 2019 and has the following specs:
CPU cores: 2 (Intel Xeon X5650)
RAM: 8GB
Disk: /var is 50GB (15% used) and /opt is 20GB (1% used)

So the strange thing is, I can't pinpoint what the exact cause is. CPU looks fine and load averages are low; 0.00, 0.04, 0.02.

There is plenty of free memory:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        3.0G        3.4G        8.8M        1.1G        4.5G
Swap:          4.0G        203M        3.8G

Disk IO seems to be fine too:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    1.49    0.02    94.37     0.31   125.55     0.02   10.67   10.75    3.14   2.13   0.32
sdb               0.00     0.00    0.03    0.00     1.10     0.22    87.35     0.00    3.61    3.29    6.15   2.66   0.01
sdc               0.02     0.18    2.78    2.08   150.25    27.05    72.93     0.03    5.76    7.50    3.41   1.98   0.96
sdd               0.00     0.00    0.08    0.02     1.43     0.40    36.29     0.00    6.51    6.10    8.48   4.59   0.05
sde               0.16    15.90    1.60    0.36     7.62    65.01    74.24     0.05   25.53    1.47  133.14   1.00   0.19
sdf               0.00     0.00    0.08    0.00     5.23     0.22   125.80     0.00    3.58    3.52    5.06   1.42   0.01
sdg               0.00     0.00    0.04    0.05     2.09     4.55   146.74     0.00    4.28    5.12    3.68   1.85   0.02
sdh               0.00     0.01    0.06    0.17     1.68     1.50    27.42     0.00    3.66    2.13    4.16   2.98   0.07
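A quick way to scan output like the above for outliers is to filter on one of the latency columns. The following is a minimal sketch, assuming the 14-column `iostat -x` layout shown here (so `w_await` is column 12); the function name `slow_devices` and the threshold are hypothetical, and a high value on one device is only a hint, not a diagnosis.

```shell
# Print devices whose write latency (w_await, column 12 in the layout
# above) exceeds a threshold in milliseconds.
# Assumption: input is `iostat -x` output with exactly 14 columns per device.
slow_devices() {
  awk -v limit="$1" '
    NF == 14 && $1 != "Device:" && ($12 + 0) > (limit + 0) { print $1, $12 }
  '
}

# Example (live system): iostat -x | slow_devices 50
```

Other columns (for example `await` in column 10) can be checked the same way by changing the field number.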

Disks are laid out as:

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdf      8:80   0    5G  0 disk 
├─sdf1   8:81   0  200M  0 part /boot/efi
└─sdf2   8:82   0  4.8G  0 part /boot
sdd      8:48   0   20G  0 disk 
└─sdd1   8:49   0   20G  0 part /opt
sdb      8:16   0   30G  0 disk 
└─sdb1   8:17   0   30G  0 part /home
sr0     11:0    1 1024M  0 rom  
sdg      8:96   0    8G  0 disk 
└─sdg1   8:97   0    8G  0 part /tmp
sde      8:64   0    4G  0 disk 
└─sde1   8:65   0    4G  0 part [SWAP]
sdc      8:32   0   50G  0 disk 
└─sdc1   8:33   0   50G  0 part /var
sda      8:0    0   40G  0 disk 
└─sda1   8:1    0   40G  0 part /
sdh      8:112  0   15G  0 disk 
├─sdh1   8:113  0   10G  0 part /var/log
└─sdh2   8:114  0    5G  0 part /var/log/audit

If I stop the mailcow processes, everything is snappy and responsive again (commands run quickly, VM performs well). As soon as I start them again, everything is back to a crawl.
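To put numbers on "snappy" versus "crawl", the same command can be timed once with the stack stopped and once with it running. This is a minimal sketch, assuming GNU `date` (for the `%N` nanosecond format); the helper name `time_ms` is hypothetical.

```shell
# Run a command and print its wall-clock duration in milliseconds.
# Assumption: GNU date, which supports %N (nanoseconds).
time_ms() {
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1          # discard the command's own output
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Example: compare the two states with the same probe command, e.g.
#   time_ms yum check-update
```

Running the probe a few times in each state helps separate a consistent slowdown from one-off cache effects.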

Here's where it gets really weird. There are about 20 VMs on this host, all of which are performing really well and don't display any of these symptoms. The host itself is specced with:


System information

Further information (where applicable):

| Question | Answer |
| --- | --- |
| My operating system | Oracle Linux 7.7 (similar to CentOS/RHEL) |
| Is Apparmor, SELinux or similar active? | SELinux (tried permissive mode, no difference. No denies in /var/log/audit/audit.log) |
| Virtualization technology (KVM, VMware, Xen, etc) | Hyper-V |
| Server/VM specifications (Memory, CPU Cores) | 8GB, 2 cores |
| Docker Version (docker version) | 19.03.5, build 633a0ea |
| Docker-Compose Version (docker-compose version) | 1.24.1, build 4667896b |
| Reverse proxy (custom solution) | NGINX |

Further notes:

Any input or guidance greatly appreciated, thanks!

andryyy commented 4 years ago

Hm, which processes cause the high CPU usage then? And when? I only see 0.x. :)

entr0p1 commented 4 years ago

That’s the thing, it’s always practically in a state of idle (Load averages 0.x) and I can’t work out why.

I wish the solution were as simple as throwing more resources at it :)

andryyy commented 4 years ago

Hm. Even when it becomes slow/laggy? That's strange indeed. Does it show a high "wa" (io wait) in top?

entr0p1 commented 4 years ago

Yeah, the slowness is constant but any CPU spikes are usually short-lived and associated with some sort of task (like the ClamAV updates). Here's the heading from top right now:

top - 23:34:06 up  3:43,  2 users,  load average: 0.06, 0.16, 0.17
Tasks: 377 total,   1 running, 278 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.8 us,  2.5 sy,  0.0 ni, 94.1 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem :  7944772 total,  3135608 free,  3454208 used,  1354956 buff/cache
KiB Swap:  4192252 total,  3986428 free,   205824 used.  4550248 avail Mem 
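
The "wa" figure asked about above can be pulled out of the CPU summary line so it is easy to log over time. This is a minimal sketch, assuming the comma-separated `%Cpu(s)` format shown in this output; the function name `iowait_from_top` is hypothetical.

```shell
# Extract the I/O wait percentage ("wa") from a top CPU summary line.
# Assumption: the line uses the comma-separated format shown above.
iowait_from_top() {
  printf '%s\n' "$1" | awk -F',' '{
    for (i = 1; i <= NF; i++) {
      f = $i
      if (f ~ / wa$/) {            # the field that ends in " wa"
        sub(/^ +/, "", f)          # trim leading spaces
        split(f, parts, " ")
        print parts[1]
      }
    }
  }'
}

# Example (live system): iowait_from_top "$(top -bn1 | grep '^%Cpu')"
```

Sampling this repeatedly while the web interface is hanging would show whether a short I/O wait burst is being missed by a single snapshot.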

andryyy commented 4 years ago

But the server is fine as long as these spikes are not there?

I could think of a sudden high io wait. :/ ClamAV is heavy on the CPU and disk when reloading signatures.

Can you further monitor it? Did you run update.sh recently?

entr0p1 commented 4 years ago

Nope, the slowness is constant :/ For example, I've just loaded a random section of the admin interface and it's showing the loading spinner (first screenshot). The second screenshot is of the top command 1 minute later. The interface is still loading, but the VM doesn't seem to be doing a whole lot...

[Screenshot 2019-11-18 23 44 07] [Screenshot 2019-11-18 23 43 07]

andryyy commented 4 years ago

Does it log anything in the browser dev console?

entr0p1 commented 4 years ago

Nothing of interest, just a warning that the XSS header is bad so it's being ignored. [image]

entr0p1 commented 4 years ago

I'm wondering: if we start turning containers off one-by-one, would that help us find a culprit?

andryyy commented 4 years ago

I have seriously no idea. I thought about it, too. Can you start with netfilter-mailcow?
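
The one-by-one approach can be scripted so each service is stopped, a probe command is timed, and the service is started again. This is a minimal sketch, assuming it is run from the mailcow directory with `docker-compose` available; the function name `bisect_services`, the `COMPOSE` override variable, and the probe command are hypothetical, and timing is only to whole seconds.

```shell
# For each service: stop it, time a probe command, then start it again.
# Assumptions: run from the directory containing docker-compose.yml;
# COMPOSE may be set to override the compose command (hypothetical knob).
bisect_services() {
  probe="$1"; shift
  for svc in "$@"; do
    ${COMPOSE:-docker-compose} stop "$svc" >/dev/null
    start=$(date +%s)
    sh -c "$probe" >/dev/null 2>&1
    end=$(date +%s)
    echo "$svc $((end - start))s"
    ${COMPOSE:-docker-compose} start "$svc" >/dev/null
  done
}

# Example: bisect_services 'yum check-update' netfilter-mailcow clamd-mailcow
```

A service whose absence makes the probe noticeably faster is the first candidate to look at. `docker-compose ps --services` lists the service names to pass in.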

entr0p1 commented 4 years ago

Okay I tried stopping them all one-by-one and it didn't settle for some reason. Any ideas?

andryyy commented 4 years ago

I doubt it is mailcow then. It didn't stop when you stopped all containers?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.