official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Server down #423

Closed crossbr closed 4 years ago

crossbr commented 4 years ago

"Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)"

snicolet commented 4 years ago

If we suppose that the size of the Mongo database is the cause of the slowdowns, what about splitting the database by months? The usual use of the site would only pull data from the last two months (current and previous: this would ensure at least 30 days of history), and there would be a small checkbox (off by default) on each page to pull the data as today by looping over all the months since the start of the project.
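A minimal sketch of the month-splitting idea, assuming one collection per calendar month named `runs_YYYY_MM`. That naming scheme and the helper names are hypothetical and only illustrate how the "last two months by default, everything on request" selection could work:

```python
from datetime import date


def month_key(year, month):
    """Hypothetical collection name for one calendar month."""
    return f"runs_{year:04d}_{month:02d}"


def recent_buckets(today, months=2):
    """Names of the last `months` monthly buckets, newest first."""
    names = []
    year, month = today.year, today.month
    for _ in range(months):
        names.append(month_key(year, month))
        month -= 1
        if month == 0:  # wrap from January back to December
            year, month = year - 1, 12
    return names


def all_buckets(start, today):
    """Every monthly bucket from `start` up to `today`, newest first."""
    count = (today.year - start.year) * 12 + (today.month - start.month) + 1
    return recent_buckets(today, count)


print(recent_buckets(date(2019, 11, 17)))  # ['runs_2019_11', 'runs_2019_10']
```

The default view would query only `recent_buckets(...)`, while the opt-in checkbox would loop over `all_buckets(...)`.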

ppigazzini commented 4 years ago

@snicolet @tomtor yesterday I updated the VPS to Ubuntu 18.04 and the new SW stack/configuration.

Some minutes ago mongo was killed for an unknown reason:

~$ sudo journalctl -u mongod
-- Logs begin at Sun 2019-11-17 00:34:42 CET, end at Sun 2019-11-17 13:41:51 CET. --
Nov 17 00:52:16 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Nov 17 02:09:10 tests.stockfishchess.org systemd[1]: Stopping MongoDB Database Server...
Nov 17 02:09:10 tests.stockfishchess.org systemd[1]: Stopped MongoDB Database Server.
Nov 17 02:09:18 tests.stockfishchess.org systemd[1]: Started MongoDB Database Server.
Nov 17 13:16:23 tests.stockfishchess.org systemd[1]: mongod.service: Main process exited, code=killed, status=9/KILL
Nov 17 13:16:23 tests.stockfishchess.org systemd[1]: mongod.service: Failed with result 'signal'.

snicolet commented 4 years ago

Is it possible that the host has an automatic kill for processes using too much CPU or memory?

ppigazzini commented 4 years ago

@snicolet yes, the official VPS has 0 swap. I rented a small test VPS from the same ISP and its swap is 50% of RAM. I will ask @glinscott to open a ticket.

d3vv commented 4 years ago

If there is no swap partition, or the swap partition is too small, it is always possible to use a file as swap:

# dd if=/dev/zero of=/swapfile bs=2048 count=1048576
# mkswap /swapfile
# swapon /swapfile

and add to /etc/fstab:

/swapfile none swap sw 0 0
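For reference, the size of the file written by the `dd` command above is simply block size times block count. A small sketch of that arithmetic (the helper names are illustrative, not part of any tool):

```python
def dd_output_bytes(bs, count):
    """Bytes written by `dd` for a given block size and block count."""
    return bs * count


def count_for_size(size_gib, bs=2048):
    """Block count needed to reach `size_gib` GiB with block size `bs`."""
    return (size_gib * 2**30) // bs


size = dd_output_bytes(bs=2048, count=1048576)
print(size / 2**30)        # 2.0 (GiB)
print(count_for_size(4))   # 2097152 blocks for a 4 GiB swap file
```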

ppigazzini commented 4 years ago

To free some RAM I configured the server to use only 1 instance of pserve. @d3vv thank you, I'll try it now.

d3vv commented 4 years ago

Swap is a very important part of Linux for how the OOM killer behaves, so I also dedicate half of the memory to zram swap devices:

cat /etc/local.d/zram.start

#!/bin/bash

SIZE=512

num_devices=4

until [ $num_devices -eq 0 ]
do
 num_devices=$[$num_devices-1]
 echo $(($SIZE*1024*1024)) > /sys/block/zram$num_devices/disksize
 mkswap /dev/zram$num_devices
 swapon /dev/zram$num_devices -p 10
done

4 devices because there are 4 CPUs; that gives the best performance.

cat /proc/swaps

Filename                                Type            Size    Used    Priority
/dev/md124                              partition       33554364        0       -1
/dev/zram3                              partition       524284  1536    10
/dev/zram2                              partition       524284  1280    10
/dev/zram1                              partition       524284  1280    10
/dev/zram0                              partition       524284  1280    10

d3vv commented 4 years ago

But the number of zram devices has to be passed to the kernel at boot:

cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-4.12.12-gentoo root=/dev/md4 ro md=4,/dev/sda4,/dev/sdb4 zswap.enabled=1 zram.num_devices=4

This is for heavily loaded servers. zram.num_devices = <number of real CPU cores (without HT cores)>

ppigazzini commented 4 years ago

@d3vv after a little googling I found that OpenVZ usually assigns a virtual swap that is RAM based (50% of the VPS RAM). On OpenVZ it is bad practice to add a disk/file swap because it thrashes the host disk subsystem. Using only 1 pserve instance has freed 30% of RAM, so I will keep monitoring the VPS while waiting for the ISP's answer.

d3vv commented 4 years ago

@ppigazzini As far as I remember from a long time ago, that comes from the Virtuozzo settings made by the admins. Unfortunately, in recent years I have only worked with Xen and Linux KVM for virtual instances.

ppigazzini commented 4 years ago

@d3vv I also have little OpenVZ knowledge. In the last months I tested the multi-pserve configuration on the dev server (a VPS with another ISP), which has a 10 GB disk-based swap, so I never hit the memory wall. Also, on the production server the pserve and mongod processes use twice as much RAM because of the large number of workers.

d3vv commented 4 years ago

@ppigazzini Could you explain what pserver means? In my own practice I have used Percona Server for MongoDB (the best for me):

https://github.com/percona/percona-server-mongodb

But I began to guess that "this pserver" is something else..

tomtor commented 4 years ago

from https://www.bmc.com/blogs/mongodb-memory-usage-and-management/

MongoDB, in its default configuration, will use the larger of either 256 MB or ½ of (ram – 1 GB) for its cache size.

That looks OK to me, because we would like to fit the last month of tests in the MongoDB cache. 1 GB for pserve should be sufficient. It would be nice to know the size of 30 days of tests and match the MongoDB cache size to it.

If an average slot is 200 bytes and we average 100 slots per test and 50 tests each day: 200 * 100 * 50 * 30 => 30 megabytes, so reducing the MongoDB cache (with cacheSizeGB) to e.g. 512 megabytes should allow querying several months efficiently AND restoring the old multi-pserve configuration.
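The back-of-the-envelope estimate above can be reproduced as follows. The input figures are the rough guesses from the comment, not measurements, and the function name is only illustrative:

```python
def estimate_test_data_bytes(slot_bytes, slots_per_test, tests_per_day, days):
    """Rough size of the test data accumulated over `days` days."""
    return slot_bytes * slots_per_test * tests_per_day * days


total = estimate_test_data_bytes(
    slot_bytes=200,      # assumed average slot size
    slots_per_test=100,  # assumed average slots per test
    tests_per_day=50,    # assumed tests per day
    days=30,
)
print(total)        # 30000000 bytes
print(total / 1e6)  # 30.0 megabytes
```

Under these assumptions a 512 megabyte cache would hold well over a year of such data, ignoring indexes and per-document overhead.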

Using only 1 pserve instance has freed 30% RAM

We can raise that later, but I think it is good to reduce it to 1 for now as you did and first lower cacheSizeGB to e.g. 512 megabytes.

@ppigazzini How much RAM does the current server have?

tomtor commented 4 years ago

Could you explain what is pserver mean

That is the Python Pyramid server which implements the fishtest web server

ppigazzini commented 4 years ago

@tomtor 5 GB RAM

tomtor commented 4 years ago

5 GB RAM

@ppigazzini Ah, thanks. That's a lot, so MongoDB will try to take 2 gigabytes ((5-1)/2) and will be killed by the kernel if the pserve processes request their fair share. I would estimate that 512 megabytes would be sufficient for MongoDB.
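The default cache rule quoted earlier in the thread (the larger of 256 MB or half of RAM minus 1 GB) can be written out as a small sketch. The function name is illustrative, and the "left for everything else" figures ignore the memory mongod uses outside its cache:

```python
def default_wiredtiger_cache_gb(ram_gb):
    """Larger of 256 MB or half of (RAM - 1 GB), per the quoted rule."""
    return max(0.25, 0.5 * (ram_gb - 1.0))


ram_gb = 5.0
default_cache = default_wiredtiger_cache_gb(ram_gb)
print(default_cache)           # 2.0 GB cache by default on a 5 GB server
print(ram_gb - default_cache)  # 3.0 GB left for pserve, the OS, and the rest
print(ram_gb - 0.5)            # 4.5 GB left with a 512 MB cache
```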

Currently performance of fishtest is fine BTW, and congratulations on the upgrade to Ubuntu 18.04!

d3vv commented 4 years ago

Using a big cacheSizeGB means "don't care about unexpected crashes/problems" for the sake of performance. In that case you could just as well use XFS with the nobarrier option and have no fear :) In any case, I suggest using a swap partition of at least 25% of RAM on virtual machines too. It buys time to stay alive if problems persist on the main host or on the guest. zswap also helps to stay online as long as possible, but it needs free memory. I also advise you to look at the Percona solutions.

ppigazzini commented 4 years ago

@tomtor @d3vv the default /etc/mongod.conf has the engines section commented out. I will test on the dev server to avoid a disaster.

# mongod.conf

# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
#  engine:
#  mmapv1:
#  wiredTiger:

d3vv commented 4 years ago

@ppigazzini yes, that is the default configuration now, for stability reasons, but its performance is slow. I also advise looking at:

https://github.com/DmitryKoterov/cachelrud

d3vv commented 4 years ago

As for nginx I recommend to set:

server_tokens off;

to hide the version and the Linux distro for security reasons.

I suggest investigating which URLs need only GET and which need GET+POST, and using the limit_except GET directive accordingly.

ppigazzini commented 4 years ago

@tomtor @d3vv the current VPS plan doesn't support swap. Anyway my first priority is to restore the backup process.

d3vv commented 4 years ago

@ppigazzini Once again, I suggest using a swap file, and if you are low on memory, a small zswap. This is a way to "fool" the host machine and avoid falling into its virtual memory, which in some situations is the same as the main system's swap. It is even nicer if your VDS/VPS is on SSD. Alternatively, if you use an XFS filesystem, you can always run an online defragmentation, for example:

xfs_fsr /dev/<device>

where "device" = sda1, md1, etc

d3vv commented 4 years ago

I would say that you should not rely on an honest distribution of resources when it comes to virtual machines, chroot jails, or Linux containers. Just keep fighting, like:

https://www.youtube.com/watch?v=_OvpzForHyU

ppigazzini commented 4 years ago

@tomtor @d3vv set cacheSizeGB: 0.5 and enabled 3 pserve instances (and backup process fixed)
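One way to confirm that such a setting took effect is to read the configured limit back from the `serverStatus` output. This is a sketch assuming the WiredTiger storage engine and a locally reachable mongod; the helper names and the connection address are illustrative, not taken from fishtest:

```python
def configured_cache_gb(server_status):
    """Extract the WiredTiger cache limit (in GiB) from a serverStatus document."""
    cache = server_status["wiredTiger"]["cache"]
    return cache["maximum bytes configured"] / 2**30


def read_cache_gb_from_server(uri="mongodb://localhost:27017"):
    """Query a live server; needs the third-party pymongo package (assumed installed)."""
    from pymongo import MongoClient

    client = MongoClient(uri)  # address is an assumption
    return configured_cache_gb(client.admin.command("serverStatus"))


# Offline example using a hand-written fragment of a serverStatus document:
sample = {"wiredTiger": {"cache": {"maximum bytes configured": 536870912}}}
print(configured_cache_gb(sample))  # 0.5
```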

tomtor commented 4 years ago

@ppigazzini For some reason 512MB MongoDb cache performs much worse than 2000MB even when the main page just shows the last 50 tests. The only reason I can think of is that the indices are too big to fit.

tomtor commented 4 years ago

The query for finished runs in rundb.py, around line 229 (edited):

c = self.runs.find(q, skip=skip, limit=limit, sort=[('last_updated', DESCENDING)]) 

The sort would need to inspect ALL tests.

@ppigazzini I wonder whether, if that sort is removed, we would still get recent tests first, but in reverse order of creation instead of update. Could you test removing the sort on the dev server?

If that works then we can do the sort on the retrieved 50 tests in Python to speed it up.
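A minimal sketch of the client-side sort suggested above, assuming each retrieved run is a dict with a `last_updated` field (as implied by the sort key in the query). The function name and sample data are hypothetical. Note that this only reorders whatever was fetched: without a server-side sort, MongoDB does not guarantee which documents come back first.

```python
from datetime import datetime
from operator import itemgetter


def newest_first(runs, limit=50):
    """Return up to `limit` run documents, most recently updated first."""
    ordered = sorted(runs, key=itemgetter("last_updated"), reverse=True)
    return ordered[:limit]


# Hypothetical documents standing in for the result of runs.find(...)
fetched = [
    {"_id": "a", "last_updated": datetime(2019, 11, 17, 9, 0)},
    {"_id": "b", "last_updated": datetime(2019, 11, 18, 12, 30)},
    {"_id": "c", "last_updated": datetime(2019, 11, 16, 23, 15)},
]
print([run["_id"] for run in newest_first(fetched)])  # ['b', 'a', 'c']
```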

ppigazzini commented 4 years ago

@tomtor I'm commuting. Anyway, I was able to raise the MongoDB cache to 1 GB.

ppigazzini commented 4 years ago

After the upgrade to Ubuntu the server seems stable, so I close this issue.

snicolet commented 4 years ago

It is such a joy to see a speedy server again, thanks everybody!

Alayan-stk-2 commented 4 years ago

I'm getting some "504 Gateway Time-out" when trying to access test pages

ppigazzini commented 4 years ago

@Alayan-stk-2 we jumped from MongoDB 2.x to MongoDB 4.x, and we are debugging an index compatibility problem (https://github.com/glinscott/fishtest/pull/434) that slows down MongoDB.

ppigazzini commented 4 years ago

The VPS suddenly went down. I have already written to Gary (the VPS owner) to get more information.

Vizvezdenec commented 4 years ago

Everything seems to be up now, but older tasks seem to have 0 workers assigned to them.

Vizvezdenec commented 4 years ago

Yes, this can be fixed manually by "resetting" them, for example by modifying the total number of games.

Vizvezdenec commented 4 years ago

https://i.imgur.com/oRRF2m3.png these tasks (and all the ones above them) are not assigned any workers; even when workers become free, they get reassigned to currently running/newer tasks.

ppigazzini commented 4 years ago

@Vizvezdenec if the problem is still there please open a new issue.

Alayan-stk-2 commented 4 years ago

Getting "502 Bad Gateway" now on the main page...

ppigazzini commented 4 years ago

@Alayan-stk-2 I stopped fishtest to change some MongoDB indexes not compatible with MongoDB 4.x