official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
281 stars 129 forks source link

Server down #423

Closed crossbr closed 4 years ago

crossbr commented 4 years ago

"Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)"

31m059 commented 4 years ago

All pages appear to be down, not just the default page (http://tests.stockfishchess.org/tests), but also Actions (http://tests.stockfishchess.org/actions), Users (http://tests.stockfishchess.org/users), individual tests, etc.

Vizvezdenec commented 4 years ago

it happened 2-3 times in past days btw

31m059 commented 4 years ago

Thankfully, it seems to be online now. :)

vondele commented 4 years ago

server down again..

31m059 commented 4 years ago

@vondele Interestingly, live Elo seems to still be online this time (even though it's not receiving any new games, and therefore never updates): http://tests.stockfishchess.org/html/live_elo.html?5dbf24950ebc5925b64f1965

I wonder if that will help debug the error.

crossbr commented 4 years ago

@ppigazzini any idea what's causing this?

31m059 commented 4 years ago

@vondele @crossbr We appear to be back online.

ppigazzini commented 4 years ago

@crossbr no. Anyway I prefer to use my time to test the new VPS for the fishtest update.

vondele commented 4 years ago

@ppigazzini another server down event.

ppigazzini commented 4 years ago

@vondele rebooted. Just ironed out all the wrinkles on the new OpenVZ7 vps.

Alayan-stk-2 commented 4 years ago

Fishtest down again here, with a failed connection error from my browser.

valera-rozuvan commented 4 years ago
$ ping tests.stockfishchess.org
PING tests.stockfishchess.org (185.83.217.134) 56(84) bytes of data.
^C
--- tests.stockfishchess.org ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 143ms

It appears that server is not responding to ICMP echo requests.

vondele commented 4 years ago

the server was functional for a couple of hours, but seems down now.

ppigazzini commented 4 years ago

@vondele not reachable with ssh, the ISP informed about the VPS migration from OpenVZ6 to OpenVZ7 after November 6.

vondele commented 4 years ago

@ppigazzini online again, but with internal error.

ppigazzini commented 4 years ago

@vondele I'm looking.

ppigazzini commented 4 years ago

@vondele I loaded on the dev-server the latest db dump (07.11.2019 h 14.08 UTC 0, one hour ago), please check if the db is correct/updated.

vondele commented 4 years ago

from what I can see this looks OK.

31m059 commented 4 years ago

@vondele Only the homepage is down on the main server, these pages still seem to work:

Actions (http://tests.stockfishchess.org/actions) Users (http://tests.stockfishchess.org/users) Individual tests (e.g., http://tests.stockfishchess.org/tests/view/5dc274190ebc5904493b16a9)

ppigazzini commented 4 years ago

The problem is that I got this error with the online db:

Starting subprocess with monitor parent
Starting server in PID 597.
serving on http://0.0.0.0:6543
Overview exception: database error: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('5dc408e20ebc5902de59148b')
Overview exception: database error: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('5dc408e20ebc5902de59148b')
Overview exception: database error: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('5dc408e20ebc5902de59148b')
Overview exception: database error: BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('5dc408e20ebc5902de59148b')
...
31m059 commented 4 years ago

So the error is with this test (lowDepthExt2)? http://tests.stockfishchess.org/tests/view/5dc408e20ebc5902de59148b

I am now getting "refused to connect" errors on all pages, but from a cached version of Actions I have locally, that is the most recently-approved test. I can no longer see what it is, but @Vizvezdenec submitted it, perhaps he can describe if there was anything unusual about this test...

(not that there necessarily was, I'm not blaming you @Vizvezdenec!)

Edit: looking at it now on the dev-server (thanks for the link!), nothing seems unusual about this test... https://dfts-0.pigazzini.it/tests/view/5dc408e20ebc5902de59148b

ppigazzini commented 4 years ago

@31m059 I stopped pserve to make a backup of the online db and to load it on the dev server: the live migration of the VPS from OpenVZ6 to OpenVZ7 could have corrupted the db. In this case I will try to load the latest backup.

ppigazzini commented 4 years ago

fishtest is up and running. The online db was fine when loaded on the dev-server, so I deleted the mongo files in /var/db/* and reloaded the dump.

ppigazzini commented 4 years ago

Collateral effect: the png download is working now :)

vondele commented 4 years ago

thanks!

vondele commented 4 years ago

server down again.

valera-rozuvan commented 4 years ago

http://tests.stockfishchess.org/tests/view/5dc408e20ebc5902de59148b

Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)

So now the server responds, but something is wrong.

31m059 commented 4 years ago

@ppigazzini Server completely down again. tests.stockfishchess.org refused to connect. Seems unresponsive to ping also...

valera-rozuvan commented 4 years ago

What's at the core of the downtime? Can we make it restart automatically if a health end-point fails? Maybe move it to k8s - k8s has a built-in mechanism to restart services which break. See https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ .

d3vv commented 4 years ago

this service and database does not state that data should be treated like gold

ppigazzini commented 4 years ago

@31m059 thank you.

d3vv commented 4 years ago

People who are mired in nosql technologies.. They have nowhere to retreat.. They better get on with their politics and talk - "I don't care just what people say.."

ppigazzini commented 4 years ago

@tomtor pserve log

Starting server in PID 746.
serving on http://0.0.0.0:6543
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
Overview exception: 'ascii' codec can't encode character u'\u0394' in position 633: ordinal not in range(128)
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
[u'missing-input-response']
Overview exception: invalid literal for int() with base 10: "0'"
 Overview exception: invalid literal for int() with base 10: "0'"
Overview exception: invalid literal for int() with base 10: "0'"
Overview exception: invalid literal for int() with base 10: '0"'
Overview exception: invalid literal for int() with base 10: '0"'
Overview exception: invalid literal for int() with base 10: '0"'
Overview exception: invalid literal for int() with base 10: '0--'
Overview exception: invalid literal for int() with base 10: '0--'
Overview exception: invalid literal for int() with base 10: '0--'
ppigazzini commented 4 years ago

@d3vv this is an open source project, volunteers and PRs are welcome.

d3vv commented 4 years ago

@ppigazzini kidding? All sources about fishtest-clients, isn't? No one knows how the server works (structure and a hash of the hash)

d3vv commented 4 years ago

@MultiplyByZer01 Can you put it into words w/o emoji? :))

Vizvezdenec commented 4 years ago

Down again it seems, pings are not going through

valera-rozuvan commented 4 years ago

What do you do to make the server work again? Restart the failed process? Can we automate it with a cron job - like try to restart it every hour - if it's down?

tomtor commented 4 years ago

@ppigazzini The server is sometimes very slow on generating the main page. If we look at http://tests.stockfishchess.org/actions we can also see that the "Update user statistics" which used to take always about 30s, sometimes takes many minutes.

Normal time: 19-11-15 07:01:31 (31 seconds) and slow: 19-11-15 10:19:07 (over 3 minutes).

As I understand we had a change of server infrastructure? The IO-speed variance is much larger than in the old setup.

ppigazzini commented 4 years ago

@tomtor correct, the I/O is slower after the migration on the new OpenVZ7 infrastructure.

real 0m4.100s user 0m0.009s sys 0m3.529s


* September 2019 (https://github.com/glinscott/fishtest/issues/391#issuecomment-529411249):
```bash
$ time (dd if=/dev/zero of=/tmp/test bs=64k count=16k > /dev/null; sync)
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1.70122 s, 631 MB/s

real    0m1.710s
user    0m0.004s
sys     0m1.255s

Some facts:

ppigazzini commented 4 years ago

@valera-rozuvan I restart the server by hand. I will reinstall the server with Ubuntu 18.04 and new packages.

vondele commented 4 years ago

I'm not sure it is related, but an observation I made recently is that the number of games played for a test as shown on the main page and on the live_elo pages can be quite different. Sometimes a couple of 100s of games.

tomtor commented 4 years ago

I'm not sure it is related, but an observation I made recently is that the number of games played for a test as shown on the main page and on the live_elo pages can be quite different. Sometimes a couple of 100s of games.

Yes, that is caused by the caching of the main page. The fishtest server will only build one main page at a time and cache that for 2 seconds. So when the main page is requested and its age is > 2s than a new MongoDb query will be started to build the main page and the requesting user has to wait for the query to finish. During this period other users will be served the old cached page which is delivered instantly but has outdated info.

In the current setup the mongodb query can take over 30s (used to be about 2s) and as a result it will already have some outdated info when served.

The individual test pages are served from the various caches i added to fishtest the last 2 years, so they are not impacted by a slow MongoDb.

@ppigazzini I expect adding some more RAM would fix this issue so the recent MongoDb data is in the Linux kernel buffer cache.

tomtor commented 4 years ago

@ppigazzini Main page performance and "update user statistics" is now quite good, did you change something or is the provider infrastructure getting less loaded due to the approaching weekend?

ppigazzini commented 4 years ago

@tomtor no changes by me.

$ time (dd if=/dev/zero of=/tmp/test bs=64k count=64k > /dev/null; sync)
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB) copied, 12.42 s, 346 MB/s

real    0m13.360s
user    0m0.029s
sys     0m8.375s
glinscott commented 4 years ago

It looks like we are overshooting the CPU usage, and the host is shutting down the VPS.

image

glinscott commented 4 years ago

I've rebooted the server, but they will shut it down again if we keep hitting the CPU limits.

ppigazzini commented 4 years ago

@glinscott I already rebooted the server two hours ago. The main page was down? Anyway it seems that only the main page has problem.

crossbr commented 4 years ago

Are the CPU load spikes being caused by the completion of a test / start of a new test cycle, or are they unrelated to that?

ppigazzini commented 4 years ago

@crossbr they should be the hourly backups: mongodb dump, dump compression.