Closed noobpwnftw closed 3 years ago
Hi @noobpwnftw , interestingly, I cannot unblock users at the moment, so something worse must be happening. Specifically, I get a waitress error when I click on Pending.
EDIT: I can login normally.
Right, I only can't login with that particular account, the others seemed fine.
@noobpwnftw : can you post the error message you get?
On worker it says:
Fetch task... Incorrect username/password Exception accessing host: Traceback (most recent call last): File "fishtest-master/worker/worker.py", line 92, in worker SystemExit: 1
On website it says Invalid password
Ok, thx. I will get into this in the evening, hope is ok for you.
I just checked, there is not an user ChessDB_CN_WorkerPool in mongoDB.
python << EOF
import os, sys
sys.path.append(os.path.expanduser('~/fishtest/fishtest/fishtest'))
from rundb import RunDb
rdb = RunDb()
print rdb.userdb.get_user('ChessDB_CN_WorkerPool')
EOF
None
I view that there is a pending user, but I haven't the approver right on the production server to check if the user was blocked.
Hi @ppigazzini I have the rights but I get a waitress error when I try to view this.
The account shouldn't disappear though... So probably another disk full occurred?
Hi, I have manually unblocked the new user, and checked that the disk is not full. Can you check whether it works?
Hi @ppigazzini, should I just add you to the approvers? If you don't want for some reasons, ok, but I think it would make sense.
So there are two problems still: 1) Why can't I login to the Pending view 2) Why the original user from @noobpwnftw was missing?
@Stefano80 the new account works now.
Good. One down, two to go.
I also get the waitress error when i try to view /pending
What does the error log show?
Mmhh... cannot find the error logs, we need @ppigazzini (can you then add the information to the very cool overview?)
@Stefano80 I'm not a C++ SF developer, so I'm not a good approver. The logs are in (overview updated): /home/fishtest/mongodb/bin/mongod.out /home/fishtest/fishtest/fishtest/nohup.out
Unfortunately my first automatic action was to reboot the VPS and at the moment there is not log rotation, after a service restart or server reboot the logs are overwritten. I try to change the startup scripts in order to save save the old log.
Anyway here are the new logs: mongod.zip nohup.zip
Thx @ppigazzini ! I triggered the waitress error some times and I repeatedly checked the mongo log. The only thing which looks like some culprit could be this
Sat Jul 14 22:04:46.415 [TTLMonitor] ERROR: key for ttl index can only have 1 field
Sat Jul 14 22:04:46.415 [TTLMonitor] ERROR: key for ttl index can only have 1 field
Sat Jul 14 22:04:46.415 [TTLMonitor] ERROR: key for ttl index can only have 1 field
Does it make sense to anybody?
What does nohup.out show?
Nothing.
Workers got "no tasks available" very often, I'm running only 30 of them.
Also my username is deleted ...
This is the log of one of my workers. It was running fine, then an error during the test and, when the batch of 250 games was finished, the error with the login.
Started game 65 of 250 (New-26e3bd6 vs Base-a6fa6a9)
Finished game 63 (New-26e3bd6 vs Base-a6fa6a9): 1/2-1/2 {Draw by adjudication}
Score of New-26e3bd6 vs Base-a6fa6a9: 10 - 15 - 38 [0.460] 63
Task updated successfully in 0.132186s
Started game 66 of 250 (Base-a6fa6a9 vs New-26e3bd6)
Finished game 64 (Base-a6fa6a9 vs New-26e3bd6): 0-1 {Black wins by adjudication}
Score of New-26e3bd6 vs Base-a6fa6a9: 11 - 15 - 38 [0.469] 64
Task updated successfully in 0.141231s
Exception from calling update_task:
Traceback (most recent call last):
File "/home/fishtest/fishtest-master/worker/games.py", line 265, in run_game
if not req['task_alive']:
KeyError: 'task_alive'
Started game 67 of 250 (New-26e3bd6 vs Base-a6fa6a9)
Finished game 65 (New-26e3bd6 vs Base-a6fa6a9): 1/2-1/2 {Draw by insufficient mating material}
Score of New-26e3bd6 vs Base-a6fa6a9: 11 - 15 - 39 [0.469] 65
Task updated successfully in 0.092387s
Exception from calling update_task:
Traceback (most recent call last):
File "/home/fishtest/fishtest-master/worker/games.py", line 265, in run_game
if not req['task_alive']:
KeyError: 'task_alive'
....
Finished game 248 (Base-a6fa6a9 vs New-26e3bd6): 1/2-1/2 {Draw by insufficient mating material}
Score of New-26e3bd6 vs Base-a6fa6a9: 43 - 46 - 160 [0.494] 249
Task updated successfully in 0.105444s
Exception from calling update_task:
Traceback (most recent call last):
File "/home/fishtest/fishtest-master/worker/games.py", line 265, in run_game
if not req['task_alive']:
KeyError: 'task_alive'
Finished game 250 (Base-a6fa6a9 vs New-26e3bd6): 1-0 {White wins by adjudication}
Score of New-26e3bd6 vs Base-a6fa6a9: 43 - 47 - 160 [0.492] 250
Task updated successfully in 0.109892s
Exception from calling update_task:
Traceback (most recent call last):
File "/home/fishtest/fishtest-master/worker/games.py", line 265, in run_game
if not req['task_alive']:
KeyError: 'task_alive'
Elo difference: -5.56 +/- 25.85
Finished match
Finished match cleanly
Note: <type 'exceptions.OSError'> killing the process pid: 14895, possibly already terminated
Wait 10 seconds before upload of PGN...
Uploading compressed PGN of 363344 bytes
Task exited
Fetch task...
Incorrect username/password
Exception accessing host:
Traceback (most recent call last):
File "fishtest-master/worker/worker.py", line 92, in worker
SystemExit: 1
Fetch task...
Incorrect username/password
Exception accessing host:
Traceback (most recent call last):
File "fishtest-master/worker/worker.py", line 92, in worker
SystemExit: 1
update_users.py
exits with this error:
[fishtest@tests ~]$ ~/fishtest/fishtest/utils/update_users.py
Traceback (most recent call last):
File "/home/fishtest/fishtest/fishtest/utils/update_users.py", line 104, in <module>
main()
File "/home/fishtest/fishtest/fishtest/utils/update_users.py", line 101, in main
update_users()
File "/home/fishtest/fishtest/fishtest/utils/update_users.py", line 72, in update_users
process_run(run, info)
File "/home/fishtest/fishtest/fishtest/utils/update_users.py", line 33, in process_run
info[username]['last_updated'] = task['last_updated']
KeyError: u'ChessDB_CN_WorkerPool'
Ok, so if it isn't some security breach, shouldn't we just upgrade MongoDB and restore the database from a backup?
I don't think there can be some kind of bug in fishtest that deletes the account, since it's been running fine for so long. The only variable would be the increasing size of database that may eventually trigger some weird bug of ancient MongoDB installation.
Also my new account still has its machine limit, probably someone can help with raising it. The VMs I set up last time is still there, might be possible act as test servers.
/pending and update_users.py crash because ChessDB_CN_WorkerPool is in user_cache but not in users. This should never happen and there is no code in fishtest server that deletes user entries.
Creating a new ChessDB_CN_WorkerPool account (we want to keep the statistics in user_cache) could fix this unless other active accounts are also gone.
The only code which DOES delete users (idle ones) is scavenge.py so if we rule out a mongodb issue then we should inspect that code closely.
I think I found a possible cause. Under heavy load mongodb queries return incomplete results. So if an active user is dropped from the result list then scavenge.py will remove this user.
In this case ChessDB_CN_WorkerPool probably because it had very frequent updates.
Commenting out scavenge_users() from scavenge.py is a short term fix.
It could be the cause, but that doesn't explain why @ppigazzini 's account disappeared too.
@noobpwnftw raised the limit to 200 machine for your new account. @tomtor scavenge_users() commented out. Also my account "ctoks" is gone.
As long as we don't sync user and user_cache, /pending crashes and user CPU contributions are not recorded.
Usernames with '_' are no longer allowed. The most simple action would be to add @ppigazzini account in the normal way (just use the old names) and rename ChessDB_CN_WorkerPool to ChessDBCN by hand in the user_cache database?
I'm wondering if the simple solution to restore the deleted users could be mongorestore dump/fishtest_new/users.bson
from the backup of 13.07.2018.
From the documentation "mongorestore performs inserts only and does not perform updates. That is, if restoring documents to an existing database and collection and existing documents have the same value _id field as the to-be-restored documents, mongorestore will not overwrite those documents."
@ppigazzini Restoring just users sounds good
Hi @ppigazzini , I agree with @tomtor , from the documentation is what we need.
Users successfully restored:
$ mongorestore dump/fishtest_new/users.bson
connected to: 127.0.0.1
Sun Jul 15 21:07:26.024 dump/fishtest_new/users.bson
Sun Jul 15 21:07:26.024 going into namespace [fishtest_new.users]
Sun Jul 15 21:07:26.488 warning: Restoring to fishtest_new.users without dropping. Restored data will be inserted without raising errors; check your server log
1234 objects found
Sun Jul 15 21:07:26.493 Creating index: { key: { _id: 1 }, ns: "fishtest_new.users", name: "_id_" }
Sun Jul 15 21:07:26.593 Creating index: { key: { username: 1 }, unique: true, ns: "fishtest_new.users", name: "username_1" }
There are 5 users inactive from 1 day, candidate to be the users deleted.
Thx. Interestingly, it solved also the waitress error on the pending user page, fixing the last of the problems :-)
Should we close this issue or keep it open until we confirm somehow that the explanation proposed by @tomtor is the correct one?
@Stefano80 I think that all the code that check the users list was exiting with an error as update_users.py
@Stefano80 Please keep open the issue. If the @tomtor hypothesis is correct, we need to fix the scavenge_users().
Restore looks good, thanks @ppigazzini
Indeed we need to fix scavenge.py at a later date so let's keep this open.
Fast fix for the backup of the logs after a reboot
#!/bin/sh
DIR=/home/fishtest/logs
STAMP=$(date +%Y%m%d%H%M%S --utc -d '1 hour')
cd /home/fishtest
(
cd mongodb/bin
cp mongod.out $DIR/mongod.out.$STAMP
nohup ./mongod > mongod.out &
)
sleep 5
(
cd fishtest/fishtest
cp nohup.out $DIR/nohup.out.$STAMP
./start.sh
)
@tomtor the official server is running with scavenge_users() comment out by hand in scavenge.py. A new setup of the fishtest server will run scavenge_users(). Perhaps is safer to open a PR to comment out scavenge_users().
@ppigazzini I now know the root cause:
update_users.py will empty the user_cache for a small period when it runs.
If scavenge.py runs at the same time then users will deleted because they are considered not active.
We should add the scavenge_users functionality to update_users to fix this at a later time...
@tomtor we should also run the maintenance scripts in a more robust way. On the dev server (w/ a lower IO) a new scheduled instance of update_users.py was started while the previous one was still running: perhaps not dangerous for the DB integrity but could bring down the server performance.
I also can't login, I tried making another account but that doesn't work either
Worker version 59 connecting to http://tests.stockfishchess.org:80 Incorrect username/password Exception accessing host: Traceback (most recent call last): File "E:\fishtest\Fishtest-Worker-Portable\fishtest\worker-CPU0\worker.py", line 62, in worker sys.exit(1) SystemExit: 1
@kelvinwop You have to be unlocked. Login should work now...
For some reason the ChessDB_CN_WorkerPool login no longer works. Maybe someone can check if there is a bug or the password has indeed been changed(not by me).
EDIT: I can see a new restriction may have been applied, thus I have made a new account ChessDBCN which follows the new rules, please approve and raise machine limit, thanks!