official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
270 stars 126 forks source link

Different stale detection to avoid >1 workers in directory. #2082

Closed vdbergh closed 1 week ago

vdbergh commented 1 week ago

We touch the lock file every two seconds and check the age of the lock file when we start the worker. If the lock file is too old then we consider it as stale.

For the actual lock we open the lock file with the flags O_EXCL|O_CREAT. This is atomic and will fail if the lock file already exists.

The logic has been refactored and is contained in a separate package packages.openlock.

Not tested on Windows but it does not use any window specific apis.

vdbergh commented 1 week ago

CI doesn't seem to get any MacOS runners.

Disservin commented 1 week ago

@vdbergh the ci has a big backlog and workers are stuck here https://github.com/official-stockfish/fishtest/actions/runs/9603511710, + some of your ci seems to be stuck too, ci times of 3+ hours

vdbergh commented 1 week ago

Yes some of my older commits had a bug making the tests hang. But I thought they had all been cancelled now.

vdbergh commented 1 week ago

Drafted this since there is still a race condition.

vdbergh commented 1 week ago

The race condition involves three processes attempting to acquire the lock

PID Step
1 detects stale lock file
2 detects stale lock file
2 deletes lock file
3 creates lock file atomically
1 deletes lock file

The result is that process 3 thinks it holds the lock while in reality it is open.

I do not see a solution for this. It seems atomic file system operations (which are ubiquitous) do not really help with handling stale lock files (the goal being that in the common case, where there are no stale lock files, no timeouts are being used).

ppigazzini commented 1 week ago

@vdbergh We have just moved PROD and the net server on a new server (8 threads, 32 GB RAM, 438 GB hard disk space), running with Ubuntu 22.04 and MongoDB 7.

vdbergh commented 1 week ago

@vdbergh We have just moved PROD and the net server on a new server (8 threads, 32 GB RAM, 438 GB hard disk space), running with Ubuntu 22.04 and MongoDB 7.

Great. Although I liked the constrained environment of the previous server :) That makes it more pleasant to look for optimizations...

vdbergh commented 1 week ago

@ppigazzini A question about this PR. The worker uses the following command on Windows:

"powershell (Get-CimInstance` Win32_Process -Filter 'ProcessId = {pid}').CommandLine".format(pid)

However on CI this commands seems to return nothing - presumably because getting the command line of a process requires elevated privileges.

Does it work on a normal Windows where the user has default privileges?

ppigazzini commented 1 week ago

We kept the same configuration and minimalistic philosophy. I simply dropped cpulimit from the scheduled task because we can use 100% of the CPU without the risk to be stopped by the ISP for abuse. The only real bottleneck was the HD space for the pgns with a fleet. Now that constrain is dropped and no fleet will join anytime soon.

ppigazzini commented 1 week ago

Does it work on a normal Windows where the user has default privileges?

I can do some tests. Unfortunately, in my experience powershell command permission is a moving target with Windows versions, so local tests with VM are not really conclusive.

vdbergh commented 1 week ago

Does it work on a normal Windows where the user has default privileges?

I can do some tests. Unfortunately, in my experience powershell command permission is a moving target with Windows versions, so local tests with VM are not really conclusive.

Hmm. This command is what protects the worker (on master) against being started multiple times in the same directory (on Windows).

peregrineshahin commented 1 week ago

(Get-CimInstance Win32_Process -Filter "ProcessId = $pid") on Powershell works for me in a very non-admin-restricted environment.

vdbergh commented 1 week ago

Get-CimInstancenWin32_Process -Filter 'ProcessId = {pid}' on Powershell works for me in a very non-admin-restricted environment.

Yes I know. But (Get-CimInstancenWin32_Process -Filter 'ProcessId = {pid}').CommandLine is what does not work on CI...

vdbergh commented 1 week ago

The point is the (...).CommandLine suffix.

ppigazzini commented 1 week ago

@vdbergh Your cmdline has a wrong backtick (typo?).

I ran successfully the command in a normal cmd prompt, on a clean Windows 11 VM, with a standard user

"powershell (Get-CimInstance Win32_Process -Filter 'ProcessId = 8912').CommandLine"

image

vdbergh commented 1 week ago

Thanks. It's good to hear that it works!

Now I wonder how to deal with this in CI (where the command seems to fail silently)...

vdbergh commented 1 week ago

Actually the command in your example does not have arguments. Could you test it with a running worker (python worker.py)?

ppigazzini commented 1 week ago

Now I wonder how to deal with this in CI (where the command seems to fail silently)...

The culprit could be the Windows Server 2022 used by GitHub as runner.

vdbergh commented 1 week ago

There is also a bug in the worker since it is looking for PID + command in the returned string. But the returned string does not contain the PID...

I think I am just going to look for the string "python" (+PID) in the process list. There will not be many python processes on windows.

ppigazzini commented 1 week ago

"C:/msys64/ucrt64/bin\python3.11.exe" -i worker.py is the print of command after fixing the condition for the pid in the stdout and starting a second worker in the same folder.

______ _     _     _            _                        _
|  ___(_)   | |   | |          | |                      | |
| |_   _ ___| |__ | |_ ___  ___| |_  __      _____  _ __| | _____ _ __
|  _| | / __| '_ \| __/ _ \/ __| __| \ \ /\ / / _ \| '__| |/ / _ \ '__|
| |   | \__ \ | | | ||  __/\__ \ |_   \ V  V / (_) | |  |   <  __/ |
\_|   |_|___/_| |_|\__\___||___/\__|   \_/\_/ \___/|_|  |_|\_\___|_|

Worker started in C:\fishtest-worker-setup-winget\worker ... (PID=2092)
"C:/msys64/ucrt64/bin\python3.11.exe" -i worker.py

*** Worker (PID=2092) stopped! ***
Another worker (PID=10300) is already running in this directory, using the lock file:
C:\fishtest-worker-setup-winget\worker\worker.lock
Traceback (most recent call last):
  File "C:\fishtest-worker-setup-winget\worker\worker.py", line 1706, in <module>
    sys.exit(worker())
SystemExit: 1
>>>
vdbergh commented 1 week ago

Thanks! Still as long as it does not work in CI it is inconvenient (difficult to validate changes)...

vdbergh commented 1 week ago

I will double check if it really doesn't work.

vdbergh commented 1 week ago

Actually it does work. I guess I was confused by the PID bug. Sorry for the noise.