official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests
270 stars 126 forks source link

Avoid >1 workers in directory using the filelock package. #2079

Closed vdbergh closed 2 weeks ago

vdbergh commented 2 weeks ago

This greatly simplifies the code and also should make it more robust.

vdbergh commented 2 weeks ago

Alas the filelock page seems to require Python >= 3.8 (for the type annotations). Closing for now.

vdbergh commented 2 weeks ago

I included version 3.4.1 of filelock which according to PyPi is the last one which is compatible with Python 3.6.

The use of filelock in this PR is completely trivial so using an older version (but still version 3) seems acceptable to me...

vondele commented 2 weeks ago

might not be super portable though https://github.com/tox-dev/filelock/issues/67

vdbergh commented 2 weeks ago

Hmm. The proposed solution of falling back to SoftFileLock is not acceptable.

The difficulty of finding a cross-platform locking package shows that our implementation in the worker (check after 0.5s if we won the race) is not so unreasonable...

Closing this for now.

vdbergh commented 2 weeks ago

There are some file system operations which are usually atomic (e.g. #2068 ) that could be easily used to implement a cross platform lock. However I cannot think of a race free way for dealing with staleness.

vdbergh commented 2 weeks ago

Of course stale lock files should be rather rare. So if we detect one we can afford to do something naive:

The wait is necessary to avoid a race when two processes discover a stale lock file. Without the wait one of the processes (being slow) may delete the lock file created by the other process.

vdbergh commented 2 weeks ago

Latest version. It works well on Linux.

import atexit
import logging
import os
import threading
import time
from pathlib import Path

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

class Timeout(Exception):
    pass

# These deal with stale lock file detection 
_touch_period = 2.0
_stale_detect = 7.0
_stale_delay = 0.5

# These deal with acquiring locks
_repeat_delay = 0.3

class OpenLock:
    def __init__(self, lock_file, detect_stale=False, timeout=None):
        self.__lock_file = Path(lock_file)
        self.__timeout=timeout
        self.__detect_stale=detect_stale
        self.__lock = threading.Lock()
        self.__acquired = False
        self.__repeat_touch = threading.Thread(target=self.__touch, daemon=True)
        self.__repeat_touch.start()
        logger.debug("Lock created")

    def __touch(self):
        while True:
            if self.__acquired:
                self.__lock_file.touch()
            time.sleep(_touch_period)

    def __is_stale(self):
        try:
            atime = os.path.getatime(self.__lock_file)
        except Exception:
            return False
        if atime < time.time() - _stale_detect:
            return True
        return False

    def __remove_lock_file(self):
        try:
            os.remove(self.__lock_file)
            logger.debug("Lock file removed")
        except Exception:
            pass

    def acquire(self, detect_stale=None, timeout=None):
        with self.__lock:
            if timeout is None:
                timeout = self.__timeout
            if detect_stale is None:
                detect_stale = self.__detect_stale
            wait_time=0
            while True:
                if detect_stale:
                    if self.__is_stale():
                        logger.debug("Removing stale lock file")
                        os.remove(self.__lock_file)
                        time.sleep(_stale_delay)

                try:
                    fd = os.open(self.__lock_file, flags=os.O_CREAT | os.O_EXCL)
                    os.close(fd)
                    atexit.register(self.__remove_lock_file)
                    logger.debug("Lock acquired")
                    self.__acquired = True
                    break
                except Exception as e:
                    if timeout is not None and wait_time>=timeout:
                        raise Timeout("Unable to acquire lock") from None
                    else:
                        wait_time+=_repeat_delay
                        time.sleep(_repeat_delay)

    def release(self):
        with self.__lock:
            self.__acquired = False
            self.__remove_lock_file()
            atexit.unregister(self.__remove_lock_file)
            logger.debug("Lock released")

    def __enter__(self):
        self.acquire()

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()

if __name__ == "__main__":
    with OpenLock("test.lock", detect_stale=True, timeout=0):
        time.sleep(100)
vdbergh commented 2 weeks ago

I made a github repo

https://github.com/vdbergh/openlock