official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

github proxy #1192

Open vondele opened 2 years ago

vondele commented 2 years ago

It might be a good idea to add the option of a github proxy, in light of the problems noob's workers have accessing github.

https://github.com/glinscott/fishtest/blob/c95857e5ae55018858841cb8952939e2b7e541c2/worker/games.py#L184 would need to insert https://ghproxy.com/ before the github api link (roughly as sketched below), and the github rate api call in worker.py would probably need to return some reasonable default.
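
A minimal sketch of the idea, assuming a ghproxy.com-style pass-through proxy; GITHUB_PROXY and proxied() are hypothetical names, not existing fishtest code:

GITHUB_PROXY = "https://ghproxy.com/"  # an empty string would disable the proxy

def proxied(url):
    # Prefix only github URLs; everything else is left untouched.
    if url.startswith("https://api.github.com/") or url.startswith("https://github.com/"):
        return GITHUB_PROXY + url
    return url

# e.g. requests.get(proxied(github_api_url), timeout=HTTP_TIMEOUT)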

vdbergh commented 2 years ago

I can try to do it in the morning but I am not sure I will have enough time since I will be away for most of the day. I think for now we can pass the proxy as a global. IMHO globals are not too bad if they are set only once.

vdbergh commented 2 years ago

Assuming this can be made to work, then we still have a DOS situation when the proxy is down (which is bound to happen sometimes).

So maybe, when --fleet is True, before actually requesting a task, the worker should do a dummy github api call, to make sure it can actually access github. If --fleet is False then the worker itself does DOS mitigation.

EDIT: It is sufficient to do this at startup of the worker. Since if --fleet is True the worker will quit anyway if github, or the proxy, is inaccessible.

EDIT2: Actually the worker already does a github api call at startup, namely rate_limit. So if --fleet is True the worker should simply quit if this call fails, e.g. along the lines of the sketch below.
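
A hedged sketch of that startup check; get_rate() and the --fleet option exist in the worker, but the options object and the exact exit path used here are assumptions:

rate, near_limit = get_rate()
if options.fleet and rate.get("remaining", 0) == 0:
    # Treat "no remaining API calls" as "github unreachable" and bail out early.
    print("Cannot reach the github API and --fleet is set, exiting.", file=sys.stderr)
    sys.exit(1)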

noobpwnftw commented 2 years ago

I don't think the workers caused a DOS, because fishtest tasks were disabled entirely amid the github issues, and the server is still down.

vdbergh commented 2 years ago

It is a DOS issue. The worker requests a task, which allocates resources on the server, and then dies. If this happens at high frequency the server will run out of resources eventually.

vdbergh commented 2 years ago

PS. I am not saying that this is the root cause why Fishtest is down right now, but it may have contributed to it.

vdbergh commented 2 years ago

@noobpwnftw

I made a PR #1194 for the worker to use ghproxy. But it does not work. Can you have a look to see if I am using the proxy URL correctly?

noobpwnftw commented 2 years ago

Well, in the case of a github failure or any other failure, workers will report the failure to the server (and in the case of missing reports there is a timeout for each allocated task), therefore it shouldn't cause unbounded resource grabbing except for the amount of logs it may have caused?

vdbergh commented 2 years ago

Well, we also keep tasks even if they fail, since this is useful for debugging. But I think event log entries take up more space.

vdbergh commented 2 years ago

Deleted a comment since it turned out there was something I hadn't thought of.

vdbergh commented 2 years ago

Trying again with the opposite conclusion. The recent flooding is really strange.

If the github api call rate_limit fails then the server code will assign the worker a task from a run it was recently assigned to, on the assumption that said worker already has the binary, so that no network access would be necessary.

If the worker is restarted after an error (assuming --fleet True) it has no recently assigned runs (since it gets a new unique_key). So if rate_limit fails the worker will get no task and will simply quit again.

So it seems that in the case of the flooding rate_limit must have succeeded, but further down the line the git api calls failed. Quite perverse.
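
A hedged sketch of that server-side logic; the field names (worker_info["rate"], recent_workers) and the simplified scheduling below are assumptions, not the actual rundb code:

def request_task(worker_info, runs):
    # worker_info["rate"] is the github rate info the worker reports;
    # "remaining" == 0 is taken here as "github unreachable or rate-limited".
    github_ok = worker_info["rate"]["remaining"] > 0
    if not github_ok:
        # Only runs this worker (keyed by unique_key) was recently assigned to qualify,
        # since for those the binaries are already present locally.
        recent = [r for r in runs if worker_info["unique_key"] in r.get("recent_workers", [])]
        if not recent:
            return None  # a freshly restarted fleet worker has no recent runs: it quits
        return {"run": recent[0]["_id"]}
    return {"run": runs[0]["_id"]}  # normal scheduling, simplified to "first run"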

vdbergh commented 2 years ago

I just tested this theory by causing an exception after the call to rate_limit

# get_rate() from worker.py; requests, sys, math and HTTP_TIMEOUT come from the
# surrounding module. The raise below is the artificially injected failure.
def get_rate():
    try:
        rate = requests.get(
            "https://api.github.com/rate_limit", timeout=HTTP_TIMEOUT
        ).json()["resources"]["core"]
        raise Exception("Github failure")  # <-------- injected failure for the test
    except Exception as e:
        print("Exception fetching rate_limit:\n", e, sep="", file=sys.stderr)
        rate = {"remaining": 0, "limit": 5000}
        return rate, False
    remaining = rate["remaining"]
    print("API call rate limits:", rate)
    return rate, remaining < math.sqrt(rate["limit"])

Output

$ python worker.py --fleet true
....
Exception fetching rate_limit:
Github failure  # <----------
Verify worker version...
Worker version checked successfully in 0.231998s
Fetch task...
Task requested in 0.177445s
No tasks available at this time, waiting... # <---------
Exiting the worker since fleet==True and an error occurred
Waiting for the heartbeat thread to finish...
Heartbeat stopped
Deleting lock file /home/fishtest/fishtest/worker/worker.lock

noobpwnftw commented 2 years ago

The actual situation was probably like this: suppose someone has a 50% chance of successfully accessing github. Whenever a rate_limit call fails, no harm is done; otherwise the worker carries on and floods your server. Given the number of workers and how the fleet workers were scheduled (looping over fishtest, dbcn, etc.), if all other task queues were empty you would have a high chance of ending up in a flooding situation.
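
A back-of-envelope sketch of that scenario (all numbers below are assumptions, not measurements): every fleet worker that passes rate_limit but then fails a later github call reports a failed task to the server.

import random

workers, p, attempts_per_hour = 1000, 0.5, 12
failed_reports = sum(
    1
    for _ in range(workers * attempts_per_hour)
    if random.random() < p  # rate_limit succeeded, a later github call then fails
)
print(f"~{failed_reports} failed-task reports per hour hitting the server")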

vdbergh commented 2 years ago

Good point! It does not have to be an all or nothing situation.

ppigazzini commented 2 years ago

From the log, the github failure issue started at Dec 10 03:03:32 with 4-5 failures/minute; this is the progression:

fishtest@tests:~$ grep "Dec 10 08:00:" ddos.txt | wc -l
3
fishtest@tests:~$ grep "Dec 10 10:00:" ddos.txt | wc -l
16
fishtest@tests:~$ grep "Dec 10 12:00:" ddos.txt | wc -l
42
fishtest@tests:~$ grep "Dec 10 14:00:" ddos.txt | wc -l
50
fishtest@tests:~$ grep "Dec 10 16:00:" ddos.txt | wc -l
56
fishtest@tests:~$ grep "Dec 10 18:00:" ddos.txt | wc -l
65
fishtest@tests:~$ grep "Dec 10 20:00:" ddos.txt | wc -l
76
fishtest@tests:~$ grep "Dec 10 22:00:" ddos.txt | wc -l
100

vdbergh commented 2 years ago

Here is another possible reason for the flooding: failed_task() invokes del_tasks(), whose implementation is this:

def del_tasks(run):
    if "tasks" in run:
        run = copy.deepcopy(run)  # deep-copies every embedded task, only to delete them below
        del run["tasks"]
    return run

Because of the deepcopy(), memory has to be allocated for every task (and immediately freed again). For runs with thousands of tasks this is potentially expensive. A quick fix is to do a shallow copy() first, then delete the tasks, and only then do the deepcopy(); see the sketch below. A more thorough fix would be to figure out exactly which fields the event log needs from a run, and only copy those.
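
A minimal sketch of that quick fix (order of operations only, not the final code):

import copy

def del_tasks(run):
    if "tasks" in run:
        run = copy.copy(run)      # shallow copy: the task list is still shared, not duplicated
        del run["tasks"]          # drop the reference to the (shared) task list
        run = copy.deepcopy(run)  # deep-copy only the small remaining document
    return run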

vondele commented 2 years ago

I have suggested on discord that we could simulate a github failure on DEV with a slightly modified worker; this should allow us to find out what the cause of the slowdown is (if any). The worker change would be quite trivial, something like https://github.com/vondele/fishtest/commit/a235712b1131bcb69b3bd8728a249b8e5e4b549c
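
For reference, an illustration of the kind of change meant here (the linked commit is the actual version; the wrapper name requests_get is an assumption):

def requests_get(remote, *args, **kw):
    if "github" in remote:
        # Pretend github is unreachable so the server-side effect can be observed on DEV.
        raise Exception("simulated github failure for " + remote)
    return requests.get(remote, *args, **kw)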

vdbergh commented 2 years ago

I see. Yes that would be a good idea! We could start with a single worker.

But independently of this I made some of the optimizations I suggested above into PR #1200. The deepcopy() is unnecessarily wasteful, and the two other issues are actually my fault (update_task() does not do buffer(run, True) if the task is finished, so there is no reason for failed_task() to do it).

vdbergh commented 2 years ago

@ppigazzini What do you think of this idea of stress testing the DEV server?

vondele commented 2 years ago

I'm set up to give it a try right now (see https://dfts-0.pigazzini.it/actions), but I won't do it unless ppigazzini is available and has time. The concept seems to work, though.

vdbergh commented 2 years ago

http://www.askasya.com/post/largeembeddedarrays/

Fishtest violates to some extent the design principles of mongodb. I think that if we had a separate task collection, caching runs would not be necessary (it would be done by the db, since at any time only a small amount of data would be live) and we could easily run many Fishtest instances in parallel.

tomtor commented 2 years ago

> http://www.askasya.com/post/largeembeddedarrays/
>
> Fishtest violates to some extent the design principles of mongodb. I think that if we had a separate task collection, caching runs would not be necessary (it would be done by the db, since at any time only a small amount of data would be live) and we could easily run many Fishtest instances in parallel.

It is indeed essential for performance that reads from or writes to MongoDB transfer only small amounts of data.
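
A hedged sketch of the "separate task collection" idea (this is not the current fishtest schema; collection and field names are assumptions):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["fishtest_sketch"]

# One small document per task instead of a large embedded array on each run.
db.tasks.create_index([("run_id", ASCENDING), ("task_id", ASCENDING)])

def active_tasks(run_id):
    # Reads touch only the handful of tasks that are actually live.
    return db.tasks.find({"run_id": run_id, "active": True})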

ppigazzini commented 2 years ago

DEV is now running like PROD: mongodb 4.2.17 and python 3.8.12 (I was testing mongodb 5.0.5 and python 3.9.9). Please note that PROD has a custom RAM/swap setup.