vondele opened this issue 2 years ago
I can try to do it in the morning but I am not sure I will have enough time since I will be away for most of the day. I think for now we can pass the proxy as a global. IMHO globals are not too bad if they are set only once.
Assuming this can be made to work, then we still have a DOS situation when the proxy is down (which is bound to happen sometimes).
So maybe, when --fleet is True, before actually requesting a task, the worker should do a dummy github api call, to make sure it can actually access github. If --fleet is False then the worker itself does DOS mitigation.
EDIT: It is sufficient to do this at startup of the worker, since if --fleet is True the worker will quit anyway if github, or the proxy, is inaccessible.
EDIT2: Actually the worker already does a github api call at startup, namely rate_limit. So if --fleet is True then the worker should just quit if this call fails.
I don't think workers caused a DOS, because fishtest tasks were disabled entirely amid the github issues, and the server is still down.
It is a DOS issue. The worker requests a task, which allocates resources on the server, and then dies. If this happens at high frequency the server will run out of resources eventually.
PS. I am not saying that this is the root cause why Fishtest is down right now, but it may have contributed to it.
@noobpwnftw
I made a PR #1194 for the worker to use ghproxy. But it does not work. Can you have a look to see if I am using the proxy URL correctly?
Well, in the case of a github failure or any failure, workers will report the failure to the server (also, in the case of missing reports, there is a timeout for each allocated task), therefore it shouldn't cause infinite resource grabbing, except for the amount of logs it may have caused?
Well we also keep tasks even if they fail since this is useful for debugging. But I think event log entries take more space.
Deleted a comment since there was something I hadn't thought of.
Trying again with the opposite conclusion. The recent flooding is really strange.
If the github api call rate_limit fails then the server code will assign the worker a task from a run it was recently assigned to, assuming said worker already has the binary, so that no network access would be necessary.
If the worker is restarted after an error (assuming --fleet True) it has no recently assigned runs from the server's point of view (since it gets a new unique_key). So if rate_limit fails the worker will get no task and will simply quit again.
So it seems that in the case of the flooding rate_limit must have succeeded, but further down the line the github api calls failed. Quite perverse.
I just tested this theory by causing an exception after the call to rate_limit:
```python
def get_rate():
    try:
        rate = requests.get(
            "https://api.github.com/rate_limit", timeout=HTTP_TIMEOUT
        ).json()["resources"]["core"]
        raise Exception("Github failure")  # <--------------
    except Exception as e:
        print("Exception fetching rate_limit:\n", e, sep="", file=sys.stderr)
        rate = {"remaining": 0, "limit": 5000}
        return rate, False
    remaining = rate["remaining"]
    print("API call rate limits:", rate)
    return rate, remaining < math.sqrt(rate["limit"])
```
Output:

```
$ python worker.py --fleet true
....
Exception fetching rate_limit:
Github failure   # <----------
Verify worker version...
Worker version checked successfully in 0.231998s
Fetch task...
Task requested in 0.177445s
No tasks available at this time, waiting...   # <---------
Exiting the worker since fleet==True and an error occurred
Waiting for the heartbeat thread to finish...
Heartbeat stopped
Deleting lock file /home/fishtest/fishtest/worker/worker.lock
```
The actual situation was probably like this: consider someone with a 50% chance of successfully accessing github. Whenever a rate_limit call failed, no harm done; otherwise, it floods your server. Given the number of workers and how the fleet workers were scheduled (looping over fishtest, dbcn, etc.), if all other tasks were empty then you'll have a high chance of getting a flooding situation.
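A back-of-the-envelope simulation of that scenario (purely illustrative: the 50% availability, the independence of successive github calls, and the number of follow-up calls are all assumptions):

```python
import random

def simulate(worker_starts=10_000, availability=0.5, later_calls=2):
    # Each fleet-worker start does one rate_limit call; only if that succeeds
    # does the server allocate a task, after which the worker still needs a
    # few more github calls (clone, fetch books, ...).
    flooded = 0
    for _ in range(worker_starts):
        if random.random() < availability:  # rate_limit happened to succeed
            if any(random.random() >= availability for _ in range(later_calls)):
                flooded += 1  # a task was allocated, then the worker died
    return flooded / worker_starts

print("fraction of starts that allocate a task and then abandon it:", simulate())
```

With these numbers roughly 0.5 × (1 − 0.5²) ≈ 37% of worker starts allocate a task and then abandon it, so partial github availability can flood the server in a way a full outage would not.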
Good point! It does not have to be an all or nothing situation.
From the log, the github failure issue started at Dec 10 03:03:32 with 4-5 failures/minute; this is the progression:
```
fishtest@tests:~$ grep "Dec 10 08:00:" ddos.txt | wc -l
3
fishtest@tests:~$ grep "Dec 10 10:00:" ddos.txt | wc -l
16
fishtest@tests:~$ grep "Dec 10 12:00:" ddos.txt | wc -l
42
fishtest@tests:~$ grep "Dec 10 14:00:" ddos.txt | wc -l
50
fishtest@tests:~$ grep "Dec 10 16:00:" ddos.txt | wc -l
56
fishtest@tests:~$ grep "Dec 10 18:00:" ddos.txt | wc -l
65
fishtest@tests:~$ grep "Dec 10 20:00:" ddos.txt | wc -l
76
fishtest@tests:~$ grep "Dec 10 22:00:" ddos.txt | wc -l
100
```
Here is another possible reason for the flooding. failed_task() invokes del_tasks(), whose implementation is this:
```python
def del_tasks(run):
    if "tasks" in run:
        run = copy.deepcopy(run)
        del run["tasks"]
    return run
```
Because of the deepcopy() this means that memory has to be allocated for each task (which is immediately freed again). For runs with thousands of tasks, this is perhaps expensive. A quick fix is to do copy() first, then delete the tasks, and finally do deepcopy(). A more thorough fix is to figure out exactly which fields the event log needs from a run, and only copy those.
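A sketch of the quick fix, assuming the same helper name and call sites as above (just reordering the copies, not new behaviour):

```python
import copy

def del_tasks(run):
    # Shallow copy first: cheap, and it still shares the big task list
    # with the original run.
    run = copy.copy(run)
    if "tasks" in run:
        del run["tasks"]  # drop the potentially huge embedded task list
    # Deep copy only the small remainder, so the event log entry cannot be
    # mutated through references into the live run document.
    return copy.deepcopy(run)
```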
I have suggested on Discord that we could simulate on DEV a failure of github with a slightly modified worker; this should allow us to find out what the cause of the slowdown is (if any). The worker change would be quite trivial, something like https://github.com/vondele/fishtest/commit/a235712b1131bcb69b3bd8728a249b8e5e4b549c
I see. Yes that would be a good idea! We could start with a single worker.
But independently of this I made some of the optimizations I suggested above into PR #1200. The deepcopy() is unnecessarily wasteful, and the two other issues are actually my fault (update_task() does not do buffer(run, True) if the task is finished, so there is no reason for failed_task() to do it).
@ppigazzini What do you think of this idea of stress testing the DEV server?
I'm set up to give it a try right now, see https://dfts-0.pigazzini.it/actions (but I won't do it unless ppigazzini is available and has time), but the concept seems to work.
http://www.askasya.com/post/largeembeddedarrays/
Fishtest violates to some extent the design principles of MongoDB. I think that if we had a separate task collection, caching runs would not be necessary (it would be done by the db, since at any time only a small amount of data would be live) and we could easily run many Fishtest instances in parallel.
It is indeed essential for performance that reads from or writes to MongoDB transfer small amounts of data.
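To make the separate task collection idea concrete, here is a hedged sketch with pymongo (the database, collection, and field names are invented for illustration; this is not the actual Fishtest schema):

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient()
db = client["fishtest_sketch"]  # hypothetical database name

# Runs keep only their small metadata document ...
run_id = db.runs.insert_one({"args": {"tc": "10+0.1"}, "finished": False}).inserted_id

# ... while tasks live in their own collection, each referencing its run.
db.tasks.create_index([("run_id", ASCENDING), ("active", ASCENDING)])
db.tasks.insert_one({"run_id": run_id, "worker": "worker-1", "active": True})

# Updating a single task then touches one small document instead of
# rewriting a run that embeds thousands of tasks, so MongoDB's own cache
# keeps only the live data hot.
db.tasks.update_one(
    {"run_id": run_id, "worker": "worker-1"},
    {"$set": {"active": False}},
)
```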
DEV is now running like PROD: mongodb 4.2.17 and python 3.8.12 (I was testing mongodb 5.0.5 and python 3.9.9). Please mind that PROD has a custom RAM/swap setup.
```
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           6.0G        824M        1.5G        1.1M        3.7G        5.2G
Swap:          3.0G         17M        3.0G

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           5.0G        3.1G        591M        2.1M        1.3G        1.9G
Swap:            0B          0B          0B
```
It might be a good idea to add the option of a github proxy, in light of the problems noob's workers have accessing github.
https://github.com/glinscott/fishtest/blob/c95857e5ae55018858841cb8952939e2b7e541c2/worker/games.py#L184 would need to insert https://ghproxy.com/ before the github api link, and probably the github rate api call in worker.py needs to return some reasonable default.
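A hedged sketch of both changes (GHPROXY, the helper names, and the fallback values are assumptions for illustration, not existing worker code):

```python
import sys
import requests

GHPROXY = "https://ghproxy.com/"  # hypothetical constant
HTTP_TIMEOUT = 15.0               # assumed timeout value

def github_url(url, use_proxy=True):
    # Insert the proxy in front of the github link, as suggested for games.py.
    return GHPROXY + url if use_proxy else url

def get_rate_with_default():
    # If the rate api is unreachable (for example when only the proxy works),
    # fall back to some reasonable default instead of blocking the worker.
    try:
        rate = requests.get(
            "https://api.github.com/rate_limit", timeout=HTTP_TIMEOUT
        ).json()["resources"]["core"]
    except Exception as e:
        print("Exception fetching rate_limit:\n", e, sep="", file=sys.stderr)
        rate = {"remaining": 5000, "limit": 5000}  # assumed "reasonable default"
    return rate
```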