madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Contention issue when triggering a lot of Swarming tasks in a short period of time #89

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Repro:
echo "print('I\'m fine')" > fine.py
./tools/run_on_bots.py --swarming https://chromium-swarm.appspot.com \
    --isolate-server https://isolateserver.appspot.com --priority 5 \
    --deadline 7200 fine.py

Expected:
Works.

Actual:
Many tasks fail with:
  Tests aborted. AbortRunner() called. Reason: Runner has become stale.

It's pretty bad since it is user visible. Since the error is printed at collect 
time and not trigger time, there's nothing the user can do to fix this. Note 
that this issue is not reproducible on the canary master, since the load is not 
high enough.

Original issue reported on code.google.com by maruel@chromium.org on 14 Mar 2014 at 3:11

GoogleCodeExporter commented 9 years ago
If it is related to load, we should be able to repro on the Canary by just 
running that script a couple of times (maybe adding a sleep to the script).

Original comment by csharp@chromium.org on 14 Mar 2014 at 3:19

GoogleCodeExporter commented 9 years ago
I tried reproducing the problem on the Canary server, since it is much less 
disruptive there. I added a --repeat flag to run_on_bots.py so I can generate 
10x the load:
./tools/run_on_bots.py --swarming https://chromium-swarm-dev.appspot.com \
    --isolate-server https://isolateserver-dev.appspot.com --priority 5 \
    --repeat 10 fine.py
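
For readers who want to try the same kind of load generation, here is a minimal sketch of how a `--repeat` flag could multiply the number of triggered tasks. It is not the actual run_on_bots.py change: `trigger_task` is a hypothetical stand-in for the real trigger call, and only the flag name comes from the command above.

```python
import argparse


def trigger_task(script, index):
    """Hypothetical stand-in for the real trigger call; returns a fake task name."""
    return '%s-%d' % (script, index)


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Assumed semantics: trigger the same script N times to multiply the load.
    parser.add_argument('--repeat', type=int, default=1)
    parser.add_argument('script')
    return parser.parse_args(argv)


def trigger_all(argv):
    """Triggers the script --repeat times and returns the fake task names."""
    args = parse_args(argv)
    return [trigger_task(args.script, i) for i in range(args.repeat)]
```

With this shape, `trigger_all(['--repeat', '10', 'fine.py'])` issues ten triggers instead of one.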

Sadly, I was not able to reproduce the AbortRunner() failure, but I got a fair 
number of HTTP 503s and this traceback a few times:

--- CUT HERE ---
  File "/base/data/home/apps/s~chromium-swarm-dev/560-0d2f4af.374373064579888318/components/auth/handler.py", line 100, in dispatch
    identity = method_func(self.request)
  File "/base/data/home/apps/s~chromium-swarm-dev/560-0d2f4af.374373064579888318/components/auth/handler.py", line 268, in oauth_authentication
    client_id = oauth.get_client_id(oauth_scope)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/oauth/oauth_api.py", line 165, in get_client_id
    _maybe_call_get_oauth_user(_scope)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/oauth/oauth_api.py", line 215, in _maybe_call_get_oauth_user
    apiproxy_stub_map.MakeSyncCall('user', 'GetOAuthUser', req, resp)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 94, in MakeSyncCall
    return stubmap.MakeSyncCall(service, call, request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 328, in MakeSyncCall
    rpc.CheckSuccess()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
    raise self.exception
DeadlineExceededError: The API call user.GetOAuthUser() took too long to 
respond and was cancelled.
--- CUT HERE ---
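
Transient failures like the HTTP 503s and the deadline error above are commonly absorbed on the client side by retrying with exponential backoff. The following is a minimal sketch of that general technique, not the Swarming client's actual retry logic. `TransientError`, `call_with_retries` and the default delays are all assumptions made for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. HTTP 503)."""


def call_with_retries(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Calls func() until it succeeds or max_attempts calls have failed.

    The wait doubles after each failure, plus random jitter so that many
    clients hitting the server at once do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the failure.
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the delays can be observed or stubbed out in tests.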

Original comment by maruel@chromium.org on 14 Mar 2014 at 6:13

GoogleCodeExporter commented 9 years ago
I'm sad that the GetOAuthUser API is at the same level of reliability as other 
GAE APIs :(

Though I think it might not be a problem in the long term: bots will be using 
our own auth implementation (IP whitelist for now), which may be more reliable.
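
To illustrate the idea of an IP whitelist check, here is a minimal sketch. It is not the components/auth implementation. The `BOT_WHITELIST` ranges are documentation-only example networks, `is_whitelisted` is a hypothetical helper, and the sketch uses the Python 3 `ipaddress` module rather than the python27 runtime seen in the traceback.

```python
import ipaddress

# Hypothetical whitelist; these are reserved documentation ranges, not real bots.
BOT_WHITELIST = [
    ipaddress.ip_network('192.0.2.0/24'),
    ipaddress.ip_network('2001:db8::/32'),
]


def is_whitelisted(remote_addr, whitelist=BOT_WHITELIST):
    """Returns True if remote_addr parses as an IP inside a whitelisted network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(addr in net for net in whitelist)
```

A check like this makes no outbound RPC, so it would not be exposed to the kind of API deadline seen with GetOAuthUser.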

Original comment by vadimsh@chromium.org on 14 Mar 2014 at 6:18

GoogleCodeExporter commented 9 years ago
Issue chromium:354263 has been merged into this issue.

Original comment by maruel@chromium.org on 20 Mar 2014 at 12:13

GoogleCodeExporter commented 9 years ago
Seems like it's related to the missing index I fixed in 
fa07bcbb4c1a7f02f847422c70a21e3961a9bb35. I had deployed the fix to the canary 
server but not to prod yet. I deployed it to prod a few minutes ago and will 
monitor the ereporter2 report over the next hour (which has, btw, been doing 
error reports hourly for a while now).

Original comment by maruel@chromium.org on 20 Mar 2014 at 1:24

GoogleCodeExporter commented 9 years ago
Disabled the cron job on the prod server while I'm debugging the problem in 
redaf98aa0ed876469b5ddef20b421c05b4c9e51e. The problem should not be visible on 
the chromium try server starting now.

Original comment by maruel@chromium.org on 20 Mar 2014 at 3:22

GoogleCodeExporter commented 9 years ago
Issue 93 has been merged into this issue.

Original comment by maruel@chromium.org on 8 Apr 2014 at 4:54

GoogleCodeExporter commented 9 years ago
Issue 52 has been merged into this issue.

Original comment by maruel@chromium.org on 8 Apr 2014 at 7:35

GoogleCodeExporter commented 9 years ago
It's mostly fixed but I'll run a few tests on the prod instance to confirm.

Original comment by maruel@chromium.org on 29 May 2014 at 1:53

GoogleCodeExporter commented 9 years ago
It's not perfect, but it works well with our current load (~350 bots) and was 
tested under a much higher load.

Original comment by maruel@chromium.org on 5 Jun 2014 at 4:09