Closed Jeffwan closed 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/bug | 0.95 |
area/testing | 0.78 |
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/engprod | 0.74 |
This is project kubeflow-ci-deployment
gcloud projects describe kubeflow-ci-deployment
createTime: '2019-02-05T01:54:33.083Z'
lifecycleState: ACTIVE
name: kubeflow-ci-deployment
parent:
  id: '1026404669954'
  type: folder
projectId: kubeflow-ci-deployment
projectNumber: '29647740582'
Here's a graph of the deployment manager API requests.
The number of requests started spiking this Saturday.
I think we need to figure out what caused that spike.
Is it possible to check the audit logs and find the caller?
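For reference, Admin Activity audit logs record the caller identity for Deployment Manager calls. A hedged sketch (assuming application-default credentials with permission to read logs in the project; the filter and field names follow the standard Cloud Audit Log format) that lists who has been making the calls:

```python
from googleapiclient import discovery

# Sketch: list recent Deployment Manager Admin Activity audit log entries and
# print who made each call.
logging_svc = discovery.build("logging", "v2")
body = {
    "resourceNames": ["projects/kubeflow-ci-deployment"],
    "filter": (
        'logName:"cloudaudit.googleapis.com%2Factivity" AND '
        'protoPayload.serviceName="deploymentmanager.googleapis.com"'
    ),
    "orderBy": "timestamp desc",
    "pageSize": 50,
}
resp = logging_svc.entries().list(body=body).execute()
for entry in resp.get("entries", []):
    payload = entry.get("protoPayload", {})
    print(entry.get("timestamp"),
          payload.get("authenticationInfo", {}).get("principalEmail"),
          payload.get("methodName"))
```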
I also noticed one cron job failed because of "global name 'cleanup_auto_blueprints' is not defined". Not sure if that's a problem preventing the job from cleaning up Deployment Manager stacks.
+ python -m kubeflow.testing.cleanup_ci --project=kubeflow-ci-deployment --gc_backend_services=true all --delete_script=/src/kubeflow/kubeflow/scripts/gke/delete_deployment.sh
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/cleanup_ci.py", line 1462, in <module>
    main()
  File "/src/kubeflow/testing/py/kubeflow/testing/cleanup_ci.py", line 1448, in main
    parser_blueprints.set_defaults(func=cleanup_auto_blueprints)
NameError: global name 'cleanup_auto_blueprints' is not defined
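For context on that traceback: argparse's `set_defaults(func=...)` evaluates the name when the parser is built, so if `cleanup_auto_blueprints` isn't defined (or imported) in cleanup_ci.py, the whole run dies before any cleanup happens. A minimal, hypothetical reproduction of the pattern and its fix (not the real cleanup_ci.py code):

```python
import argparse

# Hypothetical stand-in for the missing function; in cleanup_ci.py the fix is
# to define or import the real cleanup_auto_blueprints before referencing it.
def cleanup_auto_blueprints(args):
    print("would clean up auto-deployed blueprints in", args.project)

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", default="kubeflow-ci-deployment")
    subparsers = parser.add_subparsers()
    parser_blueprints = subparsers.add_parser("auto_blueprints")
    # If cleanup_auto_blueprints were not defined above, this line would raise
    # NameError when the parser is built, exactly like the traceback above.
    parser_blueprints.set_defaults(func=cleanup_auto_blueprints)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    args.func(args)
```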
Here's a list of all the deployments: deployments.txt
There are hundreds of auto-deployments with names like
kf-vmaster-*
kf-v1-*
So it looks like the auto-deploy jobs are running amok.
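For triage, a hedged sketch (assuming application-default credentials with read access to the project) that counts deployments by name prefix via the Deployment Manager API:

```python
from googleapiclient import discovery

# Count Deployment Manager deployments by name prefix to see how many
# auto-deployments (kf-vmaster-*, kf-v1-*) are piling up.
dm = discovery.build("deploymentmanager", "v2")
counts = {"kf-vmaster-": 0, "kf-v1-": 0, "other": 0}
request = dm.deployments().list(project="kubeflow-ci-deployment", maxResults=500)
while request is not None:
    response = request.execute()
    for d in response.get("deployments", []):
        prefix = next((p for p in ("kf-vmaster-", "kf-v1-")
                       if d["name"].startswith(p)), "other")
        counts[prefix] += 1
    request = dm.deployments().list_next(request, response)
print(counts)
```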
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
platform/gcp | 0.75 |
Here is the log from one of the auto-deploy jobs.
It looks like it is failing due to quota and permission errors.
: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'DEPLOYMENTS' exceeded. Limit: 1000.0 ForceSendFields:[] NullFields:[]}" filename="gcp/gcp.go:386"
INFO|2020-05-12T14:04:35|/src/jlewi/testing/py/kubeflow/testing/util.py|72| Error: failed to apply: (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp: (kubeflow.error): Code 400 with message: gcp apply could not update deployment manager Error could not update deployment manager entries; Creating kf-master-0512-58d-storage error(403): FORBIDDEN
jobs.auto-deploy-master-h9jj8-xf7xc.txt
I'm a little unclear about what the root cause is, but it looks like we have a cascading failure: auto-deployments are failing, so we end up retrying, which eats up quota and causes more failures due to quota issues.
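To illustrate the failure mode (this is a sketch, not the actual auto-deployer code): if every failed apply is retried immediately, each retry spends more Deployment Manager write quota, which makes the next attempt fail too. Bounding the retries and backing off is one way to cap the quota spent per deployment:

```python
import random
import time

def apply_with_backoff(apply_fn, max_attempts=3, base_delay=60):
    """Run apply_fn with a bounded number of retries and exponential backoff.

    apply_fn is whatever issues the Deployment Manager writes (hypothetical
    here); capping attempts keeps a single bad deployment from eating the
    daily write quota.
    """
    for attempt in range(max_attempts):
        try:
            return apply_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter so retries don't pile up at once.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```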
I'm temporarily disabling autodeployments until we can add an appropriate fix.
kubectl --context=kf-auto-deploy delete deploy auto-deploy-server
Does ci-team have permission on this project kubeflow-ci-deployment? Is there any job to clean up failed DEPLOYMENTS?
We have auto-deployments going back to 05/06. Those should have been GC'd by the autodeployer. https://github.com/kubeflow/testing/blob/94d7e3d7c4dfcb54dd397d7efb134da6ad9e2e52/py/kubeflow/testing/auto_deploy/reconciler.py#L428
The reconciler isn't matching the existing deployments which is why they aren't being GC'd.
@Jeffwan you should have access to the kubeflow-ci-deployment project.
There are two jobs to clean up the auto-deployments.
Looks like there is a bug in the reconciler and that's why they aren't being GC'd. I'm investigating.
It looks like the reconciler is skipping most deployments because they are missing a manifest.
{"message": "Skipping deployment kf-v1-0509-5fd it doesn't have a manifest", "filename": "/home/jlewi/git_kubeflow-testing/py/kubeflow/testing/auto_deploy/reconciler.py", "line": 285, "level": "ERROR", "time": "2020-05-12T16:58:36.071334-07:00", "thread": 140218345301824, "thread_name": "MainThread"}
This is coming from line: https://github.com/kubeflow/testing/blob/94d7e3d7c4dfcb54dd397d7efb134da6ad9e2e52/py/kubeflow/testing/auto_deploy/reconciler.py#L284
I added this in #657 because the reconciler was crashing on some deployments that didn't have a manifest.
Here's the YAML for that deployment. kf-v1-0509-5fd.yaml.txt
It indeed doesn't have a top-level "manifest" field (note there are manifest fields in subresources). So it looks like the update failed because of quota errors; we then fail to match this deployment, retry, and that causes the cascading failure.
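To illustrate the idea (a minimal sketch, not the actual kubeflow/testing#661 change; the function name and MAX_AGE threshold are made up): instead of skipping deployments that lack a top-level manifest, treat stale manifest-less deployments as failed and eligible for GC.

```python
import datetime

from dateutil import parser as date_parser  # assumes python-dateutil is installed

MAX_AGE = datetime.timedelta(hours=24)  # illustrative threshold

def should_gc(deployment, now=None):
    """Decide whether a Deployment Manager deployment should be garbage collected.

    `deployment` is a dict as returned by deploymentmanager.deployments().list().
    A deployment with no top-level 'manifest' never finished its insert/update
    (e.g. it died on a quota error), so rather than skipping it, treat it as
    failed and GC it once it is old enough.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    age = now - date_parser.parse(deployment["insertTime"])
    if not deployment.get("manifest"):
        # These were previously skipped ("doesn't have a manifest"), which left
        # failed deployments around and kept feeding the retry loop.
        return age > MAX_AGE
    # Deployments that do have a manifest go through the normal matching logic.
    return False
```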
I'm running with kubeflow/testing#661 to clean up the old deployments. However, it is also being slowed down because we are hitting our quota limits, so it will take a bit of time to recover as we need to wait for our quota to replenish itself.
Redeploying the auto-deployer with #661 in order to continually run GC to clean up failed deployments.
It looks like the quota we have exceeded is "Write queries per day", not "Queries per 100 seconds". So that might take some time to recover.
I applied for a quota increase; we will see whether that is granted. Otherwise we might need to wait until tomorrow for the quota to recover.
Thanks for the quick action. I will have a look as well.
Thanks @Jeffwan
Queries have dropped down significantly
Auto deployments appear to be healthy.
Here's a list of deployments: deployments.txt
Most of the auto-deployments were cleaned up.
We do have some E2E deployments lingering from 2020-05-07. It's not clear why these wouldn't have been GC'd by now.
Looks like the cleanup job is failing with the error @Jeffwan mentioned above. Here are the logs: cleanup_blueprints.txt
Looks like we are hitting quota limits again
INFO|2020-05-13T21:13:06|/src/jlewi/testing/py/kubeflow/testing/util.py|72| failed to apply: (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp: (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 403: Quota exceeded for quota group 'deploymentMutators' and limit 'Write queries per day' of service 'deploymentmanager.googleapis.com' for consumer 'project_number:29647740582'., rateLimitExceeded
We exceeded our daily limit of 1000 write queries again. I dropped it down to 500 so that at least the next time it happens we will have some buffer and won't have to wait for the quota to recover.
I think we exceeded our quota because of GC'ing all the old deployments. Looking back at https://github.com/kubeflow/testing/issues/660#issuecomment-627655418, we had ~1000 deployments, so if we deleted most of those that would amount to O(1000) write queries.
Here's an API table with the number of calls for the past 24 hours.
We had ~3.5K delete calls and ~342 insert calls; both count against the deploymentMutators write quota, so that is well past the 1000-per-day limit.
Our quota request was granted. We can now set the limit up to a maximum of 2000 write queries per day.
We have currently used 1024 write queries today.
I bumped our limit to 1250 write queries per day. Hopefully that will allow tests and auto-deployments to start working.
I didn't bump it to 2000 because I wanted to leave a buffer in case we still have bugs causing us to exhaust quota.
Queries are succeeding and it looks like we are being well behaved.
43 delete requests in the last hour; 10 insert requests in the last hour.
Thanks @jlewi! I retried a few jobs for verification.
Seems we can resolve the issue now. I verified a few jobs pass the tests.
I've confirmed that our DM QPS has resumed to normal levels.
It looks like in the past 24 hours we had
I'm lowering our daily write quota limit from 1250 to 500. This should be enough and will give us a buffer, so in the event a bug causes us to exceed the quota we can increase it and not have to wait for the quota to recover.
We've seen some issues like the one below. I am not sure whether the number of deploymentmanager deployments is over the limit or whether the write queries are over the limit. We need more investigation on this, and currently tests are blocked by this issue. /cc @jlewi