kubeflow / testing

Test infrastructure and tooling for Kubeflow.

Quota exceeded for quota group 'deploymentMutators' - tests are failing #660

Closed · Jeffwan closed 4 years ago

Jeffwan commented 4 years ago

We've seen issues like the one below:

Insert deployment error: googleapi: Error 403: Quota exceeded for quota group 'deploymentMutators' and limit 'Write queries per day' of service 'deploymentmanager.googleapis.com' for consumer 'project_number:29647740582'., rateLimitExceeded

I am not sure whether the number of Deployment Manager deployments is over the limit or the number of write queries is over the limit. This needs more investigation, and tests are currently blocked by this issue.
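
The error message alone doesn't distinguish the two limits. One way to check the deployment-count side is to list the project's Deployment Manager deployments and count them; a minimal sketch with google-api-python-client (assumes application-default credentials with access to the project):

```python
# Count Deployment Manager deployments to see how close we are to the
# per-project 'DEPLOYMENTS' limit (1000 at the time of this issue).
from googleapiclient import discovery

PROJECT = "kubeflow-ci-deployment"

dm = discovery.build("deploymentmanager", "v2")
deployments = []
request = dm.deployments().list(project=PROJECT)
while request is not None:
    response = request.execute()
    deployments.extend(response.get("deployments", []))
    # list_next handles pagination; it returns None when no pages remain.
    request = dm.deployments().list_next(request, response)

print("%d deployments in %s" % (len(deployments), PROJECT))
```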

/cc @jlewi

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.95
area/testing 0.78

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.74

jlewi commented 4 years ago

This is project kubeflow-ci-deployment

gcloud projects describe kubeflow-ci-deployment
createTime: '2019-02-05T01:54:33.083Z'
lifecycleState: ACTIVE
name: kubeflow-ci-deployment
parent:
  id: '1026404669954'
  type: folder
projectId: kubeflow-ci-deployment
projectNumber: '29647740582'

jlewi commented 4 years ago

Here's a graph of the deployment manager API requests. [graph]

The number of requests started spiking this Saturday.

I think we need to figure out what caused that spike.

Jeffwan commented 4 years ago

> The number of requests started spiking this Saturday.
>
> I think we need to figure out what caused that spike.

Is it possible to check the audit logs and find the caller?
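
Deployment Manager write calls show up in the project's admin-activity audit logs, so one way to find the caller is to scan those entries and tally the principals. A sketch with the google-cloud-logging client (the filter and payload layout follow the standard Cloud Audit Log schema; the timestamp bound is an assumption to match the window of the spike):

```python
# Tally which principals are issuing Deployment Manager write calls,
# based on Cloud Audit Log entries. Assumes google-cloud-logging and ADC.
from collections import Counter
from google.cloud import logging as cloud_logging

PROJECT = "kubeflow-ci-deployment"
FILTER = (
    'protoPayload.serviceName="deploymentmanager.googleapis.com" '
    'AND protoPayload.methodName:("insert" OR "update" OR "delete") '
    'AND timestamp>="2020-05-09T00:00:00Z"'
)

client = cloud_logging.Client(project=PROJECT)
callers = Counter()
for entry in client.list_entries(filter_=FILTER, page_size=1000):
    payload = entry.payload or {}
    auth = payload.get("authenticationInfo", {})
    callers[auth.get("principalEmail", "unknown")] += 1

for principal, count in callers.most_common():
    print(count, principal)
```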

Jeffwan commented 4 years ago

I also noticed one cron job failed because global name 'cleanup_auto_blueprints' is not defined. Not sure if that's a problem preventing the job from cleaning up Deployment Manager stacks.

+ python -m kubeflow.testing.cleanup_ci --project=kubeflow-ci-deployment --gc_backend_services=true all --delete_script=/src/kubeflow/kubeflow/scripts/gke/delete_deployment.sh
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/cleanup_ci.py", line 1462, in <module>
    main()
  File "/src/kubeflow/testing/py/kubeflow/testing/cleanup_ci.py", line 1448, in main
    parser_blueprints.set_defaults(func=cleanup_auto_blueprints)
NameError: global name 'cleanup_auto_blueprints' is not defined
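
This traceback is a plain scoping bug: `set_defaults(func=cleanup_auto_blueprints)` evaluates the name when the parser is built, so if the handler was never defined (or was removed) in that module, the whole cleanup script dies before any cleanup runs. A minimal sketch of the failure pattern and the fix (the names here are illustrative, not the actual cleanup_ci.py code):

```python
import argparse

# FIX: the handler must exist (be defined or imported) before it is
# wired into the parser; otherwise building the parser raises
# "NameError: global name 'cleanup_auto_blueprints' is not defined".
def cleanup_auto_blueprints(args):
    print("cleaning up auto-deployed blueprints in", args.project)

parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True)
subparsers = parser.add_subparsers()

parser_blueprints = subparsers.add_parser("blueprints")
parser_blueprints.set_defaults(func=cleanup_auto_blueprints)

args = parser.parse_args(["--project", "kubeflow-ci-deployment", "blueprints"])
args.func(args)
```
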
jlewi commented 4 years ago

Here's a list of all the deployments: deployments.txt

There are hundreds of auto-deployments with names like

kf-vmaster-*
kf-v1-*

So it looks like the auto-deploy jobs are running amok. A quick way to confirm is to bucket the deployments by name prefix; see the sketch below.
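
A small sketch of that bucketing (the suffix pattern just mirrors the naming convention visible in deployments.txt):

```python
# Bucket Deployment Manager deployments by name prefix, e.g.
# kf-v1-0509-5fd -> kf-v1, to see which job family dominates.
import re
from collections import Counter
from googleapiclient import discovery

PROJECT = "kubeflow-ci-deployment"

dm = discovery.build("deploymentmanager", "v2")
counts = Counter()
request = dm.deployments().list(project=PROJECT)
while request is not None:
    response = request.execute()
    for d in response.get("deployments", []):
        # Strip the trailing MMDD-hash suffix from the deployment name.
        counts[re.sub(r"-\d{4}-.*$", "", d["name"])] += 1
    request = dm.deployments().list_next(request, response)

for prefix, count in counts.most_common():
    print("%5d  %s-*" % (count, prefix))
```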

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
platform/gcp 0.75

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

jlewi commented 4 years ago

Here is the log from one of the auto-deploy jobs.

It looks like it is failing with quota and permission errors.

: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'DEPLOYMENTS' exceeded.  Limit: 1000.0 ForceSendFields:[] NullFields:[]}" filename="gcp/gcp.go:386"
INFO|2020-05-12T14:04:35|/src/jlewi/testing/py/kubeflow/testing/util.py|72| Error: failed to apply:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 400 with message: gcp apply could not update deployment manager Error could not update deployment manager entries; Creating kf-master-0512-58d-storage error(403): FORBIDDEN

jobs.auto-deploy-master-h9jj8-xf7xc.txt

I'm a little unclear about the root cause, but it looks like we have a cascading failure: auto-deployments are failing, so we end up retrying, which eats up quota and causes more failures due to quota issues.
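
This is the classic shape of a retry storm against a daily quota: each failed attempt consumes quota, which makes the next attempt more likely to fail. One standard mitigation is a hard retry cap with exponential backoff; a generic sketch, not the actual auto-deploy code (`apply_deployment` is a stand-in for whatever call hits the DM API):

```python
import random
import time

MAX_RETRIES = 5

def apply_with_backoff(apply_deployment):
    """Retry a quota-limited call with capped exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return apply_deployment()
        except Exception as e:  # real code should catch the specific 403/429
            if "rateLimitExceeded" not in str(e) or attempt == MAX_RETRIES - 1:
                raise
            # Sleep 2^attempt seconds plus jitter, capped at 5 minutes, so
            # retries don't burn through the daily write quota.
            time.sleep(min(2 ** attempt + random.uniform(0, 1), 300))
```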

jlewi commented 4 years ago

I'm temporarily disabling autodeployments until we can add an appropriate fix.

kubectl --context=kf-auto-deploy delete deploy auto-deploy-server

Jeffwan commented 4 years ago

Does the CI team have permission on the kubeflow-ci-deployment project? Is there any job to clean up failed DEPLOYMENTS?

jlewi commented 4 years ago

We have auto-deployments going back to 05/06. Those should have been GC'd by the autodeployer. https://github.com/kubeflow/testing/blob/94d7e3d7c4dfcb54dd397d7efb134da6ad9e2e52/py/kubeflow/testing/auto_deploy/reconciler.py#L428

Based on the logs for the [reconciler](https://cloud.console.google.com/logs/viewer?project=kubeflow-ci&folder&organizationId&minLogLevel=0&expandAll=false&interval=PT1H&resource=k8s_container%2Fcluster_name%2Fkubeflow-testing%2Fnamespace_name%2Ftest-pod&timestamp=2020-05-12T23:34:02.680000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22k8s_container%22%0Aresource.labels.cluster_name%3D%22kf-ci-v1%22%0Aresource.labels.namespace_name%3D%22auto-deploy%22%0Alabels.%22k8s-pod%2Fapp%22%3D%22auto-deploy%22%0Aresource.labels.container_name%3D%22reconciler%22&dateRangeStart=2020-05-12T22:36:34.340Z&dateRangeEnd=2020-05-12T23:36:34.340Z&scrollTimestamp=2020-05-12T23:35:51.975501000Z):

The reconciler isn't matching the existing deployments, which is why they aren't being GC'd.

@Jeffwan you should have access to the kubeflow-ci-deployment project.

There are two jobs that clean up the auto-deployments:

  1. reconciler
  2. cleanup_ci.py

Looks like there is a bug in the reconciler, and that's why they aren't being GC'd. I'm investigating.

jlewi commented 4 years ago

It looks like the reconciler is skipping most deployments because they are missing a manifest.

{"message": "Skipping deployment kf-v1-0509-5fd it doesn't have a manifest", "filename": "/home/jlewi/git_kubeflow-testing/py/kubeflow/testing/auto_deploy/reconciler.py", "line": 285, "level": "ERROR", "time": "2020-05-12T16:58:36.071334-07:00", "thread": 140218345301824, "thread_name": "MainThread"}

This is coming from line: https://github.com/kubeflow/testing/blob/94d7e3d7c4dfcb54dd397d7efb134da6ad9e2e52/py/kubeflow/testing/auto_deploy/reconciler.py#L284

I added this in #657 because the reconciler was crashing on some deployments that didn't have a manifest.
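
The guard from #657 means a deployment whose insert/update died mid-flight (e.g. on a quota error) has no top-level manifest, gets skipped, and is therefore never matched and never GC'd. A sketch of the two behaviors (the field name follows the Deployment Manager v2 resource; the handling is illustrative, not the actual reconciler code):

```python
def reconcile(deployment, gc_candidates):
    """Decide what to do with one Deployment Manager deployment dict."""
    if not deployment.get("manifest"):
        # Old behavior (from #657): log and skip -> the deployment is
        # invisible to GC and lingers forever.
        #
        # More robust: a manifest-less deployment is a failed or partial
        # deployment, so queue it for cleanup instead of ignoring it.
        gc_candidates.append(deployment["name"])
        return
    # ... normal matching/reconciliation against expected deployments ...
```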

jlewi commented 4 years ago

Here's the YAML for that deployment: kf-v1-0509-5fd.yaml.txt

It indeed doesn't have a top-level "manifest" field (note there are manifest fields in subresources). So it looks like the update failed because of quota errors; we then fail to match this deployment, retry, and that causes the cascading failure.

jlewi commented 4 years ago

I'm running with kubeflow/testing#661 to clean up the old deployments. However, it is also being slowed down because we are hitting our quota limits, so it will take a bit of time to recover as we need to wait for our quota to replenish itself.

jlewi commented 4 years ago

Redeploying the auto-deployer with #661 in order to continually run GC and clean up failed deployments.

jlewi commented 4 years ago

It looks like the quota we have exceeded is "Write queries per day", not queries per 100 seconds, so it might take some time to recover.

I applied for a quota increase; we will see whether that is granted. Otherwise we might need to wait until tomorrow for the quota to recover.

Jeffwan commented 4 years ago

Thanks for the quick action. I will have a look as well.

jlewi commented 4 years ago

Thanks @Jeffwan

jlewi commented 4 years ago

Queries have dropped down significantly. [graph: dm_queries]

Auto deployments appear to be healthy.

jlewi commented 4 years ago

Here's a list of deployments. deployments.txt

Most of the auto-deployments were cleaned up.

We do have some E2E deployments lingering from 2020-05-07. It's not clear why these wouldn't have been GC'd by now.

jlewi commented 4 years ago

Looks like the cleanup job is failing with the error @Jeffwan mentioned above. Here are the logs: cleanup_blueprints.txt

jlewi commented 4 years ago

Looks like we are hitting quota limits again:

INFO|2020-05-13T21:13:06|/src/jlewi/testing/py/kubeflow/testing/util.py|72| failed to apply:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 403: Quota exceeded for quota group 'deploymentMutators' and limit 'Write queries per day' of service 'deploymentmanager.googleapis.com' for consumer 'project_number:29647740582'., rateLimitExceeded

We exceeded our daily limit of 1000 write queries again. I dropped the limit down to 500 so that at least the next time this happens we will have some buffer and won't have to wait for quota to recover.

jlewi commented 4 years ago

I think we exceeded our quota because of GC'ing all the old deployments; see https://github.com/kubeflow/testing/issues/660#issuecomment-627655418.

We had ~1000 deployments, so if we deleted most of those, that would amount to O(1000) write queries.

jlewi commented 4 years ago

Here's an API table with the number of calls for the past 24 hours.

We had ~3.5K delete calls and ~342 insert calls.

[table: DMAPIcalls]

jlewi commented 4 years ago

Our quota request was granted. We can now set the limit up to a maximum of 2000 write queries per day.

We have currently used 1024 of today's write queries.

I bumped our limit to 1250 write queries per day. Hopefully that will allow tests and auto-deployments to start working.

I didn't bump it to 2000 because I wanted to leave a buffer in case we still have bugs causing us to exhaust quota.

jlewi commented 4 years ago

Queries are succeeding and it looks like we are being well behaved.

43 delete requests and 10 insert requests in the last hour.

Jeffwan commented 4 years ago

Thanks @jlewi! I retried a few jobs for verification.

Jeffwan commented 4 years ago

Seems we can resolve the issue now. I verified that a few jobs pass the tests.

jlewi commented 4 years ago

I've confirmed that our DM QPS has returned to normal levels. [graph: dm_queries]

It looks like in the past 24 hours we had

I'm lowering our daily write quota limit from 1250 to 500. This should be enough and will give us a buffer so that, in the event we exceed the quota because of a bug, we can increase it and not have to wait for the quota to recover.