kubeflow / code-intelligence

ML-Powered Developer Tools, using Kubeflow
https://medium.com/kubeflow/reducing-maintainer-toil-on-kubeflow-with-github-actions-and-machine-learning-f8568374daa1?source=friends_link&sk=ac77444f00c230e7d787edbfb0081918
MIT License

Label microservice shows 500s contacting the metadata server when pods first start #88

Open jlewi opened 4 years ago

jlewi commented 4 years ago

See attached logs. Some of the worker pods for the label microservice are getting 500s when they first try to contact the metadata server.

google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\\n'", <google.auth.transport.requests._Response object at 0x7f8b8b9c2ac8>)

It appears to be able to get credentials, though, since it is able to verify that the Pub/Sub subscription exists. That takes about 4 minutes.
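Since the 500s seem to be transient at pod startup, one possible mitigation on the worker side would be to retry the failing call with exponential backoff. Below is a minimal sketch, not what the label microservice actually does: `retry_with_backoff` and all of its parameters are hypothetical names, and the commented usage assumes the error surfaces from `google.auth.default()` as `google.auth.exceptions.TransportError`, which may not match where it is raised in the worker.

```python
import time


def retry_with_backoff(fn, retriable=(Exception,), attempts=5,
                       initial_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call fn() until it succeeds or the attempts are exhausted.

    Only exceptions listed in `retriable` trigger a retry; anything else
    propagates immediately. The delay is multiplied by `factor` after
    each failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= factor


# Hypothetical use in the worker's startup path (requires google-auth):
#   import google.auth
#   from google.auth import exceptions
#   credentials, project = retry_with_backoff(
#       google.auth.default, retriable=(exceptions.TransportError,))
```

With the defaults this waits 1, 2, 4, and 8 seconds between the five attempts, so it would not cover a 4 minute outage on its own; the attempt count and delays would need tuning for that.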

label-bot-worker-5c8967dc7c-rgv9b.pod.logs.txt

I'm not seeing the errors reported in kubeflow/kubeflow#4607 in the metadata server logs.

gke-metadata-server.logs.txt

Note: I think a lot of the K8s errors in the logs are because the master was temporarily unavailable while it was upgrading.

My cluster is 1.14.9-gke.2

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

jlewi commented 4 years ago

I also observed kaniko jobs launched by skaffold getting stuck. The symptom was that the kaniko container started but emitted no logs.

I kicked the node metadata servers:

kubectl -n kube-system delete pods -l k8s-app=gke-metadata-server

And that appears to have caused things to start.

I'm running 1.14.9-gke.2

yantriks-edi-bice commented 4 years ago

I'm new to Kubeflow and this has caused me all kinds of grief. On top of the other issues with kfctl delete and reapply, this really makes things seem unusable. Glad to see it's a Google problem though ;-)