Good catch, this should be fixed in https://github.com/googleapis/python-logging/pull/718. Thanks
@daniel-sanche I don't think your pull request fixes this entirely. DNS lookups for the metadata server can also fail periodically, so querying the metadata server requires retries with backoff. Also, with your proposed pull request, logging would occasionally not be annotated with the correct context, without any warning about the root cause of the failure.
@daniel-sanche did you see my previous comment https://github.com/googleapis/python-logging/issues/710#issuecomment-1410961279 ? I'm pretty sure this isn't fixed in https://github.com/googleapis/python-logging/pull/718
No, sorry, I must have missed your comment!
Can you tell me a bit more about your environment? Are you using GCE? We usually expect the metadata server to be reliably available on GCP environments, so I'm surprised you're encountering these issues.
In general, our policy for the logging libraries is to auto-detect useful metadata where possible, but to do it on a best-effort basis. If a resource isn't passed in by the user, I'd be hesitant to block the thread for multiple retries to try to find the missing metadata. It's usually our policy to fall back to defaults if metadata can't be detected in a reasonable time frame.
If you're blocked and looking for a solution, one option would be to manually construct and pass in a resource object yourself, so that you don't have to rely on the auto-detection logic:
import google.cloud.logging
from google.cloud.logging_v2.resource import Resource

client = google.cloud.logging.Client()
# Manually describe the environment instead of relying on auto-detection.
resource = Resource(
    type="gce_instance",
    labels={
        "project_id": "my-project",
        "instance_id": "my-instance",
        "zone": "my-zone",
    },
)
client.setup_logging(resource=resource)
We're running on GKE with node clusters on GCE. Our VM base image is gke-12314-gke401-cos-93-16623-295-14-v221129-c-pre, though I doubt it's relevant. It looks like we occasionally see DNS lookups for the metadata server fail, or metadata server requests themselves fail. It was happening often enough (i.e. every few days) that the GCP logging library would crash and we would see Kubernetes pod restarts as a result.
We, like I assume many others, configure logging at the entry point of our application. If that takes up to 30s (the timeout could be configurable) while waiting for a reliable connection to the metadata server, that is entirely reasonable to me vs. missing log context, which could result in logs not triggering alerts correctly.
We have applied the following monkey patch workaround at the moment:
from typing import Optional
from urllib import parse

import requests
from requests import adapters

import google.cloud.logging_v2._helpers


class CloudLoggingSetupError(Exception):
    """Raised if an error occurs while installing GCP logging."""


def _retrieve_metadata_server_with_retry(metadata_key: str,
                                         timeout: float = 5) -> Optional[str]:
    """Retrieve the metadata key in the metadata server with a retry.

    This is a fork of google.cloud.logging_v2._helpers.retrieve_metadata_server()
    that retries if a metadata server query fails.
    This works around https://github.com/googleapis/python-logging/issues/710

    Args:
        metadata_key: Key of the metadata which will form the url.
        timeout: Number of seconds to wait for the HTTP request.

    Returns:
        The value of the metadata key returned by the metadata server or None.

    Raises:
        CloudLoggingSetupError: If the metadata server fails to respond.
    """
    url = google.cloud.logging_v2._helpers.METADATA_URL + metadata_key
    # Calculate the retries and backoff factor given a timeout.
    backoff_factor = timeout / 2
    retries = 0
    while backoff_factor > 0.1:
        retries += 1
        backoff_factor /= 2
    # Retry transient HTTP errors with exponential backoff.
    session = requests.Session()
    session.mount(
        parse.urlparse(url).scheme + '://',
        adapters.HTTPAdapter(
            max_retries=adapters.Retry(
                total=retries,
                backoff_factor=backoff_factor,
                status_forcelist={
                    requests.codes.request_timeout,
                    requests.codes.internal_server_error,
                    requests.codes.bad_gateway,
                    requests.codes.service_unavailable,
                    requests.codes.gateway_timeout,
                })))
    try:
        response = session.get(
            url, headers=google.cloud.logging_v2._helpers.METADATA_HEADERS,
            timeout=timeout)
        if response.status_code == requests.codes.ok:
            return response.text
        elif response.status_code == requests.codes.not_found:
            # Expected response if the metadata value is not available.
            return None
        raise CloudLoggingSetupError(
            f'Failed to read metadata server value {metadata_key}: ({response})')
    except requests.exceptions.RequestException as e:
        # If the connection failed, surface the error to the caller.
        raise CloudLoggingSetupError(
            f'Failed to read metadata server value {metadata_key}: {e}') from e
Later in our logging setup code:
import google.cloud.logging
import google.cloud.logging_v2.handlers._monitored_resources

# Patch the reference to
# google.cloud.logging_v2._helpers.retrieve_metadata_server
# imported in google.cloud.logging_v2.handlers._monitored_resources.
(google.cloud.logging_v2.handlers._monitored_resources.
    retrieve_metadata_server) = _retrieve_metadata_server_with_retry
# FYI: We apply a patch for https://github.com/googleapis/python-logging/issues/713
client = google.cloud.logging.Client()
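For reference, the rest of our setup just continues from the snippet above: once the patched reference is in place, resource auto-detection performed during setup_logging() goes through the retrying helper. A minimal sketch, assuming only the standard setup_logging() API:

# Continues from the previous snippet; `client` was created above with the
# patched retrieve_metadata_server reference already installed.
import logging

client.setup_logging()  # Installs the Cloud Logging handler on the root logger.
logging.getLogger(__name__).info("Cloud Logging configured with retrying metadata lookups.")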
That's strange, I'm not sure why you'd have to wait 30 seconds for a valid response from the metadata server. The metadata server should be exposed as a local loopback call on the same host on GKE. If you're seeing DNS failures, you could try pinging 169.254.169.254 directly as a troubleshooting step. I'd suggest reaching out to GCP support to see why it's not reliable for you.
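For example, a quick check that bypasses DNS entirely might look roughly like this (169.254.169.254 and the Metadata-Flavor: Google header are the standard GCE metadata-server conventions; the snippet itself is just an illustrative sketch, not library code):

# Probe the metadata server by IP, bypassing metadata.google.internal DNS.
import requests

METADATA_URL = "http://169.254.169.254/computeMetadata/v1/instance/id"

try:
    response = requests.get(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}, timeout=5)
    print("status:", response.status_code, "body:", response.text)
except requests.exceptions.RequestException as error:
    print("metadata server unreachable:", error)

If this succeeds while metadata.google.internal lookups fail, the problem is in cluster DNS rather than the metadata server itself.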
Since you have a workaround, I'm going to go ahead and close this. Feel free to re-open if you need more code samples related to this. Or if anyone else is reading this and encountering this issue, please chime in and I can take another look
FYI: We ended up root causing this in our GKE cluster. By default, GKE configures DNS using kube-dns, which would be fine, except that when the cluster is updated (potentially every day) all kube-dns pods are torn down, i.e. there isn't a rolling update. When kube-dns is torn down (see https://cloud.google.com/kubernetes-engine/docs/how-to/kube-dns#architecture ), DNS stops working across the cluster. Other folks appear to have run into similar issues, but the easiest workaround is to install the NodeLocal DNSCache DaemonSet on all nodes, so that if DNS does die there is a good chance that records like metadata.google.internal are still cached in NodeLocal DNSCache. There is still a small possibility for metadata.google.internal not to be cached, e.g. a pod starts coming up while a GKE cluster is being upgraded, in which case this bug will raise its head again.
It's all good for us; as I've mentioned, we have monkey patched the Google Cloud Logging client lib to wait until DNS is available again. But I really think it would be great to have a mode that waits up to a timeout for the Google metadata service to become available.
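To make that concrete, the kind of opt-in behaviour I have in mind is roughly the following; the wait_for_metadata_server() helper and its parameters are my own sketch, not an existing library API:

# Sketch of an opt-in "wait for the metadata server" mode; the helper name
# and parameters are hypothetical, not part of google-cloud-logging.
import time

import requests

import google.cloud.logging


def wait_for_metadata_server(timeout: float = 30, interval: float = 1) -> bool:
    """Poll the metadata server until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            response = requests.get(
                "http://metadata.google.internal/computeMetadata/v1/",
                headers={"Metadata-Flavor": "Google"},
                timeout=5)
            if response.status_code == requests.codes.ok:
                return True
        except requests.exceptions.RequestException:
            pass  # DNS or connection failure; try again after a short sleep.
        time.sleep(interval)
    return False


if not wait_for_metadata_server():
    # Fall through and let the library use its defaults, or raise here.
    print("Metadata server not reachable; resource labels may be missing.")

client = google.cloud.logging.Client()
client.setup_logging()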
Also, I'm not sure why the GKE folks think the default configuration with just kube-dns is the right thing to do given the way GKE is updated. It would be great if you could open a buganizer bug to figure out whether there is a better configuration that works well out of the box.
Environment details
google-cloud-logging version: 3.4.0

Steps to reproduce
retrieve_metadata_server() returns test when provided a metadata_key in the set {'instance/id', 'instance/zone'} and None for all other values. This can happen if a metadata server request fails.

Code example
test.py

Run python test.py

Stack trace