googleapis / python-logging


If VM metadata fetch fails in google.cloud.logging.Client.setup_logging(), all logging starts to fail #710

Closed stewartmiles closed 1 year ago

stewartmiles commented 1 year ago

Environment details

Steps to reproduce

  1. Hack retrieve_metadata_server() so that it returns 'test' when the provided metadata_key is in the set {'instance/id', 'instance/zone'} and None for all other keys. This simulates a partial failure, which can happen if a metadata server request fails.
  2. Try to log something and see unhandled exceptions from the logging thread.

Code example

test.py

import logging

import google.cloud.logging
import google.cloud.logging_v2.handlers._monitored_resources

# Fake implementation of
# google.cloud.logging_v2._helpers.retrieve_metadata_server()
# that simulates a metadata server error for all keys except
# GCE instance ID and zone.
def _fake_retrieve_metadata_server(metadata_key, timeout=5):
  if metadata_key in {'instance/id', 'instance/zone'}:
    return 'test'
  return None

# Patch the reference to
# google.cloud.logging_v2._helpers.retrieve_metadata_server
# imported into google.cloud.logging_v2.handlers._monitored_resources.
(google.cloud.logging_v2.handlers._monitored_resources.
 retrieve_metadata_server) = _fake_retrieve_metadata_server

client = google.cloud.logging.Client()
client.setup_logging()
logging.info('hello world')

Run python test.py

Stack trace

Program shutting down, attempting to send 1 queued log entries to Cloud Logging...
Waiting up to 5 seconds.
Failed to submit 1 logs.
Traceback (most recent call last):
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 550, in _ConvertFieldValuePair
    self._ConvertMapFieldValue(value, message, field)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 697, in _ConvertMapFieldValue
    getattr(message, field.name)[key_value] = _ConvertScalarFieldValue(
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 733, in _ConvertScalarFieldValue
    if _UNPAIRED_SURROGATE_PATTERN.search(value):
TypeError: expected string or bytes-like object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 580, in _ConvertFieldValuePair
    self.ConvertMessage(value, sub_message)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 476, in ConvertMessage
    self._ConvertFieldValuePair(value, message)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 594, in _ConvertFieldValuePair
    raise ParseError('Failed to parse {0} field: {1}.'.format(name, e))
google.protobuf.json_format.ParseError: Failed to parse labels field: expected string or bytes-like object.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/cloud/logging_v2/handlers/transports/background_thread.py", line 115, in _safely_commit_batch
    batch.commit()
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/cloud/logging_v2/logger.py", line 467, in commit
    client.logging_api.write_entries(
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/cloud/logging_v2/_gapic.py", line 154, in write_entries
    log_entry_pbs = [_log_entry_mapping_to_pb(entry) for entry in entries]
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/cloud/logging_v2/_gapic.py", line 154, in <listcomp>
    log_entry_pbs = [_log_entry_mapping_to_pb(entry) for entry in entries]
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/cloud/logging_v2/_gapic.py", line 564, in _log_entry_mapping_to_pb
    ParseDict(mapping, entry_pb)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 445, in ParseDict
    parser.ConvertMessage(js_dict, message)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 476, in ConvertMessage
    self._ConvertFieldValuePair(value, message)
  File "/home/stewart/src/agentic/service/.venv/lib/python3.8/site-packages/google/protobuf/json_format.py", line 588, in _ConvertFieldValuePair
    raise ParseError('Failed to parse {0} field: {1}.'.format(name, e))
google.protobuf.json_format.ParseError: Failed to parse resource field: Failed to parse labels field: expected string or bytes-like object..
Sent all pending logs.
daniel-sanche commented 1 year ago

Good catch, this should be fixed in https://github.com/googleapis/python-logging/pull/718. Thanks

stewartmiles commented 1 year ago

@daniel-sanche I don't think your pull request fixes this entirely. DNS lookups for the metadata server can also fail periodically, so this requires retries with backoff when querying the metadata server. Also, with your proposed pull request, logging would occasionally not be annotated with the correct context, without any warning about the root cause of the failure.

stewartmiles commented 1 year ago

@daniel-sanche did you see my previous comment https://github.com/googleapis/python-logging/issues/710#issuecomment-1410961279 ? I'm pretty sure this isn't fixed by https://github.com/googleapis/python-logging/pull/718.

daniel-sanche commented 1 year ago

No, sorry, I must have missed your comment!

Can you tell me a bit more about your environment? Are you using GCE? We usually expect the metadata server to be reliably available on GCP environments, so I'm surprised you're encountering these issues.

In general, our policy for the logging libraries is to auto-detect useful metadata where possible, but to do it on a best-effort basis. If a resource isn't passed in by the user, I'd be hesitant to block the thread for multiple retries to try to find the missing metadata. It's usually our policy to fall back to defaults if metadata can't be detected in a reasonable time frame.

If you're blocked and looking for a solution, one option would be to manually construct and pass in a resource object yourself, so that you don't have to rely on the auto-detection logic:

import google.cloud.logging
from google.cloud.logging_v2.resource import Resource

client = google.cloud.logging.Client()
resource = Resource(
    type="gce_instance",
    labels={
        "project_id": "my-project",
        "instance_id": "my-instance",
        "zone": "my-zone",
    },
)
client.setup_logging(resource=resource)
stewartmiles commented 1 year ago

We're running on GKE with nodes running on GCE. Our VM base image is gke-12314-gke401-cos-93-16623-295-14-v221129-c-pre, though I doubt that's relevant. It looks like we occasionally see DNS lookups for the metadata server fail, or the metadata server requests themselves fail. It was happening often enough, i.e. every few days, that the GCP logging library would crash and we would see Kubernetes pod restarts because of it.

We, like I assume many others, configure logging at the entry point of our application. If that takes up to 30s (the limit could be configurable) while waiting for a reliable connection to the metadata server, that is entirely reasonable to me compared to missing log context, which could result in logs not triggering alerts correctly.

We have applied the following monkey patch workaround at the moment:

from typing import Optional
from urllib import parse

import requests
from requests import adapters

import google.cloud.logging_v2._helpers


class CloudLoggingSetupError(Exception):
  """Raised if an error occurs while installing GCP logging."""

def _retrieve_metadata_server_with_retry(metadata_key: str,
                                         timeout: float = 5) -> Optional[str]:
  """Retrieves a metadata key from the metadata server, retrying on failure.

  This is a fork of google.cloud.logging_v2._helpers.retrieve_metadata_server()
  that retries if a metadata server query fails.
  This works around https://github.com/googleapis/python-logging/issues/710

  Args:
    metadata_key: Key of the metadata which will form the URL.
    timeout: Number of seconds to wait for the HTTP request.

  Returns:
    The value of the metadata key returned by the metadata server, or None if
    the key is not available.

  Raises:
    CloudLoggingSetupError: If the metadata server fails to respond.
  """
  url = google.cloud.logging_v2._helpers.METADATA_URL + metadata_key

  # Calculate the retries and backoff factor given a timeout.
  backoff_factor = timeout / 2
  retries = 0
  while backoff_factor > 0.1:
    retries += 1
    backoff_factor /= 2

  session = requests.Session()
  session.mount(
    parse.urlparse(url).scheme + '://',
    adapters.HTTPAdapter(
      max_retries=adapters.Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist={
          requests.codes.request_timeout,
          requests.codes.internal_server_error,
          requests.codes.bad_gateway,
          requests.codes.service_unavailable,
          requests.codes.gateway_timeout,
        })))

  try:
    response = session.get(
      url, headers=google.cloud.logging_v2._helpers.METADATA_HEADERS,
      timeout=timeout)

    if response.status_code == requests.codes.ok:
      return response.text
    elif response.status_code == requests.codes.not_found:
      # Expected response if the metadata value is not available.
      return None
    raise CloudLoggingSetupError(
      f'Failed to read metadata server value {metadata_key}: ({response})')

  except requests.exceptions.RequestException as e:
    # If the connection failed, surface the error to the caller.
    raise CloudLoggingSetupError(
      f'Failed to read metadata server value {metadata_key}: {e}') from e

Later in our logging setup code:

  # Patch the reference to
  # google.cloud.logging_v2._helpers.retrieve_metadata_server
  # imported into google.cloud.logging_v2.handlers._monitored_resources.
  (google.cloud.logging_v2.handlers._monitored_resources.
   retrieve_metadata_server) = _retrieve_metadata_server_with_retry
  # FYI: We apply a patch for https://github.com/googleapis/python-logging/issues/713
  client = google.cloud.logging.Client()
daniel-sanche commented 1 year ago

That's strange, I'm not sure why you'd have to wait 30 seconds for a valid response from the metadata server. The metadata server should be exposed as a local loopback call on the same host on GKE. If you're seeing DNS failures, you could try hitting 169.254.169.254 directly as a troubleshooting step (see the sketch below). I'd suggest reaching out to GCP support to see why it's not reliable for you.
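
For example, something like this (a rough sketch; instance/id is just an example key) should succeed even when cluster DNS is flaky, since it bypasses name resolution and talks to the metadata server's link-local IP directly:

import requests

# Query the GCE metadata server by its link-local IP, bypassing DNS.
# The Metadata-Flavor header is required by the metadata server.
response = requests.get(
    'http://169.254.169.254/computeMetadata/v1/instance/id',
    headers={'Metadata-Flavor': 'Google'},
    timeout=5)
print(response.status_code, response.text)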

Since you have a workaround, I'm going to go ahead and close this. Feel free to re-open if you need more code samples related to this. Or if anyone else is reading this and encountering this issue, please chime in and I can take another look

stewartmiles commented 1 year ago

FYI: We ended up root causing this in our GKE cluster. By default GKE configures DNS using kube-dns, which would be fine, except that when the cluster is updated, potentially every day, all kube-dns pods are torn down, i.e. there isn't a rolling update. When kube-dns is torn down (see https://cloud.google.com/kubernetes-engine/docs/how-to/kube-dns#architecture ), DNS stops working across the cluster. Other folks appear to have run into similar issues, but the easiest workaround is to install the NodeLocal DNSCache DaemonSet on every node, so that if cluster DNS does die, records like metadata.google.internal are probably still cached in the NodeLocal DNSCache. There is still a small chance that metadata.google.internal is not cached, e.g. if a pod starts coming up while a GKE cluster is being upgraded, in which case this bug will rear its head again.

It's all good for us; as I've mentioned, we have monkey patched the Google Cloud Logging client library to wait until DNS is available again. But I really think it would be great to have a mode that waits up to a timeout for the Google metadata service to become available.
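
To make that concrete, here is roughly what I have in mind. This is just a sketch; the helper name and the 30 second default are made up for illustration. setup_logging() could run something like this before resource detection and only fall back to defaults once the deadline expires:

import time

import requests

_METADATA_URL = 'http://metadata.google.internal/computeMetadata/v1/'
_METADATA_HEADERS = {'Metadata-Flavor': 'Google'}


def _wait_for_metadata_server(deadline_seconds: float = 30.0,
                              poll_interval: float = 1.0) -> bool:
  """Polls the metadata server until it responds or the deadline expires.

  Returns:
    True if the metadata server responded before the deadline, False otherwise.
  """
  deadline = time.monotonic() + deadline_seconds
  while time.monotonic() < deadline:
    try:
      response = requests.get(_METADATA_URL, headers=_METADATA_HEADERS,
                              timeout=poll_interval)
      if response.status_code == requests.codes.ok:
        return True
    except requests.exceptions.RequestException:
      # DNS failure or connection error; keep retrying until the deadline.
      pass
    time.sleep(poll_interval)
  return False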

Also, I'm not sure why the GKE folks think the default configuration with just kube-dns is the right thing to do, given the way GKE is updated. It would be great if you could open a Buganizer bug to figure out whether there is a better configuration that works well out of the box.