kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0
3.56k stars 328 forks source link

MeshAccessLog - add support for timeout and retry for TCP backend #8348

Open Constantin07 opened 8 months ago

Constantin07 commented 8 months ago

Description

Hi, When sending Kuma logs via DataDog service in K8s we get sporadically error messages from kuma-sidekar with failures to send logs. It would be nice to add support for configuring timeout & nr. of retries in case of TCP connection failure.

cbugneac-nex commented 8 months ago

Here is an example of exception raised:

2023-11-10T14:47:39.554Z    ERROR   access-log-streamer could not send the log to TCP log destination. Dropping the log {"address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "error": "failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518: write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out", "errorVerbose": "write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out
failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*logSender).send
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/sender.go:33
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).streamAccessLogs
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:163
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).Start.func2
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:72
runtime.goexit
\truntime/asm_amd64.s:1650"}
jakubdyszkiewicz commented 8 months ago

Triage: did you consider using OpenTelemetry? It might support this already (on the agent side)

samm-git commented 7 months ago

@jakubdyszkiewicz but opentelemetry are for metrics, and accessLogStreamer is for logs, isnt it?

jakubdyszkiewicz commented 7 months ago

OpenTelemetry can also be used for logs.

Here we support this in MeshAccessLog https://kuma.io/docs/2.5.x/policies/meshaccesslog/#opentelemetry

And it seems that Datadog agent now has a builtin support for it https://www.datadoghq.com/blog/agent-otlp-log-ingestion/

samm-git commented 7 months ago

@jakubdyszkiewicz thank you for the hint. I will check proposed configuration and will let you know if it works. I was using datadog tcp collector as i think it was recommended in the docs, but i would explore solution proposed above.

cbugneac-nex commented 7 months ago

Thanks @jakubdyszkiewicz for the suggestion, not sure how OpenTelemetry would help here. Does the OpenTelemetry client have support for retry/timeout settings ? Reason being is that if the backend on DataDog is not ready (temporarily) for any reason to accept connections - it would fail similar to TCP backend.

samm-git commented 7 months ago

@jakubdyszkiewicz i see that if oltp logs are specified kuma just configures envoy to send the logs in openmetrics format. But i did not found any retry logic/settings/documentation in envoy about it. Do you know any details?

samm-git commented 7 months ago

@jakubdyszkiewicz I was able to get OTLP logs working, however, it also has own problems:

  1. It is not documented that openmetric logs are using GRPC, was able to find it from the source code and rendered envoy config. This should be mentioned in the doc, as OTLP supports both.
  2. Field openTelemetry.attributes fields do not accept %KUMA_*% placeholders. Moreover, it renders bad config for envoy, so it stops accepting any changes.
  3. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly
  4. I also tried to put to openTelemetry.body object instead of string (just using test: "value" but seems that such config is silently ignored by Kuma and it reverts to default text line

Could you please help with that?

samm-git commented 7 months ago

Actually https://github.com/kumahq/kuma/blob/master/pkg/plugins/policies/meshaccesslog/plugin/xds/configurer.go#L313 missing interpolations

michaelbeaumont commented 7 months ago
  1. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly

@samm-git could you copy/paste an example of what you mean?

samm-git commented 7 months ago

@michaelbeaumont sure.

openTelemetry:
  body: '{"test":"value"}'

would render as escaped

"{\"test\": \"value\"}"
openTelemetry:
  body:
    test: "value"

gets ignored and default is in use (why?)

openTelemetry:
  body:
    - test: value

Actually gets rendered to the array

[{"test":"value"}]

what again prevents datadog from creating an index.

samm-git commented 7 months ago

Also, i did an experiment - i tried to kill Datadog daemonset to see how errors are handled.

From what i see - envoy silently dropped all messages when datadog agent was down. No errors were reported, nothing. So it's even worse compared to the original behavior, at least we had some reference that data is incomplete. @jakubdyszkiewicz ^^

samm-git commented 7 months ago

From what i see in Envoy rendered config - MeshAccessLogs does not configure any retry settings

{
  "log_name": ...,
  "grpc_service": {...},
  "transport_api_version": ...,
  "buffer_flush_interval": {...},
  "buffer_size_bytes": {...},
  "filter_state_objects_to_log": [],
  "grpc_stream_retry_policy": {...},
  "custom_tags": []
}

Most of the things related to retry and log flush are configured via buffer_size_bytes and grpc_stream_retry_policy. As we have only default values which are very small, logs likely will be lost.

jakubdyszkiewicz commented 7 months ago

Hey, I did some investigation on the "body" and our docs are lacking, or rather examples.

Our docs says

Body is a raw string or an OTLP any value as described at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-body It can contain placeholders available on https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/accesslog/usage#command-operators_

This body field is field from OTEL and it's quite complex. If you want to do key value you can do this

apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
  name: default
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh
  from:
    - targetRef:
        kind: Mesh
      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

KUMA_MESH is then interpolated just fine.

We should also do a better job at failing when the format is not right instead of fallback to a default. I agree that it's confusing.

When it comes to retries. I think grpc_stream_retry_policy could help. Let my try to compose MeshProxyPatch to configure this. If this becomes useful we could incorporate this into MeshAccessLog policy

samm-git commented 7 months ago

@jakubdyszkiewicz thank you. May i kindly ask to add interpolation to headers as well? As in fact, we don't need a body at all, these all could be injected as OTLP attributes. Plus it is very confusing to have different rules for headers and for the body

samm-git commented 7 months ago

@jakubdyszkiewicz JFYI - example

      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

works as expected and DataDog is happy with it and indexing the mesh field. I think it would be good to update the documentation with it as it is very confusing atm.

jakubdyszkiewicz commented 7 months ago

Ok, so to test retry you can check this proxypatch

apiVersion: kuma.io/v1alpha1
kind: MeshProxyPatch
metadata:
  name: custom-template-1
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh # use MeshService or MeshSubset if you want to apply this only on specific proxies
  default:
    appendModifications:
      - networkFilter:
          operation: Patch
          match:
            name: envoy.filters.network.http_connection_manager
            # origin: inbound # optional to if you want to add this on incoming listeners
            # listenerName: "inbound:10.42.0.9:3001" # optional if you want to add this to specific listener
          jsonPatches:
            - op: add
              path: /accessLog/0/typedConfig/commonConfig/grpcStreamRetryPolicy
              value:
                numRetries: 5
                retryBackOff:
                  baseInterval: "1s"
                  maxInterval: "10s"

I agree on docs. Created issue https://github.com/kumahq/kuma-website/issues/1538 Interpolation of attributes makes sense https://github.com/kumahq/kuma/issues/8492

Let me know how this retry goes.

bartsmykla commented 7 months ago

I added above example to OpenAPI schema of MeshAccessLog policy: https://github.com/kumahq/kuma/pull/8533/files#diff-ed8e7e0fddb480be4d601ae35bb953cddd53659fb3b9d306b43996a8f322cb38R82

github-actions[bot] commented 4 weeks ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.