linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0
10.6k stars 1.27k forks source link

[FR] - reduce endpoint added/removed logs to debug #12704

Open dwilliams782 opened 3 months ago

dwilliams782 commented 3 months ago

What problem are you trying to solve?

I'm finding the following logs to be incredibly verbose (we have deployments with >100 pods for example constantly rotating with HPA) and masking other more interesting logs. These may be better suited to debug?

INFO ThreadId(02) outbound:proxy{addr=10.224.15.134:5000}:service{ns=<> name=<> port=<>}: linkerd_pool_p2c: Removing endpoint
INFO ThreadId(02) outbound:proxy{addr=10.224.15.134:5000}:service{ns=<> name=<> port=<>}: linkerd_pool_p2c: Adding endpoint addr=<>:5000

How should the problem be solved?

Move logs to debug level from info.

Any alternatives you've considered?

n/a

How would users interact with this feature?

No response

Would you like to work on this feature?

None

dwilliams782 commented 3 months ago

I took an extraction of a single pod's worth of adding / removing logs to a single service, e.g.:

image

then passed it through a noddy little .py script to build up a collection of the IP addresses:

import json
import re
from datetime import datetime

addr_list = []

with open('<logs-file.json>') as data_file:
    data = json.load(data_file)
    for v in reversed(data):
        log_search = re.search('(Adding|Removing) endpoint addr=(\d+\.\d+\.\d+\.\d+)',v["line"])
        timestamp = datetime.utcfromtimestamp((int(v["timestamp"]) / 1_000_000_000)).strftime('%Y-%m-%d %H:%M:%S')
        action = log_search.group(1)
        addr = log_search.group(2)
        print(v["line"])
        if action == "Adding" and addr not in addr_list:
            print("Adding {0}".format(addr))
            addr_list.append(addr)
        elif action == "Removing":
            print("Removing {0}".format(addr))
            addr_list.remove(addr)
        print("{0}: {1}: {2}".format(timestamp, len(addr_list), addr_list))

and I noticed two things:

  1. There are a lot of repeat logs of adding the same IP addresses in. I see this reflected in outbound_http_balancer_endpoints metric as the number sometimes goes to 0 - why?

  2. By the end of the script, we ended up with about 3x the amount of IPs that actually existed for that service, and the number never aligns with the value of outbound_http_balancer_endpoints. I'll validate this again today but from my initial testing, these logs aren't all that accurate?

alpeb commented 3 months ago

What version are you on? The outbound_http_balancer_endpoints metrics was broken and got fixed in edge-24.5.1 (Ref linkerd/linkerd2-proxy#2928). Granted that doesn't explain the repeated IP entries in the logs; I'll discuss this with the team. One possible stop-gap for appeasing the logs is adding linkerd_pool_p2c=error to the proxy log setting.

dwilliams782 commented 3 months ago

Thanks Alejandro - we're on 2.15.3-enterprise, I don't know whether that fix is in there or not.

I was unable to reproduce the erroneous logs starting fresh in a clean environment, so I think it is actually related to turning HAZL off and on. I'll submit a proper Buoyant support ticket when I have good repro steps on that one.

alpeb commented 3 months ago

Ok the metrics fix is already included in the version you're on. Looking forward to the support ticket with the repro steps :-)

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.