Open dwilliams782 opened 3 months ago
I extracted a single pod's worth of adding/removing logs for a single service, e.g.:
then passed it through a noddy little .py script to build up a collection of the IP addresses:
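import json
import re
from datetime import datetime, timezone

addr_list = []
with open('<logs-file.json>') as data_file:
    data = json.load(data_file)

# Replay the entries in the reverse of the export order.
for v in reversed(data):
    # Raw string so the \d and \. escapes reach the regex engine unchanged.
    log_search = re.search(r'(Adding|Removing) endpoint addr=(\d+\.\d+\.\d+\.\d+)', v["line"])
    if log_search is None:
        continue  # not an endpoint add/remove line
    # Timestamps are nanoseconds since the epoch; convert to seconds, format as UTC.
    timestamp = datetime.fromtimestamp(int(v["timestamp"]) / 1_000_000_000, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    action = log_search.group(1)
    addr = log_search.group(2)
    print(v["line"])
    if action == "Adding" and addr not in addr_list:
        print("Adding {0}".format(addr))
        addr_list.append(addr)
    elif action == "Removing" and addr in addr_list:
        # Guarded so that removing an untracked address does not raise ValueError.
        print("Removing {0}".format(addr))
        addr_list.remove(addr)
    print("{0}: {1}: {2}".format(timestamp, len(addr_list), addr_list))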
and I noticed two things:
There are a lot of repeated logs adding the same IP addresses. I see this reflected in the outbound_http_balancer_endpoints metric, as the number sometimes goes to 0 - why?
By the end of the script, we ended up with about 3x the number of IPs that actually existed for that service, and the number never aligns with the value of outbound_http_balancer_endpoints. I'll validate this again today, but from my initial testing these logs aren't all that accurate?
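To put numbers on the first observation, a small sketch like the one below could count how many "Adding" lines repeat an address that is already tracked, and how many "Removing" lines refer to an address that is not. It assumes the same log wording the script above matches. `count_redundant_events` is a hypothetical helper name, and the sample lines are synthetic, not taken from a real proxy.

```python
import re
from collections import Counter

# Same pattern the script above relies on (an assumption about the log wording).
ENDPOINT_RE = re.compile(r"(Adding|Removing) endpoint addr=(\d+\.\d+\.\d+\.\d+)")


def count_redundant_events(lines):
    """Replay log lines (oldest first) and tally redundant events per address."""
    active = set()               # addresses currently considered present
    repeat_adds = Counter()      # "Adding" seen while the address was already present
    unknown_removes = Counter()  # "Removing" seen for an address that was not present
    for line in lines:
        match = ENDPOINT_RE.search(line)
        if match is None:
            continue  # ignore unrelated log lines
        action, addr = match.groups()
        if action == "Adding":
            if addr in active:
                repeat_adds[addr] += 1
            active.add(addr)
        else:
            if addr not in active:
                unknown_removes[addr] += 1
            active.discard(addr)
    return active, repeat_adds, unknown_removes


if __name__ == "__main__":
    # Synthetic sample lines for illustration only.
    sample = [
        "Adding endpoint addr=10.0.0.1",
        "Adding endpoint addr=10.0.0.1",
        "Removing endpoint addr=10.0.0.2",
    ]
    active, repeat_adds, unknown_removes = count_redundant_events(sample)
    print(len(active), dict(repeat_adds), dict(unknown_removes))
```

Feeding it `v["line"]` for each entry of the export, in the same order the script replays them, would show whether the repeats are spread across all addresses or concentrated on a few.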
What version are you on? The outbound_http_balancer_endpoints metric was broken and was fixed in edge-24.5.1 (ref. linkerd/linkerd2-proxy#2928). Granted, that doesn't explain the repeated IP entries in the logs; I'll discuss this with the team.
One possible stop-gap for quieting the logs is adding linkerd_pool_p2c=error to the proxy log setting.
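As a rough illustration of that stop-gap, the directive can be appended to whatever proxy log level is already in use. The sketch below assumes the setting is a comma-separated list of directives and that the current value is `warn,linkerd=info`. Both the starting value and the helper name `with_directive` are assumptions for illustration. On a workload the value is commonly supplied through the `config.linkerd.io/proxy-log-level` annotation, but verify that against the docs for your version.

```python
def with_directive(log_level, target, level):
    """Return log_level with `target=level` set, replacing any existing directive for target."""
    directives = [d.strip() for d in log_level.split(",") if d.strip()]
    # Drop any earlier directive for the same target so the result has no duplicates.
    kept = [d for d in directives if d.split("=", 1)[0] != target]
    kept.append(f"{target}={level}")
    return ",".join(kept)


current = "warn,linkerd=info"  # assumed starting value, not read from a real cluster
print(with_directive(current, "linkerd_pool_p2c", "error"))
# prints: warn,linkerd=info,linkerd_pool_p2c=error
```

Replacing rather than blindly appending keeps the value stable if the helper is applied more than once.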
Thanks Alejandro - we're on 2.15.3-enterprise, I don't know whether that fix is in there or not.
I was unable to reproduce the erroneous logs starting fresh in a clean environment, so I think it is actually related to turning HAZL off and on. I'll submit a proper Buoyant support ticket when I have good repro steps on that one.
Ok the metrics fix is already included in the version you're on. Looking forward to the support ticket with the repro steps :-)
What problem are you trying to solve?
I'm finding the following logs incredibly verbose (for example, we have deployments with >100 pods constantly rotating under HPA), and they mask other, more interesting logs. Might these be better suited to debug level?
How should the problem be solved?
Move these logs from info to debug level.
Any alternatives you've considered?
n/a