microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License
2.74k stars 211 forks

Degraded node network throughput when retina is installed #655

Open grzesuav opened 2 months ago

grzesuav commented 2 months ago

Describe the bug
Along with the AKS 1.29 upgrade, the retina agent was installed on our nodes, which resulted in degraded network throughput. Details in https://github.com/Azure/AKS/issues/4508

To Reproduce
See related issue - https://github.com/Azure/AKS/issues/4508

Expected behavior
Network throughput is not impacted by retina.

Screenshots
See related issue - https://github.com/Azure/AKS/issues/4508

vakalapa commented 2 months ago

@grzesuav thanks for reporting the issue. Can you give me some more information on whether the perf degradation is within the same node, or in traffic between different nodes? We know that intra-node traffic is somewhat affected by eBPF programs: there is no noise, so even a small eBPF program can affect the line rate.

If the perf degradation you saw was in inter-node communication, we can run some tests based on criteria you provide. We can then root-cause it to the one or more eBPF programs that could be causing this issue.

grzesuav commented 2 months ago

@vakalapa as far as I can tell, it was inter-node and to Azure (blob storage); however, I cannot say for 100%. We have an internal S3-like app which uses Azure Blob as its backend. The graphs from https://github.com/Azure/AKS/issues/4508 show the network throughput of this app. I'm not sure what more I can provide; as you can see there, the overall throughput is very high.

grzesuav commented 2 months ago

Hi, is there any update on this?

vakalapa commented 1 month ago

We are working on the performance pipeline for public test results. We have still been unable to repro the issue. @ritwikranjan, could you please tie this issue to your performance pipeline work?