skydive-project / skydive

An open source real-time network topology and protocols analyzer
https://skydive.network
Apache License 2.0

Errors in agent's netns topology probe on k8s #2365

Open waterjiao opened 3 years ago

waterjiao commented 3 years ago

Hello

I'm using the master version and running Skydive on k8s v0.19.0.

Env:

host: CentOS7
container: ubuntu20.04

My config (skydive.yaml, the Skydive agent ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: skydive-agent
  name: skydive-agent-config
data:
  SKYDIVE_AGENT_TOPOLOGY_PROBES: runc docker
  SKYDIVE_AGENT_LISTEN: 127.0.0.1:8081
  SKYDIVE_AGENT_TOPOLOGY_NETNS_RUN_PATH: /host/run

When I add a network namespace on the host (CentOS 7):

# ip netns add net1

Here's the skydive agent log:

2021-03-06T06:56:44.413Z  DEBUG  netns/netns.go:133 (*ProbeHandler).Register host2: Register network namespace: /host/run/net1
2021-03-06T06:56:50.125Z  ERROR  netns/netns.go:307 (*ProbeHandler).start host2: Failed to register namespace: /host/run/net1. All attempts fail:
#1: /host/run/net1 does not seem to be a valid namespace
#2: /host/run/net1 does not seem to be a valid namespace
#3: /host/run/net1 does not seem to be a valid namespace
#4: /host/run/net1 does not seem to be a valid namespace
...

Note the "/host/run/net1 does not seem to be a valid namespace" errors. They mean that the device number of /host/run/net1 is the same as the device number of /host/run.

The relevant code is:

if parent := filepath.Dir(path); parent != "" {
    if err := syscall.Stat(parent, &parentStats); err == nil {
        if stats.Dev == parentStats.Dev {
            return fmt.Errorf("%s does not seem to be a valid namespace", path)
        }
    }
}

I used the stat command to check this.

On the host:

# stat --format=%d /var/run/netns
22
# stat --format=%d /var/run/netns/net1
3

But in the agent pod (container):

# stat --format=%d /host/run
22
# stat --format=%d /host/run/net1
22

Note that net1's device number differs between the host and the pod.

It's tricky to debug. Has anyone encountered such a problem before?

Thanks

lebauce commented 3 years ago

Hello. We did encounter such bugs some time ago but it was supposed to be fixed :-)

The reason for the check is that "ip netns" first creates a regular file for the new namespace, then quickly bind mounts the namespace file from /proc onto that regular file. Until the bind mount is in place, the file is on the same device as its parent directory.

I'll try to reproduce the problem - pretty tricky to debug indeed - and I'll keep you updated

lebauce commented 3 years ago

Did you use the Kubernetes template in contrib/kubernetes? It specifies hostPID: true.

waterjiao commented 3 years ago

Sorry for taking so long to answer.

Yes, I used the Kubernetes template in contrib/kubernetes.

hostPID: true
hostNetwork: true

I also tried a more permissive pod security configuration. This is my config:

hostPID: true
hostNetwork: true
hostIPC: true

securityContext:
  privileged: true
  runAsUser: 0
  allowPrivilegeEscalation: true

It didn't work.

I also tried on a CentOS host with a plain Docker container and got the same issue.

env:

host: centos7
container: centos7

I ran the Docker container like this:

docker run -it --privileged -v /var/run/netns:/host/run docker.io/centos /bin/bash

Then I added a network namespace on the host (CentOS 7):

# ip netns add net1

I used the stat command to check this.

On the host:

# stat --format=%d /var/run/netns
22
# stat --format=%d /var/run/netns/net1
3

But in the container:

# stat --format=%d /host/run
22
# stat --format=%d /host/run/net1
22

Note that net1's device number differs between the host and the container.

lebauce commented 3 years ago

@waterjiao Hello. Sorry for the long delay.

On my CentOS 7 VM, I get the same results in the container as on the host. What storage driver are you using? Is it overlayfs?