srl-labs / containerlab

container-based networking labs
https://containerlab.dev
BSD 3-Clause "New" or "Revised" License

Delete host interface when container is destroyed #842

Closed bjmeuer closed 2 years ago

bjmeuer commented 2 years ago

When a lab is destroyed, it takes about 2 minutes until the host-side interfaces of a container are removed from the containerlab host. The host interfaces of a container should be removed immediately when the container is destroyed.

Example interfaces defined in the topology file:

links:
     - endpoints: ["BorderLeaf_3:eth5", "host:vx102-eth3"]
     - endpoints: ["BorderLeaf_3:eth6", "host:xxx"]
LimeHat commented 2 years ago

Hi @bjmeuer, could you please provide a bit more detail about your environment: OS/kernel version, Docker version? Do you get any errors or warnings during the destroy procedure?

I have a feeling that this is something docker-related, but let's see.

bjmeuer commented 2 years ago


Hi @LimeHat, here is the requested information about the environment:

Linux 5.4.0-104-generic #118-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
Docker version 20.10.7, build f0df350

There are no errors; the destroy removes the containers, but the interfaces of a container that had an interface defined with "host:..." in the topology are removed only about 2 minutes after the container itself was removed. This could be a Docker thing, but maybe there is a possibility to trigger the removal from containerlab at the same time the container is removed.

I also tried with the latest docker-ce 20.10.14, with the same result.

LimeHat commented 2 years ago

Hmm. I'm unable to reproduce this.

Can you please check a few more things:

  1. While your lab is up & running, grab the PID of a container that terminates the remote end of such a link (e.g. BorderLeaf_3 from the example above): sudo docker inspect -f '{{.State.Pid}}' BorderLeaf_3
  2. Check the ns list via sudo ls -la /proc/<PID>/ns
  3. Destroy the lab, and within those 2 minutes,
  4. Repeat step #2: sudo ls -la /proc/<PID>/ns
  5. Check if there are any container leftovers in docker: sudo docker container ls -a
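In case it helps, here are the same checks as plain commands (container name taken from the example above; substitute the actual PID):

# 1) grab the PID of the container that terminates the remote end of such a link
sudo docker inspect -f '{{.State.Pid}}' BorderLeaf_3
# 2) list its namespaces while the lab is up
sudo ls -la /proc/<PID>/ns
# 3) destroy the lab, then within the ~2-minute window:
# 4) repeat the namespace check
sudo ls -la /proc/<PID>/ns
# 5) look for any container leftovers in docker
sudo docker container ls -a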

Is this a large topology? What if you do this with just a single node?

bjmeuer commented 2 years ago

Here are the outputs you requested.

Running lab

$ sudo docker inspect -f '{{.State.Pid}}' AVD-BorderLeaf_1
20321

$ sudo ls -la /proc/20321/ns
total 0
dr-x--x--x 2 root root 0 Apr  8 05:47 .
dr-xr-xr-x 9 root root 0 Apr  8 05:47 ..
lrwxrwxrwx 1 root root 0 Apr  8 05:50 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Apr  8 05:47 ipc -> 'ipc:[4026532793]'
lrwxrwxrwx 1 root root 0 Apr  8 05:47 mnt -> 'mnt:[4026532791]'
lrwxrwxrwx 1 root root 0 Apr  8 05:47 net -> 'net:[4026532796]'
lrwxrwxrwx 1 root root 0 Apr  8 05:47 pid -> 'pid:[4026532794]'
lrwxrwxrwx 1 root root 0 Apr  8 05:50 pid_for_children -> 'pid:[4026532794]'
lrwxrwxrwx 1 root root 0 Apr  8 05:50 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Apr  8 05:47 uts -> 'uts:[4026532792]'

Destroyed lab (within 2 minutes)

$ sudo docker inspect -f '{{.State.Pid}}' AVD-BorderLeaf_1
Error: No such object: AVD-BorderLeaf_1

$ sudo ls -la /proc/20321/ns
ls: cannot access '/proc/20321/ns': No such file or directory

$ ip link | grep vx
864: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
866: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
868: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
870: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
872: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
874: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
876: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
878: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
880: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
882: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
884: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
886: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
890: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
892: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
894: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
896: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
898: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
902: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
910: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default
914: ***@***.***: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9500 qdisc noqueue state UP mode DEFAULT group default

The last output shows that even after destroying the lab, the veth ends in the root namespace are not removed. The corresponding veth end in the container namespace is removed (or at least the whole container namespace is removed).

You are right that it has something to do with my topology. I did several tests now, and it seems that as soon as I have the binds and the startup-config defined, it takes longer to remove the interfaces. So this may be something we can't really influence from containerlab itself, and we have to take care in our automation to wait for the interfaces to be cleaned up, or handle the deletion ourselves with the "ip link delete" command.
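Something along these lines is what I have in mind for our automation (just a rough sketch; the vx prefix matches the host interface names used in my topology):

# after clab destroy, remove any leftover host-side veth ends whose names start with vx
for intf in $(ip -o link show | awk -F': ' '{print $2}' | cut -d@ -f1 | grep '^vx'); do
  sudo ip link delete "$intf"
done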

For your reference, this is the topology I'm using:

---
name: AVD
prefix: __lab-name

mgmt:
  network: MGMT
  ipv4_subnet: 192.168.111.0/24

topology:
  nodes:
    BorderLeaf_1:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.143
      kind: ceos
      startup-config: CL_2_configs/BorderLeaf_1.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/BorderLeaf_1.json:/mnt/flash/EosIntfMapping.json:ro
    BorderLeaf_4:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.146
      kind: ceos
      startup-config: CL_2_configs/BorderLeaf_4.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/BorderLeaf_4.json:/mnt/flash/EosIntfMapping.json:ro
    Leaf_1:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.137
      kind: ceos
      startup-config: CL_2_configs/Leaf_1.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/Leaf_1.json:/mnt/flash/EosIntfMapping.json:ro
    Leaf_4:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.140
      kind: ceos
      startup-config: CL_2_configs/Leaf_4.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/Leaf_4.json:/mnt/flash/EosIntfMapping.json:ro
    Spine_1:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.131
      kind: ceos
      startup-config: CL_2_configs/Spine_1.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/Spine_1.json:/mnt/flash/EosIntfMapping.json:ro
      #startup-delay: 30
    Spine_4:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.134
      kind: ceos
      startup-config: CL_2_configs/Spine_4.cfg
      enforce-startup-config: true
      binds:
        - CL_2_mappings/Spine_4.json:/mnt/flash/EosIntfMapping.json:ro

  links:
    - endpoints: ["BorderLeaf_1:eth6", "BorderLeaf_4:eth3"]
    - endpoints: ["BorderLeaf_1:eth9", "Spine_1:eth3"]
    - endpoints: ["BorderLeaf_4:eth8", "Spine_4:eth4"]
    - endpoints: ["Leaf_1:eth1", "Spine_1:eth1"]
    - endpoints: ["Leaf_4:eth2", "Spine_4:eth2"]
    - endpoints: ["Spine_4:eth3", "host:vx105-eth8"]
    - endpoints: ["Spine_4:eth1", "host:vx109-eth2"]
    - endpoints: ["BorderLeaf_1:eth1", "host:vx113-dummy1"]
    - endpoints: ["BorderLeaf_1:eth2", "host:vx114-dummy2"]
    - endpoints: ["BorderLeaf_1:eth3", "host:vx115-dummy3"]
    - endpoints: ["BorderLeaf_1:eth4", "host:vx116-dummy4"]
    - endpoints: ["BorderLeaf_1:eth5", "host:vx117-eth3"]
    - endpoints: ["BorderLeaf_1:eth7", "host:vx119-eth1"]
    - endpoints: ["BorderLeaf_1:eth8", "host:vx120-eth1"]
    - endpoints: ["BorderLeaf_1:eth10", "host:vx122-eth3"]
    - endpoints: ["BorderLeaf_4:eth1", "host:vx123-dummy1"]
    - endpoints: ["BorderLeaf_4:eth2", "host:vx124-dummy2"]
    - endpoints: ["BorderLeaf_4:eth5", "host:vx125-eth4"]
    - endpoints: ["BorderLeaf_4:eth6", "host:vx126-eth4"]
    - endpoints: ["BorderLeaf_4:eth7", "host:vx127-eth4"]
    - endpoints: ["Leaf_1:eth2", "host:vx130-eth1"]
    - endpoints: ["Leaf_4:eth1", "host:vx131-eth2"]
    - endpoints: ["BorderLeaf_4:eth4", "host:vx138-eth6"]
    - endpoints: ["Spine_1:eth4", "host:vx141-eth9"]
    - endpoints: ["Spine_1:eth2", "host:vx145-eth1"]

Thanks


hellt commented 2 years ago

Yep, indeed, host interfaces are getting deleted nicely right now. I remember, though, that I saw this once or twice on my system as well, but never bothered to dig deeper.

LimeHat commented 2 years ago

Hmm. Interesting. Can you try to run destroy with the --graceful and --debug parameters to see if that makes any difference? And then attach the debug log from that run.

I did several tests now and it seems as soon as I have the binds and the startup-config defined it happens to take longer to remove the interfaces.

But even without binds, it takes more than a few seconds? What if you deploy/destroy just 1 node?

Also, when you have these lingering veths after the lab is destroyed, can you grab the output of grep docker /proc/mounts?
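Roughly, what I have in mind (substituting your topology file name from the earlier message):

sudo clab destroy -t CL_2_custom_topology.yml --graceful --debug
# while the veths are still lingering after the destroy:
grep docker /proc/mounts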

hellt commented 2 years ago

I am closing this, as it seems we can't reproduce it. Feel free to reopen with more data, or we will reopen should we stumble upon it again.

bjmeuer commented 2 years ago

Hey Sergey,

sorry for the late reply, I could not test it over the last few days.

Attached are the logs you requested.

I tried it also with only 1 node and it works fine, so I think it can happen in scaled deployments. It might be that docker just needs some time to clean it all up...

Thanks for your help. Bjoern


$ sudo clab destroy -t CL_2_custom_topology.yml --cleanup --debug --graceful
DEBU[0000] We got the following topos struct for destroy: map[CL_2_custom_topology.yml:{}]
DEBU[0000] going through extracted topos for destroy, got a topo file CL_2_custom_topology.yml and generated opts list [0x1d5b200 0x1d21380 0x1d219e0]
DEBU[0000] envN runtime var value is
DEBU[0000] Running runtime.Init with params &{Timeout:2m0s GracefulShutdown:true Debug:true KeepMgmtNet:false} and &{Network: Bridge: IPv4Subnet: IPv4Gw: IPv6Subnet: IPv6Gw: MTU: ExternalAccess:}
DEBU[0000] Runtime: Docker
DEBU[0000] initialized a runtime with params &{config:{Timeout:120000000000 GracefulShutdown:true Debug:true KeepMgmtNet:false} Client:0xc000124c80 mgmt:0xc000124680}
DEBU[0000] template variables:
DEBU[0000] topology:

name: AVD

prefix: __lab-name

mgmt:
  network: MGMT
  ipv4_subnet: 192.168.111.0/24

topology:
  nodes:
    BorderLeaf_1:
      image: ceos-20220210:4.28.0F
      mgmt_ipv4: 192.168.111.143
      kind: ceos
      startup-config: CL_2_configs/BorderLeaf_1.cfg
      enforce-startup-config: true
      binds:

DEBU[0000] method initMgmtNetwork was called mgmt params &{Network:MGMT Bridge: IPv4Subnet:192.168.111.0/24 IPv4Gw: IPv6Subnet: IPv6Gw: MTU: ExternalAccess:} DEBU[0000] New mgmt params are &{Network:MGMT Bridge: IPv4Subnet:192.168.111.0/24 IPv4Gw: IPv6Subnet: IPv6Gw: MTU:1500 ExternalAccess:0xc00024726f} INFO[0000] Parsing & checking topology file: CL_2_custom_topology.yml DEBU[0000] node config: &{ShortName:BorderLeaf_1 LongName:AVD-BorderLeaf_1 Fqdn:BorderLeaf_1.AVD.io LabDir:/home/testuser/AVD/clab-AVD/BorderLeaf_1 Index:0 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.143 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] node config: &{ShortName:BorderLeaf_4 LongName:AVD-BorderLeaf_4 Fqdn:BorderLeaf_4.AVD.io LabDir:/home/testuser/AVD/clab-AVD/BorderLeaf_4 Index:1 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.146 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] node config: &{ShortName:Leaf_1 LongName:AVD-Leaf_1 Fqdn:Leaf_1.AVD.io LabDir:/home/testuser/AVD/clab-AVD/Leaf_1 Index:2 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.137 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] node config: &{ShortName:Leaf_4 LongName:AVD-Leaf_4 Fqdn:Leaf_4.AVD.io LabDir:/home/testuser/AVD/clab-AVD/Leaf_4 Index:3 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.140 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] node config: &{ShortName:Spine_1 LongName:AVD-Spine_1 Fqdn:Spine_1.AVD.io LabDir:/home/testuser/AVD/clab-AVD/Spine_1 Index:4 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: 
Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.131 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] node config: &{ShortName:Spine_4 LongName:AVD-Spine_4 Fqdn:Spine_4.AVD.io LabDir:/home/testuser/AVD/clab-AVD/Spine_4 Index:5 Group: Kind:ceos StartupConfig: StartupDelay:0 EnforceStartupConfig:false ResStartupConfig: Config: ResConfig: NodeType: Position: License: Image:ceos-20220210:4.28.0F Sysctls:map[] User: Entrypoint: Cmd: Exec:[] Env:map[] Binds:[] PortBindings:map[] PortSet:map[] NetworkMode: MgmtNet: MgmtIntf: MgmtIPv4Address:192.168.111.134 MgmtIPv4PrefixLength:0 MgmtIPv6Address: MgmtIPv6PrefixLength:0 MacAddress: ContainerID: TLSCert: TLSKey: TLSAnchor: NSPath: Publish:[] ExtraHosts:[] Labels:map[] Endpoints:[] Sandbox: Kernel: Runtime: CPU:0 CPUSet: Memory: DeploymentStatus: Extras:} DEBU[0000] Filterstring: containerlab=AVD
INFO[0000] Destroying lab: AVD
INFO[0000] Stopping container: AVD-BorderLeaf_1
INFO[0000] Stopping container: AVD-Spine_4
INFO[0000] Stopping container: AVD-BorderLeaf_4
INFO[0000] Stopping container: AVD-Leaf_4
INFO[0000] Stopping container: AVD-Spine_1
INFO[0000] Stopping container: AVD-Leaf_1
DEBU[0122] Removing container: AVD-Spine_1
DEBU[0122] Removing container: AVD-Leaf_1
DEBU[0122] Removing container: AVD-Spine_4
DEBU[0122] Removing container: AVD-BorderLeaf_4
DEBU[0122] Removing container: AVD-BorderLeaf_1
INFO[0122] Removed container: AVD-Spine_4
DEBU[0122] Worker 4 terminating...
DEBU[0122] Removing container: AVD-Leaf_4
INFO[0122] Removed container: AVD-Leaf_1
DEBU[0122] Worker 1 terminating...
INFO[0122] Removed container: AVD-BorderLeaf_1
DEBU[0122] Worker 5 terminating...
INFO[0122] Removed container: AVD-BorderLeaf_4
DEBU[0122] Worker 0 terminating...
INFO[0122] Removed container: AVD-Spine_1
DEBU[0122] Worker 3 terminating...
INFO[0122] Removed container: AVD-Leaf_4
DEBU[0122] Worker 2 terminating...
INFO[0122] Removing containerlab host entries from /etc/hosts file
DEBU[0122] Calling DeleteNet method. *CLab.Config.Mgmt value is: &{Network:MGMT Bridge: IPv4Subnet:192.168.111.0/24 IPv4Gw: IPv6Subnet: IPv6Gw: MTU:1500 ExternalAccess:0xc00024726f}
DEBU[0123] Removing clab iptables rules for bridge "br-c781dbef597a"
DEBU[0123] Deleting AVD-BorderLeaf_1 network namespace
DEBU[0123] Deleting netns symlink: AVD-BorderLeaf_1
DEBU[0123] Deleting AVD-BorderLeaf_4 network namespace
DEBU[0123] Deleting netns symlink: AVD-BorderLeaf_4
DEBU[0123] Deleting AVD-Leaf_1 network namespace
DEBU[0123] Deleting netns symlink: AVD-Leaf_1
DEBU[0123] Deleting AVD-Leaf_4 network namespace
DEBU[0123] Deleting netns symlink: AVD-Leaf_4
DEBU[0123] Deleting AVD-Spine_1 network namespace
DEBU[0123] Deleting netns symlink: AVD-Spine_1
DEBU[0123] Deleting AVD-Spine_4 network namespace
DEBU[0123] Deleting netns symlink: AVD-Spine_4

LimeHat commented 2 years ago

Thanks Bjoern, no worries.

I think I know why it takes ~2 minutes, but I'm a bit puzzled by the fact that it happens sporadically.

In Docker's libnetwork there is a garbage-collection mechanism that kicks in after a 60-second delay: https://github.com/moby/libnetwork/blob/339b972b464ee3d401b5788b2af9e31d09d6b7da/osl/namespace_linux.go#L37. At least, this is my best guess.

But I'm not sure what exactly triggers the GC path instead of an immediate netns removal; the mere presence of those veths is not enough, and it works fine in smaller/simpler setups. It appears that either something works differently at scale (but it kind of shouldn't, since each container is a separate netns anyway; unless the number of links plays a role), or something happens inside the container (mounts, etc.) that causes the blocking behavior (but in that case you should see it with a single-node topo, if you use the same config options).

Anyhow. If this is something you guys need (re-deployment within 2-minute intervals), I think we can add an additional cleanup mechanism to containerlab. Not the prettiest option, but it should be doable. If you don't mind me asking (just curious), what's the use case? Some kind of automated CI/CD pipeline?

hellt commented 2 years ago

@LimeHat the cleanup mechanism that you're thinking of is along the lines of removing the link with netlink, should it be detected as still present after we destroyed the containers?

LimeHat commented 2 years ago

I think the obvious way to do it is just to parse the topology, extract the veths that are placed in the host ns (similar to how we do it for veth initialization / netns placement today), and then call netlink del before we delete the containers. (Thinking about it now, though, I'm not sure if netlink will actually allow this removal to happen unconditionally.)

hellt commented 2 years ago

I think if we do this after we remove the containers, then we can leverage the already-parsed topology info to go and try to remove the potentially stale host links. That way the containers should already be gone, and we can remove the host links after first checking that each link is indeed still dangling.
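To illustrate the idea from the operator's side (the eventual containerlab change would do this in Go via netlink; this is just a rough shell sketch of the same logic, with an example topology file name):

# after the containers are gone, pull the host-side interface names out of the
# "host:..." endpoints of the topology and delete the ones that are still
# present (i.e. dangling) in the root namespace
topo=CL_2_custom_topology.yml
for intf in $(grep -oP 'host:\K[^"]+' "$topo"); do
  ip link show "$intf" >/dev/null 2>&1 && sudo ip link delete "$intf"
done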

hellt commented 2 years ago

Hi @bjmeuer @LimeHat, removal of dangling veths has been added. Do you have a chance to test it with the topo that experienced the original issue, using the beta build?

This will get you a build in your PWD

docker run --rm -v $(pwd):/workspace ghcr.io/oras-project/oras:v0.12.0 pull ttl.sh/clab-d687094a:1d
hellt commented 2 years ago

The (potential) fix has been merged. Let's re-open this one if the issue persists with the soon-to-be-released 0.26.0 version.

bjmeuer commented 2 years ago

Hey @LimeHat, @hellt

I loaded the version with the fix, compiled it, and it works perfectly with the topology I have. The interfaces are cleaned up immediately after the containers are brought down.

Thanks a lot for the quick help here.

Best regards, Bjoern
