ipspace / netlab

Making virtual networking labs suck less
https://netlab.tools

[BUG] VLANs don't work on IOL/IOLL2 #1381

Open ipspace opened 3 days ago

ipspace commented 3 days ago

A topology using VLANs on IOL/IOLL2 crashes during "netlab initial". The initial configuration template tries to include platform-specific VLAN configuration, and those files don't exist for iol/ioll2. We could either create symlinks or change the include logic.

Sample test scenario: tests/integration/vlan/01-vlan-bridge-single.yml

DanPartelly commented 3 days ago

This is a more serious bug. Symlinking iosvl2 results in the configuration being deployed successfully, but the interfaces come up with "no switchport". Still at work, will look into it later.

ipspace commented 3 days ago

Symlinking initial/iosvl2.vlan.j2 into initial/ioll2.vlan.j2 and vlan/iosvl2.j2 into vlan/ioll2.j2 resulted in working 01-vlan-bridge-simple.yml test. Will run the full set of VLAN integration tests once the BGP plugin ones finish.
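A sketch of the symlink approach (the template root `netsim/ansible/templates` is my assumption about the repo layout, not a verified netlab path; adjust to wherever the `initial/` and `vlan/` template directories actually live):

```shell
#!/bin/sh
set -e
# Sketch only: reuse the IOSvL2 VLAN templates for IOLL2 via symlinks.
# The template root below is an assumption, not a verified netlab path.
TPL=netsim/ansible/templates
mkdir -p "$TPL/initial" "$TPL/vlan"
ln -sf iosvl2.vlan.j2 "$TPL/initial/ioll2.vlan.j2"
ln -sf iosvl2.j2 "$TPL/vlan/ioll2.j2"
```

Relative symlink targets keep the links valid if the template tree is moved or packaged.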

IOL is a different story. It does not have the VLAN database, but it also does not work with the IOS bridging configuration. You can't even configure IEEE STP (which is a huge red flag). I would suggest we declare VLANs unsupported on IOL unless you really want to figure out how to make it work ;)

DanPartelly commented 3 days ago

Indeed, it does work. And no, I do not have an immediate itch to figure this out. I'd rather spend the time I have learning more about netlab internals and exploring the test suite. I learned a lot these past few days, and your comments were very useful, but there is much more left.

ipspace commented 3 days ago

So, I ran the VLAN integration tests for IOLL2 and all the more complex ones failed. The results are here:

https://tests.netlab.tools/_html/ioll2-clab-vlan

Unfortunately, there's not much one can do to validate the VLAN setups apart from end-to-end pings, so the errors are not particularly enlightening. If you want to fix stuff, it's best if you spin up one of the failing scenarios, figure out what's wrong, fix the config, and repeat.

ipspace commented 3 days ago

I created the ioll2_vlan branch with the initial changes. You could start from there, do additional configuration tweaks for IOLL2, and then submit the PR, either against the ioll2_vlan branch or the dev branch.

ipspace commented 3 days ago

I think I found the root cause: all IOLL2 instances have the same base MAC address (STP system ID), so the trunk ports go into blocking because the switches think they hear themselves.

No idea how to change that on IOLL2 :(

DanPartelly commented 3 days ago

How the heck did you figure that out? Anyway, I will look into options. It might be possible to change it at image startup: NETMAP, IOL startup file options, or environment variables are worth digging into. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.

jbemmel commented 2 days ago

Apparently VIRL can do it: https://learningnetwork.cisco.com/s/question/0D53i00000KszBMCAZ/change-switch-base-mac-in-virl-and-remove-management-ports-from-stp-evaluation

Otherwise, we could start with supporting at most 1 node per topology

ipspace commented 2 days ago

How the heck did you figure that out?

The trunk port was not in the list of active VLAN ports, so I started investigating. It was blocking, so STP was the culprit. STP claimed the device is the root bridge, so I started looking at STP details and found that both devices use the same system ID.

Anyway, I will look into options. It might be possible to change it at image startup: NETMAP, IOL startup file options, or environment variables are worth digging into. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.

It's definitely possible (or VIRL wouldn't be able to do it), but I couldn't figure out how.

Anyway, looking at the GNS3 code, it looks like IOL can take a node ID: the GNS3 code has "512 + id" in https://github.com/GNS3/gns3-server/blob/225779bc11a0d5a5af6aeb2c9a7642639cf3da06/gns3server/compute/iou/iou_vm.py#L776, and there's a hard-coded 513 in https://github.com/hellt/vrnetlab/blob/master/cisco/iol/docker/entrypoint.sh#L14, so... 🤔
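The arithmetic is trivial, but it explains the magic number: with the default node ID of 1, the GNS3 formula produces exactly the value hard-coded in the vrnetlab entrypoint. A sketch:

```shell
#!/bin/sh
# GNS3 computes the IOU/IOL application ID as 512 + node ID;
# node ID 1 gives 513, matching the hard-coded vrnetlab value.
node_id=1
app_id=$((512 + node_id))
echo "$app_id"
```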

ipspace commented 2 days ago

Otherwise, we could start with supporting at most 1 node per topology

@DanPartelly: I would start with a very strong caveat saying "bridge domains don't work on IOL, so we disabled VLANs, and all IOL-L2 nodes use the same System ID, so you can have only one IOL-L2 node in the bridging domain". I can also add the same caveat to integration tests.

Without that, the current state of IOL-L2 is a release show-stopper. We can't release a broken functionality that is not described in caveats.

kaelemc commented 2 days ago

How the heck did you figure that out?

The trunk port was not in the list of active VLAN ports, so I started investigating. It was blocking, so STP was the culprit. STP claimed the device is the root bridge, so I started looking at STP details and found that both devices use the same system ID.

Anyway, I will look into options. It might be possible to change it at image startup: NETMAP, IOL startup file options, or environment variables are worth digging into. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.

It's definitely possible (or VIRL wouldn't be able to do it), but I couldn't figure out how.

Anyway, looking at the GNS3 code, it looks like IOL can take a node ID: the GNS3 code has "512 + id" in https://github.com/GNS3/gns3-server/blob/225779bc11a0d5a5af6aeb2c9a7642639cf3da06/gns3server/compute/iou/iou_vm.py#L776, and there's a hard-coded 513 in https://github.com/hellt/vrnetlab/blob/master/cisco/iol/docker/entrypoint.sh#L14, so... 🤔

@ipspace Hey, I did the integration for IOL in Containerlab, and I'm currently working on a fix for this. I discussed it with @DanPartelly in the Containerlab Discord.

To sum it up, the system base MAC is derived from the PID that the IOL binary launches with. You have to set a PID when executing the IOL binary; the entrypoint script for the container statically sets the PID to 1.

NETMAP uses the PID to bind the IOL process's interfaces to UDP ports, and IOUYAP then binds the UDP ports to the Linux container interfaces (eth0, eth1, etc.).

It should be easy enough to signal a PID to the entrypoint script when launching the container in containerlab; the problem is just making sure each IOL node has a unique PID that persists across reboots.

VIRL/CML launches IOL in LXCs and has some mechanism to increment the PID that IOL launches with, to make sure there are no overlaps between the nodes.
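As a rough illustration of the direction described above (not the actual fix; `IOL_PID` is a made-up variable name, not the real vrnetlab/containerlab interface), the entrypoint could default to PID 1 but let the launcher override it per node:

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: take the PID (and thus the base MAC /
# STP system ID) from an environment variable instead of hard-coding 1.
# IOL_PID is an assumed name, not the real vrnetlab/containerlab interface.
IOL_PID="${IOL_PID:-1}"
echo "would launch IOL as PID ${IOL_PID}"
```

The launcher would still have to keep the chosen PID stable across container restarts, as noted above.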

kaelemc commented 2 days ago

FYI, @ipspace Big fan of your blog and your work.

I see in a recent commit that the docs have been edited to say Catalyst 8000v doesn't support MPLS.

Maybe you are already aware of this, but you just have to upgrade the boot license to 'advantage' or 'premier' for MPLS/SRv6 support. vrnetlab already does this with:

license boot level network-premier addon dna-premier

Since we initially boot the node during the container build process, the license is applied in the bootstrap config. Then, when the node is booted in a containerlab topology, the license will already have been applied on boot.

DanPartelly commented 2 days ago

I totally agree. Documentation was always a first-class citizen in netlab; few tools are so well documented.

When do you want to release the next version? If it is not right around the corner, maybe we can give it a few days. The weekend is here in a day, and we can work on it. I will keep in touch with @kaelemc on this issue, with his permission.

@DanPartelly: I would start with a very strong caveat saying "bridge domains don't work on IOL, so we disabled VLANs, and all IOL-L2 nodes use the same System ID, so you can have only one IOL-L2 node in the bridging domain". I can also add the same caveat to integration tests.

Without that, the current state of IOL-L2 is a release show-stopper. We can't release a broken functionality that is not described in caveats.

kaelemc commented 2 days ago

I've submitted the PRs which fix this.

Even in CML, the base of the MAC is aabb.cc00. Sadly, I don't think we can change that, but this should be enough.

https://github.com/srl-labs/containerlab/pull/2239 https://github.com/hellt/vrnetlab/pull/270


ipspace commented 2 days ago

I've submitted the PRs which fix this.

That was fast, thanks a million.

@DanPartelly: I would suggest we still add that caveat explaining what's going on (so we can push out a new release at any time), and once the new containerlab version comes out, I run the integration tests, change the containerlab release in the installation script, and we revise the caveats. OK?

ipspace commented 2 days ago

When do you want to release next version?

No rush, we don't have any major feature to push out (but have accumulated enough stuff so I'm not comfortable with a -post1 release), I just like to have my Ts crossed ;)

ipspace commented 2 days ago

FYI, @ipspace Big fan of your blog and your work.

Thank you!

I see in a recent commit that the docs have been edited to say Catalyst 8000v doesn't support MPLS.

Maybe you are already aware of this but you just have to upgrade the boot license to 'advantage' or 'premier' for MPLS/SRv6 support. vrnetlab already does this with

Thanks a million, will add to the initial configuration script (in case someone is running a Cat8K VM) and run the tests.

kaelemc commented 2 days ago

@ipspace No problem! Netlab looks really cool and could be of some use for me. I'm currently a heavy user of IOS-XR, but XRv runs too old a software version (6.x) and XRv9k is, well... too heavy.

I'm curious: how much effort do you think it would be for me to integrate XRd support into netlab?

I would say XRd is almost on par with the containerised IOL: fast boot, instant commits, and 90% feature parity with the full-fat XR VMs.

I assume it's not that much work, as XR support somewhat exists with XRv/XRv9k? Maybe just adding the relevant provider 'stuff'? (Sorry, not too familiar with the project code.)

ipspace commented 2 days ago

I'm curious, how much effort do you think it would be for me to integrate XRd support into netlab?

I think it's working: https://netlab.tools/platforms/#supported-virtualization-providers

I never tried it myself, but someone submitted XRv patches and claimed it was running for him.

kaelemc commented 2 days ago

@ipspace I meant XRd, as in: the containerised version of IOS-XR, would only work from the containerlab provider (I assume; unless someone built a VM which runs the container...).

Not the virtualised ones like XRv or XRv9k.

Unless you are saying this is already supported?

ipspace commented 2 days ago

I'm saying this should already be supported. It uses ios-xr/xrd-control-plane:7.11.1 image (obviously that can be changed) and containerlab provider.

kaelemc commented 2 days ago

I'm saying this should already be supported. It uses ios-xr/xrd-control-plane:7.11.1 image (obviously that can be changed) and containerlab provider.

Awesome, thanks. I'll give it a shot 😊. Sorry for clouding this issue with XR stuff.

DanPartelly commented 2 days ago

Yes, we should add the caveats.

  1. It needs not only the next containerlab version; it also needs the master branch of vrnetlab (after the PR lands). But that's what people use anyway.
  2. Furthermore, the bridge ID is now built with the help of a variable that increases by one for each node. Nodes are sorted alphabetically, so if the topology changes and nodes are added, or node names are changed, the internal index will change, and so will the bridge ID of the device.

I think point 2 should be documented as a caveat too.
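A minimal sketch of why point 2 matters (`index_nodes` is a hypothetical helper mimicking the alphabetical index assignment, not netlab code): adding a node that sorts ahead of the existing ones shifts every later index, and with it the bridge ID.

```shell
#!/bin/sh
# Assign a 1-based index to node names in alphabetical order,
# mimicking the per-node variable used to build the bridge ID.
index_nodes() {
  printf '%s\n' "$@" | sort | awk '{ printf "%s=%d\n", $1, NR }'
}
index_nodes s1 s2      # s1=1, s2=2
index_nodes r9 s1 s2   # r9=1, s1=2, s2=3 -- s1's bridge ID just changed
```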

I'll run more tests this evening with the more complex netlab VLAN topologies. I've run a simple test and I have RSTP up.

I would suggest we still add that caveat explaining what's going on (so we can push out a new release at any time),

DanPartelly commented 2 days ago

@ipspace I've run almost the entire VLAN test battery. The first five tests (prefixes 01 to 23) all succeed. The moment trunking was involved (starting with test prefix 31), everything went south; nothing worked anymore. If anyone has any fast ideas, I'm all ears.

ipspace commented 2 days ago

@ipspace I've run almost the entire VLAN test battery. The first five tests (prefixes 01 to 23) all succeed. The moment trunking was involved (starting with test prefix 31), everything went south; nothing worked anymore. If anyone has any fast ideas, I'm all ears.

Yes, the moment you add the second IOLL2 node the "duplicate STP system ID" kicks in. We have to wait for the vrnetlab/containerlab fixes.

ipspace commented 2 days ago
  2. Furthermore, the bridge ID is now built with the help of a variable that increases by one for each node. Nodes are sorted alphabetically, so if the topology changes and nodes are added, or node names are changed, the internal index will change, and so will the bridge ID of the device.

I think point 2 should be documented as a caveat too.

Of course we should document it (give me a day or so), but this just makes it more like real life where you never know who the root bridge will be after you add a node to the network (unless you set bridge priorities). Nonetheless, if you don't rename IOLL2 nodes, their relative order will not change, and the node with the highest MAC address will stay the same.

DanPartelly commented 2 days ago

Yes, the moment you add the second IOLL2 node the "duplicate STP system ID" kicks in. We have to wait for the vrnetlab/containerlab fixes.

Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in the forwarding state in 31-xxxx_xxx, where I spent some time.

ipspace commented 2 days ago

Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in the forwarding state in 31-xxxx_xxx, where I spent some time.

Oh, so it's worse than I thought. No further ideas at the moment, will wait for the new releases. I could rebuild the IOL container, but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

DanPartelly commented 2 days ago

but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

In theory yes, you could do that and pass a unique PID to the image. But you probably have more important things to do. I'll spend more time on it over the weekend, and we can safely wait until the next release.

kaelemc commented 2 days ago

Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in the forwarding state in 31-xxxx_xxx, where I spent some time.

Oh, so it's worse than I thought. No further ideas at the moment, will wait for the new releases. I could rebuild the IOL container, but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

Yeah, containerlab generates the NETMAP and IOUYAP files. NETMAP needs to know the PID of the IOL container so that it can do its IOL-to-container interface binding magic (with IOUYAP).

Manually changing the PID will not get you connectivity into IOL, and the ports won't work.
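For context, a NETMAP file pairs up endpoints keyed by the application ID (the PID discussed above) in `id:slot/port` form; a hypothetical fragment wiring two IOL nodes back-to-back might look like this (the syntax is my recollection of the IOU NETMAP format, not taken from the containerlab code):

```text
513:0/0 514:0/0
513:0/1 514:0/1
```

This is why changing only the PID breaks connectivity: the NETMAP entries still reference the old ID.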

You can always use the GitHub Actions build artifacts (Containerlab artifact download).


ipspace commented 1 day ago

The baseline settings and caveats are in #1390. We should merge that one to stop 'netlab initial' crashes and to disable VLANs on IOL.