ipspace / netlab

Making virtual networking labs suck less
https://netlab.tools

[BUG] lag module fails in case of Multiprovider labs #1581

Open jbemmel opened 4 days ago

jbemmel commented 4 days ago

Just a note that multi-provider labs don't currently work with the lag module, because libvirt changes the link type from "lag" to "lan".

groups:
 _auto_create: True
 core:
  members: [ c1, c2 ]
  device: dellos10
  provider: clab
  module: [ lag ]
 edge:
  members: [ e1, e2 ]
  device: cumulus_nvue
  provider: libvirt
  module: [ lag ]
 hosts:
  members: [ h1, h2, h3, h4 ]
  device: linux

links:
- lag:
   mlag.peergroup: 1
   members: [c1-c2]
- lag:
   mlag: True
   members: [e1-c1,e1-c2]
- lag:
   mlag: True
   members: [e2-c1,e2-c2]
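
A minimal sketch of how the affected LAGs could be spotted before starting the lab. This is not netlab code: the dictionaries are a hand-copied subset of the topology above, `node_providers` and `multi_provider_lags` are hypothetical helpers, and treating `libvirt` as the provider for groups that don't set one is an assumption.

```python
# Hypothetical helpers, not part of netlab: flag LAGs whose member links
# connect nodes that run under different providers.

DEFAULT_PROVIDER = "libvirt"  # assumption: provider used when a group sets none

# Hand-copied subset of the topology in this issue
groups = {
    "core": {"members": ["c1", "c2"], "provider": "clab"},
    "edge": {"members": ["e1", "e2"], "provider": "libvirt"},
    "hosts": {"members": ["h1", "h2", "h3", "h4"]},
}

lag_links = [
    ["c1-c2"],
    ["e1-c1", "e1-c2"],
    ["e2-c1", "e2-c2"],
]


def node_providers(groups, default=DEFAULT_PROVIDER):
    """Map every node name to the provider of the group it belongs to."""
    return {
        node: group.get("provider", default)
        for group in groups.values()
        for node in group["members"]
    }


def multi_provider_lags(lag_links, providers):
    """Return the LAGs with at least one member link crossing providers."""
    flagged = []
    for members in lag_links:
        for link in members:
            a, b = link.split("-")
            if providers[a] != providers[b]:
                flagged.append(members)
                break  # one crossing member link is enough to flag the LAG
    return flagged


providers = node_providers(groups)
for lag in multi_provider_lags(lag_links, providers):
    print(f"warning: LAG {lag} spans providers; LACP may not come up")
```

With the topology above, the sketch flags the two edge-to-core LAGs and leaves the c1-c2 peer link alone, because both core nodes run under clab.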

ipspace commented 4 days ago

Even if we get that fixed (and after the recent SNAFU I caused in bf8ac79e594499fd4623c6242baddb2a337fe614 and fixed in e58b8eb8829233451159575e34680cb2fc8e7bbc, I'm a bit reluctant to try to work around this limitation), we don't know whether LACP works over Linux bridges, so it might be worth testing that first.

If my hunch is right and LACP does not work across bridges, then we just have to document the caveat in the LAG module documentation and move on.

jbemmel commented 4 days ago

> If my hunch is right and LACP does not work across bridges, then we just have to document the caveat in the LAG module documentation and move on.

Your 2020 article still applies - see https://github.com/ipspace/netlab/pull/1582

LACP gets discarded, but FRR happily creates the bond and doesn't alert the user that LACP isn't working. Cumulus declares the bond status as 'DOWN' on the libvirt side of the link; not sure if this "MII status" is due to LACP failing.

I can add a quirk to the lag module warning about this limitation of LACP with Linux bridges. However, static bonding could still work in multi-provider cases.
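
To illustrate why the LACP frames never reach the other side, here is a simplified Python model of the reserved-address filter in a Linux bridge. It rests on two assumptions: LACPDUs go to the Slow Protocols address 01:80:C2:00:00:02, and a mainline kernel does not let `group_fwd_mask` enable bits 0 to 2. The function names are made up for this sketch, and the model ignores kernel special cases such as STP frames on a bridge with STP disabled.

```python
# Simplified model of how a Linux bridge treats link-local reserved
# addresses (01:80:C2:00:00:00 through 01:80:C2:00:00:0F). Illustrative only.

LACP_DMAC = "01:80:c2:00:00:02"  # Slow Protocols multicast address used by LACP
RESTRICTED_MASK = 0x0007         # assumption: bits 0-2 cannot be enabled in mainline kernels


def reserved_index(dmac):
    """Return N for 01:80:C2:00:00:0N, or None for any other address."""
    octets = [int(part, 16) for part in dmac.split(":")]
    if octets[:5] == [0x01, 0x80, 0xC2, 0x00, 0x00] and octets[5] <= 0x0F:
        return octets[5]
    return None


def bridge_forwards(dmac, group_fwd_mask=0):
    """Model whether the bridge forwards a frame with this destination MAC."""
    index = reserved_index(dmac)
    if index is None:
        return True  # ordinary traffic is forwarded as usual
    # Restricted bits are ignored here; the real kernel rejects such a mask
    allowed = group_fwd_mask & ~RESTRICTED_MASK
    return bool(allowed & (1 << index))


print(bridge_forwards(LACP_DMAC))          # default mask
print(bridge_forwards(LACP_DMAC, 0xFFFF))  # even with every bit requested
```

In this model, both calls report that the LACP frame is not forwarded, which matches the behavior seen on the libvirt side of the link.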

ipspace commented 4 days ago

> LACP gets discarded, but FRR happily creates the bond and doesn't alert the user that LACP isn't working. Cumulus declares the bond status as 'DOWN' on the libvirt side of the link; not sure if this "MII status" is due to LACP failing.

Let's just say that LAGs without LACP are as reliable as parallel static routes pointing to multiple uplinks, and adding a bridge in the middle only makes it worse.

Alas, unpleasant facts never stopped "creative" people.

> I can add a quirk to the lag module warning about this limitation of LACP with Linux bridges.

Your choice.

> However, static bonding could still work in multi-provider cases.

Or you could do this, but then it's time to go back to the drawing board and get rid of the "lag" link type. I never expected it to be used in a multi-provider environment because I knew LACP would not work ;)