noxrepo / pox

The POX network software platform
https://noxrepo.github.io/pox-doc/html/
Apache License 2.0

ip_loadbalancer issue #293

Closed: tomekmaz723 closed this issue 4 months ago

tomekmaz723 commented 9 months ago

[screenshot: openflow-new-page]

I set up such a configuration, but it's not working as expected (in Mininet everything is fine). All the VMs are in different LAN segments (VMware). When I run ./pox.py --verbose forwarding.l2_learning it works as expected and I can ping from the host to the servers, but when I run ./pox.py --verbose misc.ip_loadbalancer --ip=10.0.1.1 --servers=10.0.0.1,10.0.0.2 the log never shows "Server XX up".

Bridge br0
    Controller "tcp:10.0.0.100:6633"
    fail_mode: secure
    Port ens33
        Interface ens33
    Port ens37
        Interface ens37
    Port br0
        Interface br0
            type: internal
    Port ens38
        Interface ens38
ovs_version: "2.17.8"

Do you have any idea what could be wrong, or how I can fix the ip_loadbalancer module?

MurphyMc commented 9 months ago

Are the vlans visible to OpenFlow (the ports in trunked mode or whatever)?

If so, that's the problem. ip_loadbalancer doesn't know anything about vlans and doesn't know what vlan anything is on. This will probably show up in a number of ways, but the first one is that _do_probe() constructs ARP packets to probe the servers, and the ARP packets it constructs don't have VLAN headers, so the server will never see them and respond.
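The point about _do_probe() can be seen at the byte level. The sketch below is plain stdlib Python, not POX's packet library, and the MAC/IP values are made up; it builds a minimal ARP request and shows that tagging it for a VLAN means inserting a 4-byte 802.1Q header after the source MAC, a step the load balancer's probes never perform. A trunk port expecting tagged traffic for that VLAN will therefore never deliver the untagged probe to the server.

```python
import struct

ETH_TYPE_ARP = 0x0806
ETH_TYPE_VLAN = 0x8100  # 802.1Q TPID

def arp_request(src_mac, src_ip, dst_ip):
    """Build a minimal Ethernet broadcast frame carrying an ARP who-has."""
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", ETH_TYPE_ARP)
    # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + src_ip + b"\x00" * 6 + dst_ip  # sender / target addresses
    return eth + arp

def add_vlan_tag(frame, vid):
    """Insert a 4-byte 802.1Q tag right after the dst+src MACs (bytes 0-11)."""
    tag = struct.pack("!HH", ETH_TYPE_VLAN, vid & 0x0FFF)  # TPID + (PCP/DEI=0, VID)
    return frame[:12] + tag + frame[12:]

# Example addresses (made up for illustration)
untagged = arp_request(b"\x00\x0c\x29\x00\x07\x74",
                       b"\x0a\x00\x01\x01", b"\x0a\x00\x00\x01")
tagged = add_vlan_tag(untagged, 9)  # what a VLAN-9 trunk would expect
```

The only difference is those four bytes, but without them the frame's EtherType is 0x0806 (ARP) instead of 0x8100 (802.1Q), so the switch never associates the probe with any VLAN.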

Easy solution: use VLAN access ports instead of trunk ports.

Harder solution: modify ip_loadbalancer so it has VLAN intelligence (if done nicely and backwards compatibly, this would probably get merged if you wanted to share it). I suspect the right thing would be to have --servers also take an optional VLAN number for each server, e.g., --servers=10.0.0.1#16,10.0.0.2#18.
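A sketch of how that suggested --servers syntax might be parsed. This is purely hypothetical: POX has no such option today, and parse_servers and the "#" separator are just the suggestion above, not an existing API.

```python
def parse_servers(spec):
    """Parse a hypothetical --servers value where each entry is either
    "IP" or "IP#VLAN", e.g. "10.0.0.1#16,10.0.0.2" becomes
    [("10.0.0.1", 16), ("10.0.0.2", None)], None meaning "no VLAN"."""
    servers = []
    for entry in spec.split(","):
        ip, sep, vlan = entry.partition("#")
        servers.append((ip, int(vlan) if sep else None))
    return servers
```

With per-server VLAN numbers available, the probe-construction code could then add an 802.1Q header only for servers that have one, keeping the current behavior for untagged servers (which is what "backwards compatibly" would require).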

tomekmaz723 commented 9 months ago

Hi, thanks for the quick response! The configuration: [screenshot]

OVS <-> S1: ovs-vsctl add-port br0 ens37
OVS <-> Controller: just IP addresses configured (when I added ovs-vsctl add-port br0 ens39, even forwarding.l2_learning didn't work)

So I tried your idea:

OVS <-> S1: ovs-vsctl add-port br0 ens37 tag=9
OVS <-> Controller: ovs-vsctl add-port br0 ens39 tag=9

(I lost the connection with the controller, so I went back to the configuration without access/trunk tags; it still didn't work, though basic forwarding was fine.)

To me it looks like ARP is working: [screenshot]

(Same issue this guy hit 10 years ago :D - https://pox-dev.noxrepo.narkive.com/zTPw06MB/how-to-use-module-misc-ip-loadbalancer) By the way, I'm getting these errors: [screenshot]

Do you maybe have an idea for a workaround, or how to configure OVS (I think the issue is there)?

MurphyMc commented 9 months ago

So OVS can reach the controller normally; seems like no need to do anything special there.

I don't know if H1 is on a VLAN or not, but perhaps it doesn't matter for the moment. The first goal should just be getting the "Server X up" message for a single server. Cool.

You say it looks like ARP is working, and I agree that it looks like S1 is seeing the request and responding, though we don't know if POX is seeing the response (and it's hard to tell for sure without being able to see all the packet details; I'm at least assuming 00:0c:29:86:22:16 is really S1's MAC).

You don't mention whether that screen grab is from the original trunk config or the new access config. I'm guessing the access config. If you run it in the old (trunk) config and capture with Wireshark, does it look like ARP is working? I am guessing it won't have any replies. If that's true, then I think we're on to something with the access setup.

If that's all true, maybe the next thing I'd try is running POX's info.packet_dump component. Maybe info.packet_dump --verbose --show? I haven't used this for a long time; hopefully it still works. This should show what packets are actually getting to POX. In an ideal world, it'll be the ARP responses with no VLAN tag. But it currently seems like POX isn't seeing what it's expecting (or we'd see the server up message), so I'm not sure what we'll see.

tomekmaz723 commented 9 months ago

I tried this, but despite seeing "packet dumper running" I didn't get anything; I also checked the logs but couldn't find it. [screenshot]

MurphyMc commented 9 months ago

Try setting the log level to debug, e.g., log.level --DEBUG (or samples.pretty_log --DEBUG), which is probably a good idea anyway since things aren't working. In fact, put that first.

MurphyMc commented 9 months ago

(And I'd really suggest upgrading to at least POX halosaur.)

tomekmaz723 commented 9 months ago

Old POX: [screenshot]

POX halosaur: [screenshot]

Mininet: [screenshot]

MurphyMc commented 9 months ago

I think there's also a flag for packet_dump to set the line width so that lines don't get cut off, which might be nice here. But even from what we can see, I think we have some data.

  1. The DNS parsing error you were seeing in gar is indeed fixed.
  2. What we see in mininet here is what we expect. POX is sending the ARP requests, and these are the replies coming back.
  3. Packets from S1 in the real network do not have VLAN tags when they hit OVS.
  4. There are no ARP replies to the load balancer probes.

Based on 3, this seems like the second screen grab was made with the access port configuration of OVS, not the trunk configuration. I think you should rerun the same experiment as your second screen grab, but using the old trunked port configuration. I think we expect to see the output of packet_dump showing packets with VLAN tags for VLAN 9. Let's confirm that.

Item 4 is surely a problem. We know that ARP replies aren't getting to POX now. So now the question is... why not? The two likely broad answers are: A) the queries aren't getting to S1 in the first place (maybe not getting there at all or maybe getting there on the wrong VLAN or something) or B) the responses aren't getting all the way back to OVS. Rerun the experiment from the second screen grab with the S1 interface (ens37) in access mode. Wireshark ens37 directly (not via "all"). Do you see the queries go out? Do they have tags? Do the responses come back?

tomekmaz723 commented 9 months ago

I made the topology simpler:

OVS - S1 (ens33 - ens33)
OVS - C (ens37 - ens37)

  1. I have this configuration (trunk): [screenshot]

OVS - S1: [screenshot] (no tag, normal request/response)

///////////// Configuration: [screenshot]

OVS - S1: [screenshot] (no tag)

As you said, the ARP packets are not reaching the controller.

What I noticed: [screenshot] This MAC in "Load Balancing on" is the MAC of OVS ens33.

MurphyMc commented 9 months ago

There's still some stuff with the VLANs which I am not understanding, but yeah, it seems like the ARPs aren't getting to the controller, and I don't immediately know why. They should be showing up due to packet_dump. I assume you ARE seeing some other packets and they're just not in that last screen grab. I mean, I think we know some DNS packets are getting to the controller anyway, since you used to be getting errors/warnings about them.

So this doesn't seem like it should be it, but try editing proto/arp_responder.py. There's a global variable _install_flow which defaults to None. Change it to True. See if that helps.

Another thing, I guess, is to confirm there are no firewall rules or anything that might be a problem. Maybe try dumping the ebtables and OVS tables to see if there's anything that might be eating ARPs.

tomekmaz723 commented 9 months ago

I changed it but it didn't help.

[screenshot] [screenshot]

I also tried configuring OVS and the controller on one VM (that's not my goal), but it didn't help either.

//// I checked the flows: in Mininet they look the same, but there the flows actually carry traffic, while in my case n_packets=0. [screenshot] [screenshot]

And this one OpenFlow packet is missing.

MurphyMc commented 9 months ago

Could you try this with all the VLAN stuff disabled?

Also, what's your deployment here? Is this all VMs? All on the same physical machine? Or something else?

MurphyMc commented 9 months ago

Ah, I had a thought. It's been a while since I've worked much with OpenFlow, so it took me a minute. I wonder if the problem is OVS being in in-band-control mode.

ip_loadbalancer pretends to be a host a bit. It needs an Ethernet address for that. It makes one up based on the switch's DPID, which is usually the Ethernet address of its main interface. The one ending in :07:74 in your setup above. That's the source address for its ARP probes (and where the replies should come back to).

However, in in-band mode, OVS needs to do some ARP handling. To do this, it sets up some hidden flows, as described in the In-Band Control section of the OVS docs. This includes "ARP replies to the local port’s MAC address". I think that's the problem here. You didn't see the table entries for this because you used ofctl, which can only see what OpenFlow sees. Use ovs-appctl bridge/dump-flows br0. I think after doing that, you'll see there are some hidden table entries stealing the ARPs.
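If you want to scan the flow dump mechanically, something like the following could help. The sample lines below are illustrative only, loosely in OpenFlow flow syntax, not real output from this setup or from ovs-appctl; the idea is just to flag ARP-matching flows whose actions deliver the packet somewhere other than the controller.

```python
# Hypothetical, abridged flow lines for illustration; a real
# "ovs-appctl bridge/dump-flows br0" dump will look different.
sample_dump = """\
table_id=0, priority=180,arp,dl_dst=00:0c:29:00:07:74,actions=LOCAL
table_id=0, priority=180,arp,arp_op=2,actions=NORMAL
table_id=0, priority=0,actions=CONTROLLER:65535
"""

def arp_stealers(dump):
    """Return flow lines that match ARP but do not forward to the
    controller; with in-band control these are the likely culprits."""
    return [line for line in dump.splitlines()
            if "arp" in line and "CONTROLLER" not in line]
```

Any hit here would explain why POX (and packet_dump) never sees the replies: the switch consumes them before OpenFlow processing ever forwards them up.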

From the topology you described, I don't think you need in-band control, so one fix here would just be turning it off: ovs-vsctl set controller br0 connection-mode=out-of-band.

Another possibility would be changing ip_loadbalancer to not use the Ethernet address that OVS is eating the ARP replies for. I think you just need to change the .mac field of the iplb instance. In ip_loadbalancer.py, in the iplb constructor, it sets self.mac = self.con.eth_addr. Just change that to something like self.mac = EthAddr('00:00:00:00:00:01') or some other arbitrary (unused!) address.
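To illustrate the two addressing choices here, a small stdlib-only sketch (the DPID value and SAFE_MAC below are made-up examples, and the exact way POX derives its DPID-based address may differ in detail):

```python
def dpid_to_mac(dpid):
    """Format the low 48 bits of a DPID as a MAC string; DPIDs are
    commonly derived from the switch's main interface MAC, which is
    why the DPID-based address collides with OVS's own port address."""
    return ":".join("%02x" % ((dpid >> (8 * i)) & 0xFF)
                    for i in range(5, -1, -1))

# Any unused, locally administered address (second hex digit of the
# first byte is 2, 6, a, or e) avoids the in-band-control collision:
SAFE_MAC = "02:00:00:00:00:01"
assert int(SAFE_MAC.split(":")[0], 16) & 0x02  # locally administered bit
```

The point of the locally administered bit is that such addresses are guaranteed not to clash with any vendor-assigned NIC address, so OVS's hidden in-band flows will not claim ARP replies sent to it.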

Or it could be something else entirely, but this is my first good guess. :)

tomekmaz723 commented 9 months ago

Hello,

ovs-vsctl set controller br0 connection-mode=out-of-band resolved the issue. Thanks a lot!! [screenshot]

///////////

By the way: [screenshot] The traffic is balanced as expected, but I'm getting continuous log messages.

MurphyMc commented 9 months ago

Great. This should probably be documented somewhere. At least there's now this issue. :)

I'm surprised you're getting multiple log messages unless there's something strange like multiple machines with the same Ethernet or IP address. Maybe this is a Python 3 regression.

If you wanted to help track this down, maybe you'd insert a little debugging code in ip_loadbalancer.py. There's a comment that says, "Ooh, new server." It keeps ending up in there even though it's not a new server. Right below that, you could add something like...

self.log.info(str(self.live_servers.items()))
self.log.info(str(arpp.protosrc in self.live_servers))
self.log.info("%s %s" % (arpp.hwsrc,inport))

.. and then we could look at the logs after some of the repeated log messages and see if it was clear what was going on.

MurphyMc commented 4 months ago

Closing due to lack of activity; feel free to reopen with new data.