What version of POX?
I believe I am using version 2.0 of the carp branch.
A couple quick things to try:
1) Upgrade to POX dart (a more recent version is always the first thing I try)
2) Pass --eat-early-packets to discovery
Thanks for the information, but after running it with the latest version (dart) I noticed that it is worse than carp. In dart, after some time (I imagine the link timeout), the nodes fail completely and errors are continuously thrown in the POX console. In carp I get some packet losses at the start, and then when I re-run my traffic generator packet losses decrease, but only if I get the correct link timeout parameter for a certain topology. I will try to play more with the parameters and let you know. Thanks.
Hi! I'm having this same problem! Is there any way to fix it?
Post the POX log and a description of what you're doing.
Hi! I'm running Mininet with a Fat Tree topology (TreeTopo, depth=3, fanout=4). Once I start the experiment, I try to create a TCP connection between every pair of nodes and exchange one small message (< 100 bytes) every second, for thirty seconds.
The problem I am having is that some of the hosts can connect, and in the pox log, I'm getting:
ERROR:openflow.of_01:[00-00-00-00-00-01 6] OpenFlow Error:
[00-00-00-00-00-01 6] Error: header:
[00-00-00-00-00-01 6] Error: version: 1
[00-00-00-00-00-01 6] Error: type: 1 (OFPT_ERROR)
[00-00-00-00-00-01 6] Error: length: 36
[00-00-00-00-00-01 6] Error: xid: 60475
[00-00-00-00-00-01 6] Error: type: OFPET_BAD_REQUEST (1)
[00-00-00-00-00-01 6] Error: code: OFPBRC_BUFFER_UNKNOWN (8)
[00-00-00-00-00-01 6] Error: datalen: 24
[00-00-00-00-00-01 6] Error: 0000: 01 0d 00 18 00 00 ec 3b 00 00 05 11 00 05 00 08 |.......;........|
[00-00-00-00-00-01 6] Error: 0010: 00 00 00 08 ff fb 00 00 |........ |
and
INFO:packet:(ipv6) warning IP packet data incomplete (114 of 149)
INFO:packet:(dns) parsing questions: next_question: truncated
My end goal is to run an internet-like topology using Mininet, with at least 100 nodes.
Post the whole log and the commandline you're using.
Sorry about that! I uploaded everything to a gist here. In the gist you will find:
I've run POX from the carp branch with the following parameters:
./pox.py forwarding.l2_learning
Please let me know if there's anything else I can do.
That commandline definitely won't work. See the following POX FAQ entry, for example: https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-DoesPOXsupporttopologieswithloops%3F
I'd suggest upgrading to POX eel, and then starting with: pox.py forwarding.l2_learning openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down
or
pox.py forwarding.l2_multi openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down
You may have some luck tweaking the discovery timeouts too.
Hi! Thanks for your answer.
I've re-run the test with the eel code and the first command you sent. It still doesn't work :(. Here's the full log.
When you say discovery timeouts, do you mean my own TCP connection timeouts or something related to POX?
Thanks!
A first note is that I think your life will improve a bit if you disable IPv6.
You might have better luck with the second command. l2_learning with a spanning tree is a pretty silly way to run a network with a whole lot of loops (like a fat tree) anyway.
I believe what's going on here is that packets are getting caught in a loop which is causing discovery to fail. I don't immediately know why this is happening, quite possibly something to do with the event handler priorities... the --eat-early-packets option is meant to prevent this, but it apparently isn't. Try adding something like the following to the very start of the PacketIn handler in the forwarding component you're using (e.g., l2_learning):
if (time.time() - event.connection.connect_time) <= 1.5 * core.openflow_discovery.send_cycle_time: return
See if that helps. The idea is to make sure it doesn't do any forwarding until discovery has had a chance to discover. (Again, this is what --eat-early-packets is meant to make happen, but this is a more direct way of doing it.)
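For concreteness, here's roughly where that check would go (an untested sketch assuming eel's l2_learning structure, with openflow.discovery loaded so that core.openflow_discovery exists):

import time
from pox.core import core

class LearningSwitch (object):
  def _handle_PacketIn (self, event):
    # Ignore all traffic until discovery has had ~1.5 send cycles to
    # find links.  connect_time is when this switch connected to POX;
    # send_cycle_time is how long one full LLDP sweep takes.
    if (time.time() - event.connection.connect_time) \
        <= 1.5 * core.openflow_discovery.send_cycle_time:
      return
    # ... the original l2_learning forwarding logic continues here ...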
By discovery timeouts, I meant the --link-timeout parameter to openflow.discovery. Increasing it may help things. It's currently coupled to how fast discovery cycles (sends packets out all the ports), which isn't fundamental... you could make a somewhat more "forgiving" version by keeping the cycle time shorter but making the timeouts longer.
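For example, something like this (the 60 is illustrative; --link-timeout is in seconds):
./pox.py forwarding.l2_multi openflow.discovery --link-timeout=60 --eat-early-packets openflow.spanning_tree --no-flood --hold-down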
Hi! Thanks for your answer!
I've tried the other method and it fails too. Here is the complete log. I ran this command:
./pox.py forwarding.l2_multi openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down
I'm not sure I follow. How does the TreeTopo contain any loops? It is just a tree, no loops. Why would it have any problem? (I opted for this topology because I wanted to rule out other possible problems relating to loops in the topology.)
Maybe I'm doing something wrong. I want to simulate a big network (>100 nodes) using Mininet and POX. Maybe running everything in a big tree topo is not the correct way.
Ah. You'd said above that it was a fat tree, which has lots of loops. If it's just a plain tree, then that's not the problem. Maybe the problem is just too much going on at once.
It looked like maybe you're running a ping test between everything before starting your test. Is that right? What are the results from that? Are the pings working?
Since there are no loops, there's no need for discovery/spanning_tree. Try:
./pox.py forwarding.l2_pairs
Hi!
I too think the problem is "too much going on at once". But I don't know if that's solvable. I always create 64 nodes, but I randomly choose 20, 32, or 64 of them. For 20, if I run a ping between the 20 nodes, I don't usually get errors. For 32, if I run a ping between the 32 nodes, I sometimes get errors. For 64, running a ping between the 64 nodes takes ~40 minutes, and the few times I tried it, I got errors.
In all the experiments I sleep for 10 seconds before starting the ping, and then for 10 seconds again after the ping.
Running ./pox.py forwarding.l2_pairs also got me errors.
PS: I was under the impression that Fat Tree didn't have loops either, at least not the Fat Tree used in Maxinet. But the end goal is to test it with a topology with lots of loops, so..
I've just tried to run this example from mininet specifying the POX controller, and I got packet loss during the pings running with ./pox.py forwarding.l2_pairs
and also with ./pox.py forwarding.l2_multi openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down
When people talk about fat trees (at least in my experience in networking), they're almost always referring to the topology discussed by Al-Fares et al., which are also (probably more properly) called "folded Clos networks". Lots of loops.
The fact that you eventually want to run on a topology with lots of loops will hopefully be fine considering the problem you're seeing apparently doesn't have to do with loops. It's definitely easier to use a no-loop network to start off.
Try modifying the example to run the ping tests more than once with a few seconds between. Does it eventually get to 100%?
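For example, something like this in the example script (a sketch; Mininet's pingAll() returns the percentage of dropped packets):

import time

for attempt in range(5):
  dropped = network.pingAll()
  if dropped == 0:
    break          # full connectivity reached
  time.sleep(5)    # give learning/discovery a moment before retrying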
Sorry if I was not clear enough. When I ran the test with 64 nodes, none of the packets got lost, but I still got connection errors on my tests.
In the example above, I just got some errors the first time I ran it. Each full ping test takes ~40 minutes, with my computer at 100% CPU.
Okay, let's get on the same page.
First, pull the very latest version of POX eel. I just pushed a fix to a dumb bug.
Modify the Mininet treeping64 example. You may have already done some of this. Change the line
result = network.run( network.pingAll )
so that we can insert a pause. Should be something like...
network.start()
time.sleep(20) # Give discovery time to cycle
result = network.pingAll()
network.stop()
Then try the modified example with "forwarding.l2_pairs" and "openflow.discovery forwarding.l2_multi". Both of these work fine here. Takes a couple minutes to complete in either case.
Hi! Thank you very much for taking the time to test this out.
I can't see your push on github.
Ugh, I can't reproduce it now. But I swear that it wasn't working yesterday.
The other test I have been doing still fails, however :(. (It's more complex than a simple ping; it requires establishing TCP connections between nodes.)
I was able to run 64 nodes on my computer! For this test I did a full ping between all pairs of nodes, and waited 20 seconds before and after the ping.
It seems that the problem is related to timing, then. Do you know what can be causing this? Is it possible to be sure when I can start sending packets?
Sorry, I pushed it to my fork, not upstream. I've pushed it upstream now too. I think you should probably upgrade to it. The issue fixed here could definitely mess up your experiments in a somewhat-hard-to-reproduce-reliably way (e.g., depending on when Mininet and POX start up relative to each other).
And right; I understood that your real test was more complex than a simple ping. I wanted to start with a simpler test and work our way up.
Yes, I also believe it has to do with timing, but I don't know exactly what aspect yet. There are at least two possibilities: (1) your test starts before discovery and the forwarding tables have had a chance to get set up, or (2) POX can't keep up with everything once the test is running.
If the problem is (1), it's just a matter of waiting a bit to start your test.
If the problem is (2), you could do a number of things. For example, try to improve the performance (e.g., using --unthreaded-sh), or use a different forwarding/routing approach (several possibilities included with POX, or write your own... there are many "better" ways to do it than POX includes, particularly if you're willing to use stuff beyond plain OpenFlow 1.0), or you could "cheat" a bit. This last one can be pretty effective. The idea is to use learning, but to "prime" learning before actually using it. e.g., use l2_pairs, and before you start your real test, you have every host send a single broadcast ping packet or something (perhaps with a small delay between each one). This will give each switch a chance to learn each host so that when the real experiment begins, all the tables are ready. What solution is best depends on what you're trying to do.
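For example, a sketch of that priming step (a hypothetical helper; assumes a Mininet net object and Linux ping's -b broadcast flag):

import time

def prime_learning (net, delay=0.2):
  # One broadcast ping per host, so every switch on the path learns
  # the host's address before the real experiment starts.
  for host in net.hosts:
    host.cmd('ping -b -c 1 -W 1 255.255.255.255')
    time.sleep(delay)  # small gap so the controller isn't swamped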
I too noticed that POX has some troubles with fattree topos because of loops/discovery timeouts. We worked around this in our experiments by using a custom spanning tree "discovery" which knew exactly the topology from the start and pre-configured the spanning tree accordingly.
Yeah. I actually think the "automatic" discovery isn't the right solution under many circumstances and that giving the controller the topology ahead of time is the right approach. (I believe this for operational networks as well.) For that matter, a spanning tree isn't really the right solution for networks with path diversity. It's all just the "easy" way in some sense.
That said, there have been a number of bugs/problems with POX discovery which have gotten fixed over time (including the push I just made!). I've also made a number of modifications which have never gotten pushed which can be improvements in some scenarios. For example, if you're doing experiments that don't involve failures, disabling link timeouts can be useful: discovery is used to find the topology, but then you don't need to worry about links timing out due to overload. Similarly, you can set the link timeout to be much longer (while keeping the LLDP cycle time the same), and use port events for removing links (with the assumption that link failures are detectable this way; valid in some but not all cases).
(I rarely use any of this stuff anymore, but when I do, it's almost always a combination of loading the topology from a file and then using port events and slow LLDP liveness probes via a modified discovery. The Mininet-like-thing I use can actually read the same topology files I use in the controller, which makes this pretty convenient.)
Hello,
I get a similar error on the latest pulled eel, with a topology of 75 switches and 2 hosts and quite a few loops.
I have used the command ./pox.py forwarding.l2_learning openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down
The gist of the log can be found here
I have a topology of 75 switches with quite a few loops. As a pre-step to experiments, I am trying to do a net.pingAll,
which would eventually do a ping between the 2 hosts involved. The ping does not go through, and I get the same error as reported earlier here.
The treeping64.py Mininet example pings work without issues.
One thing the POX log doesn't show by default which would be helpful is the timestamp.
It looks like discovery is timing out links that it shouldn't, but it's hard to say when this happens and that doesn't help trying to understand why.
Some thoughts:
My observations: I added a time.sleep(20) just before the pingAll; not sure if it would be helpful. The link timeout that I am using for this case is 50. I also install a flow in my _handle_ConnectionUp function. Is this behavior consistent with what you had suggested:
def _handle_ConnectionUp (self, event):
  log.info("_handle_ConnectionUp Function")
  my_match = of.ofp_match()
  my_match.in_port = None
  my_match.dl_dst = None
  my_match.dl_src = None
  my_match.dl_vlan = None
  my_match.dl_type = None
  my_match.nw_tos = None
  my_match.nw_proto = None
  my_match.nw_src = None
  my_match.nw_dst = None
  my_match.tp_src = None
  my_match.tp_dst = None
  msg = of.ofp_flow_mod(command=of.OFPFC_ADD,
                        hard_timeout=50,
                        match=my_match)
  event.connection.send(msg)
I have also set inband=False in the switch declaration. How do I ensure OVS fail mode is secure?
With the above changes, I still get the same behavior as I used to.
Thanks for following up. Yes, your connection up handler looks fine (though note that an easier way would be not mentioning the match at all -- it defaults to all wildcards).
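That is, something like this should be equivalent (untested sketch):

def _handle_ConnectionUp (self, event):
  # ofp_flow_mod's match defaults to all wildcards, so no explicit
  # ofp_match is needed for a match-everything entry.
  msg = of.ofp_flow_mod(command=of.OFPFC_ADD, hard_timeout=50)
  event.connection.send(msg)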
time.sleep(20) is not going to be sufficient given the entry you installed in your connection up handler. As a conservative test, try like... time.sleep(70).
And with the new changes, can you try capturing the log again? In particular, we are interested to see if and when discovery is still timing out links. It would be helpful to include timestamps. Here's a starting place for a log format based on an example in the POX manual:
log --format="[%(asctime)s %(name)s %(levelname)s] %(message)s" --datefmt="%H:%M:%S"
Also, is your topology easy to scale? If so, is there some size at which it works fine and some size at which it starts to fail? And can you describe your topology (so that one might attempt to recreate your failing experiment)?
The updated log can be found here. The command used was ./pox.py forwarding.l2_learning openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down ext.coronet log --format="[%(asctime)s] %(message)s" --datefmt="%H:%M:%S"
My topology represents that of an ISP in the US called Coronet.
The JSON file that contains the adjacency matrix and the Mininet script that I use can be found here.
I will try to work with getting results for scaling the nodes and get back to you.
Some results regarding scaling of nodes here:
10 nodes - 0 pings
20 nodes - pings happened
25 nodes - 0 pings
30 nodes - 0 pings
My understanding is that you aren't intentionally having a bunch of DNS queries or anything at this point, but looking at the log, there are quite a few. It quickly balloons from 25 per second right when the "quiet" table entry expires to over 4,000 eight seconds later:
25 [14:12:21]
386 [14:12:22]
480 [14:12:23]
749 [14:12:24]
667 [14:12:25]
1045 [14:12:26]
1888 [14:12:27]
3430 [14:12:28]
4311 [14:12:29]
It seems very likely that these are replicated duplicates. The most logical explanation is that a proper spanning tree is not being established. The likely reasons would seem to be 1) failure to discover the topology properly, 2) an algorithmic problem computing the spanning tree, or 3) a problem "executing" the tree (that is, something like a problem disabling the appropriate ports).
I'm guessing it's not 1, but this should be easily verifiable by looking at the link detect messages in the beginning of the log and comparing against the adjacency matrix.
2 would be easier to evaluate if we had more info. Line 108 in spanning_tree.py has an "if" statement that always evaluates to False. Can you flip it to True and set the POX log level to DEBUG and rerun the experiment?
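To set the log level from the commandline, POX's log.level component can be appended to the command you've been using, e.g.:
./pox.py forwarding.l2_learning openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down ext.coronet log.level --DEBUG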
Reran the case after flipping the if statement to True and the log can be found here
Also, the link detect messages seem to match the number of switches, so it might not be 1)
A couple comments...
The most recent log file is missing the beginning, which isn't ideal, since I don't think all the link discoveries are in it. Do you have the beginning?
This is relatively minor, but the log component invocation I gave includes more data than the one in the POX manual: log --format="[%(asctime)s %(name)s %(levelname)s] %(message)s" --datefmt="%H:%M:%S"
I haven't tested it, but I think that works. Not a huge deal, but slightly more convenient for searching.
Yes, I have the beginning and have updated the gist with the log. Working on getting the log with the new format.
Are you sure the topology is correct? There seem to be many more links than the figure you posted above would indicate.
The log with the new format using command ./pox.py forwarding.l2_learning openflow.discovery --eat-early-packets openflow.spanning_tree --no-flood --hold-down ext.coronet log --format="[%(asctime)s %(name)s %(levelname)s] %(message)s" --datefmt="%H:%M:%S"
can be found here
I think the topology is correct, a total of 75 switches (and 2 hosts) and 198 links.
I think there are actually twice that many (unidirectional) links. I think maybe you're wiring them up double. For example:
00-00-00-00-00-71.1  00-00-00-00-00-22.6
00-00-00-00-00-71.3  00-00-00-00-00-22.8
And:
00-00-00-00-00-71.2  00-00-00-00-00-48.4
00-00-00-00-00-71.4  00-00-00-00-00-48.6
I actually have no idea if the existing spanning_tree component deals with this case, but it wouldn't surprise me if it didn't.
(Wiring them up double like that is easy to do if you're iterating the neighbors on each node. An easy way to fix it is often to skip the link if node2 < node1.)
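A sketch of that fix (hypothetical names; assumes an adjacency matrix adj and a list of Mininet switches indexed the same way):

for n1 in range(len(switches)):
  for n2 in range(len(switches)):
    # Only wire the upper triangle so each undirected link is added once.
    if adj[n1][n2] and n1 < n2:
      net.addLink(switches[n1], switches[n2])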
Sorry, my mistake; there should only be 198 unidirectional links. I had considered the whole adjacency matrix when adding links. I made the necessary change, only considering the upper triangular part of the adjacency matrix, and the new log is provided here.
It looks like we lost the DEBUG level in there, so we don't get to see the tree...
Oops, sorry again. I included the DEBUG log level, and the updated log is here.
So the topology that's discovered is isomorphic with the original coronet graph. I think possibility 1 is ruled out.
The tree it comes up with does seem to be a tree on the graph... but it's incomplete. Offhand, this seems like it should cause fewer loops and not more and you seem to be having the problem of more, but either way, this looks like it may be a bug.
Can you try commenting out lines 159-164 in spanning_tree? You want it to force a tree update on every link event. The discovery rate is currently set low enough that this won't be a problem.
Incidentally, in the grand tradition of such things, I just tried this and it seemed to work just fine. Whether that's because I'm using a slightly newer POX (now pushed), because I'm not using Mininet, or just because that is always the way of such things, I couldn't possibly guess. :)
I tried commenting out lines 159-164 of spanning_tree.py and I still don't see pings going through.
The log is provided here
FYI, we always had problems with discovery on more than, let's say, 10 switches. Somehow something always times out, or a packet is lost, or POX is just too slow to handle it. Our usual solution is to include our own discovery component which knows the topology exactly (it reads it from the same file as Mininet). Peter
@ppershing Would you be willing to provide an example of your custom discovery script for me to look at?
@ppershing: I think that's the right approach anyway when possible.
Though the log from @lordlabakdas looked like maybe there's a legit bug. Unfortunately, recreating it locally wasn't as easy as recreating the topology... 75 switches no problem here.
Some good news :) I am not sure what changed from the log I posted yesterday to now (other than a reboot), but I am able to ping successfully between the couple of hosts.
Well that's a start, at least!
Is your CPU usage super high (indicating a loop)?
Hi, whenever I run POX with the command "pox.py forwarding.l2_multi openflow.discovery openflow.spanning_tree --no-flood --hold-down" on a fat tree topology with 50-80 hosts, I can't make them communicate. Some can communicate, but most can't. I think this is mostly due to the link timeout; however, even when I set a larger link timeout the problem isn't entirely solved. Is there a way to get around this?
Some of the errors I also get are:
ERROR:openflow.of_01:[00-00-00-00-00-20 457] OpenFlow Error:
[00-00-00-00-00-20 457] Error: header:
[00-00-00-00-00-20 457] Error: version: 1
[00-00-00-00-00-20 457] Error: type: 1 (OFPT_ERROR)
[00-00-00-00-00-20 457] Error: length: 36
[00-00-00-00-00-20 457] Error: xid: 112392
[00-00-00-00-00-20 457] Error: type: OFPET_BAD_REQUEST (1)
[00-00-00-00-00-20 457] Error: code: OFPBRC_BUFFER_UNKNOWN (8)
[00-00-00-00-00-20 457] Error: datalen: 24
[00-00-00-00-00-20 457] Error: 0000: 01 0d 00 18 00 01 b7 08 00 00 03 4f 0000 08 |...........O....|
[00-00-00-00-00-20 457] Error: 0010: 00 00 00 08 ff fb 00 00 |........ |
l2_multi also throws the error "packet arrived without flow".
Thanks.