Closed tnelson closed 10 years ago
hey Tim,
thanks for discovering a way to reproduce this! it's been nagging at me since I hate when the testbed network crashes randomly. :-(
while I haven't figured out why Lwt is still consuming the stack traces (even though I thought we fixed that...), I did see the following helpful line when I run the test:
[platform0x01] error sending: Connection reset by peer (in write)
basically, the switch is disconnecting [1] and while ocaml-openflow is catching the exception on write, it does not catch the exception on the subsequent call to close the socket. I will have a patch shortly, but this is symptomatic of a larger problem, which is that ocaml-openflow has been debugged for OpenFlow 1.3 switches, but not for OpenFlow 1.0 switches, so it may take me a bit longer.
Andrew
[1] why is the switch disconnecting? if we look at /var/log/openvswitch/ovs-vswitchd.log we see many many lines like:
Nov 19 16:20:54|860915|dpif|WARN|system@s1: recv failed (No buffer space available)
which, according to this thread, http://openvswitch.org/pipermail/discuss/2012-April/006999.html, is due to the high rate of packet arrivals. updating the version of Open vSwitch we use should help with this. (we are using version 1.4.3 in the VM and on the testbed. the above referenced patch should be in 1.7; the latest are 1.9.3, 1.11, and 2.0)
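to get a feel for how often this happens, a rough Python sketch like the one below could count those warnings. it assumes the log fields are separated by "|" in the order timestamp, sequence number, module, level, message, which matches the line quoted above but is otherwise an assumption; the function names are made up for illustration.

```python
from collections import Counter


def parse_vlog_line(line):
    """Split one log line into fields; return None if it does not fit.

    Assumed layout (taken from the sample line, not from a spec):
    timestamp|sequence|module|level|message
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": seq,
        "module": module,
        "level": level,
        "message": message,
    }


def count_warnings(lines, needle="recv failed"):
    """Count WARN lines whose message contains `needle`, grouped by module."""
    counts = Counter()
    for line in lines:
        rec = parse_vlog_line(line)
        if rec and rec["level"] == "WARN" and needle in rec["message"]:
            counts[rec["module"]] += 1
    return counts


if __name__ == "__main__":
    sample = ("Nov 19 16:20:54|860915|dpif|WARN|"
              "system@s1: recv failed (No buffer space available)")
    print(count_warnings([sample, sample, "not a log line"]))
```

in practice the lines would come from iterating over the open log file rather than a hard-coded list.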
ok, I have tried lots and lots of ways to catch the error which comes from the call to close the file descriptor. at this point, I suspect a bug in Lwt. :-(
for now, I have made this change to ocaml-openflow https://github.com/adferguson/ocaml-openflow/commit/31c323b1a32c6bb9ae8705d549f8dbe51e4f2f17 to comment-out the call to close the file descriptor
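for reference, the shape of the fix I'm after (as opposed to just commenting out the close) is to guard the write and the close separately, so a reset on one does not escape through the other. below is a minimal sketch of that pattern in Python, not the actual ocaml-openflow or Lwt code; `send_then_close` and `ResetPeer` are hypothetical names, and `ResetPeer` just simulates a peer that has already reset the connection.

```python
import errno


def send_then_close(conn, payload):
    """Send payload, then close conn, guarding each step on its own.

    Returns a list of (step, errno_name) for every step that failed,
    so the caller can log the failures instead of crashing.
    """
    failures = []
    try:
        conn.sendall(payload)
    except OSError as exc:
        failures.append(("write", errno.errorcode.get(exc.errno, "UNKNOWN")))
    # The close is attempted even if the write failed, and is guarded too.
    try:
        conn.close()
    except OSError as exc:
        failures.append(("close", errno.errorcode.get(exc.errno, "UNKNOWN")))
    return failures


class ResetPeer:
    """Hypothetical stand-in for a socket whose peer has already reset."""

    def sendall(self, payload):
        raise ConnectionResetError(errno.ECONNRESET, "Connection reset by peer")

    def close(self):
        raise ConnectionResetError(errno.ECONNRESET, "Connection reset by peer")


if __name__ == "__main__":
    print(send_then_close(ResetPeer(), b"hello"))
```

whether the real close in Lwt can be wrapped this way is exactly what I have not managed to confirm yet.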
Full error:
Fatal error: exception Unix.Unix_error(Unix.ECONNRESET, "write", "")
Raised by primitive operation at file "src/unix/lwt_unix.ml", line 673, characters 43-71
Called from file "src/unix/lwt_unix.ml", line 556, characters 14-23
Can be reproduced with:
./flowlog.native -verbose 1 examples/Arp_Cache.flg
sudo mn --controller=remote --topo=tree,fanout=8,depth=1 --mac --arp
xterm h1
arping -w 5000 h2
At roughly 15000 packets, should produce the error.
Something odd:
arping -w 1000000 h2 sends one ping per second, and proceeds normally. But:
arping -w 500000 h2 sends MANY MANY pings per second, overloading the controller rather like we saw at -w 5000. What's going on here?