opennetworkinglab / flowvisor

FlowVisor - A network hypervisor
Other
164 stars 67 forks source link

FlowVisor crash: java.lang.ArrayIndexOutOfBoundsException #254

Open jbsbbn opened 11 years ago

jbsbbn commented 11 years ago

The production FlowVisor at GENI regional network provider Southern Crossroads (SoX) crashed recently. In syslog:

Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR slicer_cfece52e-c0ec-45e6-9781-54b0c385eab4_dpid=0f:fa:34:40:b5:03:14:00 : STARVING: handling event took 4130ms: org.flowvisor.events.FVIOEvent@42ef4994 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: WARN slicer_1c4c697d-44e4-493f-9f09-c1c96b3d0487_dpid=0f:fa:34:40:b5:03:14:00 : Verifying Slice is not over its flow rule limit
Jul  7 01:05:47  1>Jul last message repeated 16 times
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : MAIN THREAD DIED!!! 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ---------------------------- 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ----- exception in thread Thread[main,5,main]::java.lang.ArrayIndexOutOfBoundsException: 57 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFType.valueOf(OFType.java:185) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFMessage.readFrom(OFMessage.java:130) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.factory.BasicFactory.parseMessages(BasicFactory.java:49) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:54) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.io.FVMessageAsyncStream.read(FVMessageAsyncStream.java:86) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:44) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleIOEvent(FVClassifier.java:460) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleEvent(FVClassifier.java:379) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.events.FVEventLoop.doEventLoop(FVEventLoop.java:168) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.run(FlowVisor.java:196) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.main(FlowVisor.java:229) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ---------------------------- 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : restarting after main thread died 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: WARN classifier-dpid=0f:fa:34:40:b5:03:14:00 : tearing down
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing poll loop
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI JSONRPC SSL WebServer on port 8081
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO OFSwitchAcceptor : Listenning on port 6633
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI XMLRPC SSL WebServer on port 8080
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: FATAL none : failed to spawn APIServer 
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : Shutting down config database.
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : Shutting down config database.

They restarted it, and it came back up fine.

What other info can we gather that might help?

alshabib commented 11 years ago

Would it be possible to find the actual message that generated this crash?

The crash occurred in openflowj, which is the library responsible for marshalling/unmarshalling openflow messages. It crashed while trying to parse a message with type value 57 which is strange because openflow (at least standard of) does not have a message with that type value.

All that said, openflowj should not crash when such a message occurs. I'll escalate this to openflowj's maintainer.

On Jul 8, 2013, at 10:37 AM, Josh Smift notifications@github.com wrote:

The production FlowVisor at GENI regional network provider Southern Crossroads (SoX) crashed recently. In syslog:

Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR slicer_cfece52e-c0ec-45e6-9781-54b0c385eab4_dpid=0f:fa:34:40:b5:03:14:00 : STARVING: handling event took 4130ms: org.flowvisor.events.FVIOEvent@42ef4994 Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: WARN slicer_1c4c697d-44e4-493f-9f09-c1c96b3d0487_dpid=0f:fa:34:40:b5:03:14:00 : Verifying Slice is not over its flow rule limit Jul 7 01:05:47 1>Jul last message repeated 16 times Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : MAIN THREAD DIED!!! Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ---------------------------- Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ----- exception in thread Thread[main,5,main]::java.lang.ArrayIndexOutOfBoundsException: 57 Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFType.valueOf(OFType.java:185) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFMessage.readFrom(OFMessage.java:130) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.factory.BasicFactory.parseMessages(BasicFactory.java:49) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:54) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.io.FVMessageAsyncStream.read(FVMessageAsyncStream.java:86) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:44) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleIOEvent(FVClassifier.java:460) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleEvent(FVClassifier.java:379) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.events.FVEventLoop.doEventLoop(FVEventLoop.java:168) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.run(FlowVisor.java:196) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.main(FlowVisor.java:229) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ---------------------------- Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : restarting after main thread died Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: WARN classifier-dpid=0f:fa:34:40:b5:03:14:00 : tearing down Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing poll loop Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI JSONRPC SSL WebServer on port 8081 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO OFSwitchAcceptor : Listenning on port 6633 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI XMLRPC SSL WebServer on port 8080 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: FATAL none : failed to spawn APIServer Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : Shutting down config database. Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : Shutting down config database.

They restarted it, and it came back up fine.

What other info can we gather that might help?

— Reply to this email directly or view it on GitHub.

jbsbbn commented 11 years ago

AA> Would it be possible to find the actual message that generated this crash?

Mm, maybe. To clarify, by "actual message", you mean like what API call was made immediately before the crash?

Would that be in syslog? If not, where would we find it?

alshabib commented 11 years ago

No I was hoping for the actual openflow message. That's probably a long shot but you never know ;)

Do you know what OF devices are connected to that FlowVisor?

On Jul 8, 2013, at 10:55 AM, Josh Smift notifications@github.com wrote:

AA> Would it be possible to find the actual message that generated this crash?

Mm, maybe. To clarify, by "actual message", you mean like what API call was made immediately before the crash?

Would that be in syslog? If not, where would we find it? — Reply to this email directly or view it on GitHub.

jbsbbn commented 11 years ago

AA> No I was hoping for the actual openflow message. That's probably a long shot but you never know ;)

Heh. I don't think anyone would have that, unless SoX happened to be running a packet capture at that time. I'll ask just in case.

AA> Do you know what OF devices are connected to that FlowVisor?

Just one NEC PF5820 at the moment.

nbastin commented 11 years ago

Type 57 may be the vendor message NEC uses for big stats replies. I'm not 100% sure, as I haven't been recording their message types, but they use a lot of custom message types as extensions to support their flowtable being larger than what a lot of OF messages can handle by default (various datatypes are not large enough, etc.).

alshabib commented 11 years ago

Sure but the type should be OFPT_VENDOR and not 57 or something crazy.

On Jul 8, 2013, at 11:06 AM, Nick Bastin notifications@github.com wrote:

Type 57 may be the vendor message NEC uses for big stats replies. I'm not 100% sure, as I haven't been recording their message types, but they use a lot of custom message types as extensions to support their flowtable being larger than what a lot of OF messages can handle by default (various datatypes are not large enough, etc.).

— Reply to this email directly or view it on GitHub.

jbsbbn commented 11 years ago

Sure, but even if the switch is doing something totally wrong with the type (which I don't know enough about this to say whether it is or not), FV shouldn't crash as a result.

alshabib commented 11 years ago

Agreed, I am filing a report on OpenFlowJ. Would you happen to have a packet capture?

jbsbbn commented 11 years ago

Sorry, I thought I'd mentioned this earlier: No, no one happened to be running a packet capture at the time of the crash.

Can you reproduce the bug? Can you get a package capture that way?

Can you do anything to ameliorate this on the FV side, so that FV doesn't die even when it encounters a bug in openflowj?

alshabib commented 11 years ago

Sure, we shouldn't crash when ofj crashes although if a packet isn't parsed correctly for some reason it could cause incorrect fv behaviour for some pathological cases. I don't know what the correct behaviour is here. Need to think about that some more.

Ali Al-Shabibi (sent from handheld)

On 11 juil. 2013, at 18:07, Josh Smift notifications@github.com wrote:

Sorry, I thought I'd mentioned this earlier: No, no one happened to be running a packet capture at the time of the crash.

Can you reproduce the bug? Can you get a package capture that way?

Can you do anything to ameliorate this on the FV side, so that FV doesn't die even when it encounters a bug in openflowj?

— Reply to this email directly or view it on GitHub.

nbastin commented 11 years ago

Drop the packet. You don't know how to slice it, and you're promising to slice the control plane. There's nothing you can do except drop it and log it.