openhpi2 / Open-HPI

Open HPI is an open source implementation of the SA Forum's Hardware Platform Interface (HPI). HPI provides an abstracted interface to managing computer hardware, typically for chassis and rack based servers

openhpi OA switchover issues #2478

Closed openhpi2 closed 8 years ago

openhpi2 commented 11 years ago

Hi,

with 3.2.0 we are able to observe the following phenomenon in a BladeSystem c7000 Enclosure G2 configured as follows (note that ENCLOSURE_IP_MODE and LLF are set to non-default values):
SET OA NAME 1 hp-bladectr-1
SET IPCONFIG STATIC 1 10.65.209.28 255.255.252.0 10.65.211.254 10.65.255.201 10.11.255.27
SET NIC AUTO 1
SET OA NAME 2 hp-bladectr-1-bkp
SET IPCONFIG STATIC 2 10.65.209.33 255.255.252.0 10.65.211.254 10.65.255.201 10.11.255.27
SET NIC AUTO 2
ENABLE ENCLOSURE_IP_MODE
SET LLF INTERVAL 60
ENABLE LLF
DISABLE ROUTER ADVERTISEMENTS
DISABLE DHCPV6
DISABLE IPV6

OA Firmware Ver. : 3.71 Dec 07 2012

Once openhpi 3.2.0 is started (configured with only the active IP, because ENCLOSURE_IP_MODE is set) and an OA switchover is forced, at least one thread gets stuck constantly trying to rediscover the STANDBY OA, as shown below (note that this issue takes a few tries to reproduce; it does not happen 100% of the time):
oa_soap: CRIT: oa_soap_callsupport.c:978: OA SOAP error 139: Not a valid request while running in standby mode.
oa_soap: CRIT: oa_soap_re_discover.c:695: Get blade info failed
oa_soap: CRIT: oa_soap_re_discover.c:172: Re-discovery of server blade failed
oa_soap: CRIT: oa_soap_event.c:417: Re-discovery failed for OA 10.65.209.33
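To check how often the daemon hits this loop in a long debug log, the CRIT lines can be tallied per source location. The following is a minimal sketch, assuming the log keeps the `oa_soap: LEVEL: file:line: message` layout quoted above; the names `parse_line` and `count_crit_sites` are made up for this illustration and are not part of OpenHPI.

```python
import re
from collections import Counter

# Matches lines such as:
#   oa_soap: CRIT: oa_soap_event.c:417: Re-discovery failed for OA 10.65.209.33
# An optional leading number (the log-file line number used in this report) is skipped.
LOG_LINE = re.compile(
    r"^(?:\d+\s+)?oa_soap: (?P<level>[A-Z]+): "
    r"(?P<file>[\w.]+):(?P<line>\d+): (?P<msg>.*)$"
)


def parse_line(text):
    """Return a dict with level, file, line and msg, or None if the line does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    return record


def count_crit_sites(lines):
    """Count CRIT messages per (source file, source line) pair."""
    counts = Counter()
    for text in lines:
        record = parse_line(text)
        if record is not None and record["level"] == "CRIT":
            counts[(record["file"], record["line"])] += 1
    return counts


# Lines copied from this report; the DBG line is ignored by the counter.
SAMPLE = [
    "oa_soap: DBG: oa_soap_event.c:259: Re-try the getAllEvents SOAP call",
    "oa_soap: CRIT: oa_soap_callsupport.c:978: OA SOAP error 139: "
    "Not a valid request while running in standby mode.",
    "oa_soap: CRIT: oa_soap_re_discover.c:695: Get blade info failed",
    "oa_soap: CRIT: oa_soap_re_discover.c:172: Re-discovery of server blade failed",
    "oa_soap: CRIT: oa_soap_event.c:417: Re-discovery failed for OA 10.65.209.33",
]

if __name__ == "__main__":
    for site, count in count_crit_sites(SAMPLE).most_common():
        print(site, count)
```

A repeated count for `oa_soap_event.c:417` in a real log would suggest the re-discovery was retried against the same address several times.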

After that two things can happen:

I will attach the full log to the case. Below you will find the steps I took.

regards, Michele

1) gdb /usr/local/sbin/openhpid
Reading symbols from /usr/local/sbin/openhpid...done.
(gdb) set args -c /etc/openhpi/openhpi.conf -n -v >& /tmp/3.2.0-20130413.log
(gdb) r

Wait for the discovery to finish, i.e. until we see only HEARTBEAT events. Then force the takeover (done at about Fri Apr 12 22:01:20 CEST 2013).

2) The failure appears at line 20751 of the log, shortly after the OA switchover triggered via "force takeover":
oa_soap: DBG: oa_soap_event.c:258: getAllEvents call failed, may be due to OA switchover
oa_soap: DBG: oa_soap_event.c:259: Re-try the getAllEvents SOAP call

oa_soap: DBG: oa_soap_callsupport.c:669: OA request(1):
POST /hpoa HTTP/1.1
Host: 10.65.209.33:443 <-- here we try the standby OA right away !!! This is dangerous as we don't know yet which OA has that IP

We ask the standby OA the following:
oa_soap: DBG: oa_soap_callsupport.c:680: OA request(2):
<?xml version="1.0"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:hpoa="hpoa.xsd">
<SOAP-ENV:Header><wsse:Security SOAP-ENV:mustUnderstand="true">
<hpoa:HpOaSessionKeyToken>
<hpoa:oaSessionKey>b518615c20ceb860</hpoa:oaSessionKey>
</hpoa:HpOaSessionKeyToken>
</wsse:Security>
</SOAP-ENV:Header>
<SOAP-ENV:Body>
<hpoa:getAllEvents><hpoa:pid>2837</hpoa:pid><hpoa:waitTilEventHappens>1</hpoa:waitTilEventHappens><hpoa:lcdEvents>0</hpoa:lcdEvents></hpoa:getAllEvents>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

and we get:
oa_soap: DBG: oa_soap_callsupport.c:708: OA response(0):
HTTP/1.1 500 Internal Server Error^M
Date: Sat, 13 Apr 2013 08:57:02 GMT^M
Server: Apache^M
Connection: close^M
Content-Length: 1299^M
Content-Type: application/soap+xml; charset=utf-8^M
^M

oa_soap: DBG: oa_soap_callsupport.c:728: OA response(1):
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://www.w3.org/2003/05/soap-envelope" xmlns:SOAP-ENC="http://www.w3.org/2003/05/soap-encoding" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:hpoa="hpoa.xsd">
<SOAP-ENV:Header>
<wsse:Security>
<hpoa:HpOaSessionKeyToken>
<hpoa:oaSessionKey>b518615c20ceb860</hpoa:oaSessionKey>
</hpoa:HpOaSessionKeyToken>
</wsse:Security>
</SOAP-ENV:Header>
<SOAP-ENV:Body>
<SOAP-ENV:Fault>
<SOAP-ENV:Code>
<SOAP-ENV:Value>SOAP-ENV:Receiver</SOAP-ENV:Value>
</SOAP-ENV:Code>
<SOAP-ENV:Reason>
<SOAP-ENV:Text>Onboard Administrator Error</SOAP-ENV:Text>
</SOAP-ENV:Reason>
<SOAP-ENV:Detail>
<hpoa:faultInfo>
<hpoa:errorType>ONBOARD_ADMINISTRATOR</hpoa:errorType>
<hpoa:errorCode>201</hpoa:errorCode>
<hpoa:operationName>getAllEvents</hpoa:operationName>
<hpoa:errorText>Could not open event pipe for reading.</hpoa:errorText>
</hpoa:faultInfo>
</SOAP-ENV:Detail>
</SOAP-ENV:Fault>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
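For anyone comparing many of these fault responses, the interesting fields (error code, operation, error text) can be pulled out of the SOAP body mechanically. This is a minimal sketch using Python's standard library, assuming the namespace URIs shown in the response above; `parse_oa_fault` is a hypothetical helper, not a function of the oa_soap plugin, and the sample body is a trimmed copy of the fault quoted in this report.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the OA response quoted in this report.
NS = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "hpoa": "hpoa.xsd",
}

# Error code seen in this report when a request is sent to an OA in standby mode.
STANDBY_ERROR_CODE = 139


def parse_oa_fault(body):
    """Return the hpoa:faultInfo fields as a dict, or None if there is no fault."""
    root = ET.fromstring(body)
    info = root.find(".//soap:Fault/soap:Detail/hpoa:faultInfo", NS)
    if info is None:
        return None

    def text(tag):
        return info.findtext("hpoa:" + tag, namespaces=NS)

    code = text("errorCode")
    return {
        "type": text("errorType"),
        "code": int(code) if code is not None else None,
        "operation": text("operationName"),
        "text": text("errorText"),
    }


# Trimmed copy of the fault above (header and unused namespaces removed).
SAMPLE_FAULT = """
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://www.w3.org/2003/05/soap-envelope"
                   xmlns:hpoa="hpoa.xsd">
  <SOAP-ENV:Body>
    <SOAP-ENV:Fault>
      <SOAP-ENV:Detail>
        <hpoa:faultInfo>
          <hpoa:errorType>ONBOARD_ADMINISTRATOR</hpoa:errorType>
          <hpoa:errorCode>201</hpoa:errorCode>
          <hpoa:operationName>getAllEvents</hpoa:operationName>
          <hpoa:errorText>Could not open event pipe for reading.</hpoa:errorText>
        </hpoa:faultInfo>
      </SOAP-ENV:Detail>
    </SOAP-ENV:Fault>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

if __name__ == "__main__":
    fault = parse_oa_fault(SAMPLE_FAULT)
    print(fault)
    print("standby rejection:", fault["code"] == STANDBY_ERROR_CODE)
```

Separating error 201 (event pipe) from error 139 (standby mode) this way makes it easier to see at which point in the log the requests start landing on the standby OA.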

After a couple of the above, around line 20898 we send:
Host: 10.65.209.33:443
<hpoa:getOaStatus><hpoa:bayNumber>2</hpoa:bayNumber></hpoa:getOaStatus>
<-- So here we seem to ask IP address 10.65.209.33 for the status of OA nr. 2 (is it still the IP of bay 2? Probably not).

and we get:
<hpoa:getOaStatusResponse>
<hpoa:oaStatus>
<hpoa:bayNumber>2</hpoa:bayNumber>
<hpoa:oaName>hp-bladectr-1-bkp</hpoa:oaName> <-- We get the reply that OA nr. 2 is ACTIVE now (this is probably OA 1 telling us)
<hpoa:oaRole>ACTIVE</hpoa:oaRole>

oa_soap: CRIT: oa_soap_utils.c:702: OA 10.65.209.33 has become Active (line 20977)

So my fear here is that we opened a connection with 10.65.209.33 and it tells us correctly that OA 2 is ACTIVE. But we keep connecting to OA 1 (because we probably still have the wrong IP address).
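To make the concern concrete: the role should be tied to the address that actually answered, not to the address we assumed belongs to a bay. The sketch below parses `getOaStatusResponse` bodies and picks the address whose reply reports ACTIVE. It is only an illustration of the idea and not how the oa_soap plugin or the eventual patch works; `parse_oa_status` and `pick_active` are hypothetical names, the reply for 10.65.209.28 is invented for the example, and the role string `STANDBY` is an assumption.

```python
import xml.etree.ElementTree as ET

HPOA = {"hpoa": "hpoa.xsd"}  # namespace URI as declared in the logged envelopes


def parse_oa_status(xml_text):
    """Extract bay number, OA name and role from a getOaStatusResponse body."""
    root = ET.fromstring(xml_text)
    status = next(root.iter("{hpoa.xsd}oaStatus"), None)
    if status is None:
        return None
    bay = status.findtext("hpoa:bayNumber", namespaces=HPOA)
    return {
        "bay": int(bay) if bay else None,
        "name": status.findtext("hpoa:oaName", namespaces=HPOA),
        "role": status.findtext("hpoa:oaRole", namespaces=HPOA),
    }


def pick_active(replies):
    """replies maps an IP address to the raw response received from that address.

    Returns (address, status) for the first reply reporting ACTIVE, else None.
    """
    for address, xml_text in replies.items():
        status = parse_oa_status(xml_text)
        if status is not None and status["role"] == "ACTIVE":
            return address, status
    return None


def _status_xml(bay, name, role):
    # Builds a complete, minimal response; the real one carries more fields.
    return (
        '<hpoa:getOaStatusResponse xmlns:hpoa="hpoa.xsd"><hpoa:oaStatus>'
        f"<hpoa:bayNumber>{bay}</hpoa:bayNumber>"
        f"<hpoa:oaName>{name}</hpoa:oaName>"
        f"<hpoa:oaRole>{role}</hpoa:oaRole>"
        "</hpoa:oaStatus></hpoa:getOaStatusResponse>"
    )


SAMPLE_REPLIES = {
    # Hypothetical reply, invented for this example.
    "10.65.209.28": _status_xml(1, "hp-bladectr-1", "STANDBY"),
    # Values taken from the response quoted in this report.
    "10.65.209.33": _status_xml(2, "hp-bladectr-1-bkp", "ACTIVE"),
}

if __name__ == "__main__":
    print(pick_active(SAMPLE_REPLIES))
```

Note that this only tells us what each address reported at the time of the query; during a switchover with ENCLOSURE_IP_MODE enabled the answer may change between two requests, which is exactly the window this report is about.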

So now we have the re-discovery, and this I think confirms my theory:
21028 oa_soap: CRIT: oa_soap_re_discover.c:140: Re-discovery started
21029 oa_soap: DBG: oa_soap_callsupport.c:669: OA request(1):
21030 POST /hpoa HTTP/1.1
21031 Host: 10.65.209.33:443 <-- Rediscovery happens by talking to this IP which at this point probably belongs to OA nr 1????
21032 Content-Type: application/soap+xml; charset="utf-8"
21033 Content-Length: 749
21034
21035
21036
21037 oa_soap: DBG: oa_soap_callsupport.c:680: OA request(2):
21038 <?xml version="1.0"?>
21039 <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:hpoa="hpoa.xsd">
21040 <SOAP-ENV:Header><wsse:Security SOAP-ENV:mustUnderstand="true">
21041 <hpoa:HpOaSessionKeyToken>
21042 <hpoa:oaSessionKey>b518615c20ceb860</hpoa:oaSessionKey>
21043 </hpoa:HpOaSessionKeyToken>
21044 </wsse:Security>
21045 </SOAP-ENV:Header>
21046 <SOAP-ENV:Body>
21047 <hpoa:getBladeInfo><hpoa:bayNumber>1</hpoa:bayNumber></hpoa:getBladeInfo>
21048 </SOAP-ENV:Body>
21049 </SOAP-ENV:Envelope>

And at this point we are told off by the OA:
21052 oa_soap: DBG: oa_soap_callsupport.c:708: OA response(0):
21053 HTTP/1.1 500 Internal Server Error^M
21054 Date: Sat, 13 Apr 2013 08:57:16 GMT^M
21055 Server: Apache^M
21056 Connection: close^M
21057 Content-Length: 1370^M
21058 Content-Type: application/soap+xml; charset=utf-8^M
21059 ^M
21060
21061
21062 oa_soap: DBG: oa_soap_callsupport.c:728: OA response(1):
21063 <?xml version="1.0" encoding="UTF-8"?>
21064 <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://www.w3.org/2003/05/soap-envelope" xmlns:SOAP-ENC="http://www.w3.org/2003/05/soap-encoding" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:hpoa="hpoa.xsd">
21065 <SOAP-ENV:Header>
21066 <wsse:Security>
21067 <hpoa:HpOaSessionKeyToken>
21068 <hpoa:oaSessionKey>b518615c20ceb860</hpoa:oaSessionKey>
21069 </hpoa:HpOaSessionKeyToken>
21070 </wsse:Security>
21071 </SOAP-ENV:Header>
21072 <SOAP-ENV:Body>
21073 <SOAP-ENV:Fault>
21074 <SOAP-ENV:Code>
21075 <SOAP-ENV:Value>SOAP-ENV:Receiver</SOAP-ENV:Value>
21076 </SOAP-ENV:Code>
21077 <SOAP-ENV:Reason>
21078 <SOAP-ENV:Text>Onboard Administrator Error</SOAP-ENV:Text>
21079 </SOAP-ENV:Reason>
21080 <SOAP-ENV:Detail>
21081 <hpoa:faultInfo>
21082 <hpoa:errorType>ONBOARD_ADMINISTRATOR</hpoa:errorType>
21083 <hpoa:errorCode>139</hpoa:errorCode>
21084 <hpoa:operationName>getBladeInfo</hpoa:operationName>
21085 <hpoa:operationBayNumber>01</hpoa:operationBayNumber>
21086 <hpoa:errorText>Not a valid request while running in standby mode.</hpoa:errorText>
21087 </hpoa:faultInfo>
21088 </SOAP-ENV:Detail>
21089 </SOAP-ENV:Fault>
21090 </SOAP-ENV:Body>
21091 </SOAP-ENV:Envelope>

21093 oa_soap: CRIT: oa_soap_callsupport.c:978: OA SOAP error 139: Not a valid request while running in standby mode.
21094 oa_soap: CRIT: oa_soap_re_discover.c:695: Get blade info failed
21095 oa_soap: CRIT: oa_soap_re_discover.c:172: Re-discovery of server blade failed
21096 oa_soap: CRIT: oa_soap_event.c:417: Re-discovery failed for OA 10.65.209.33

This happens a couple of times until we decide to check with the master IP again and do the re-discovery:
21428 oa_soap: CRIT: oa_soap_re_discover.c:140: Re-discovery started
21429 oa_soap: DBG: oa_soap_callsupport.c:669: OA request(1):
21430 POST /hpoa HTTP/1.1
21431 Host: 10.65.209.28:443
21432 Content-Type: application/soap+xml; charset="utf-8"
21433 Content-Length: 773

Reported by: mbaldessari

openhpi2 commented 11 years ago

Full debug logs 3.2.0-20130412.log.gz

Original comment by: mbaldessari

openhpi2 commented 11 years ago

Thanks for reporting the bug. We see the standby OA messages consistently and are working on a fix. We are also looking at how the enclosure IP mode works. Do you use the OpenHPI-3.2.0 released with the RHEL distribution?

Original comment by: dr_mohan

openhpi2 commented 11 years ago

Hi,

yes this is on RHEL 6.3 x64

regards, M.

Original comment by: mbaldessari

openhpi2 commented 11 years ago

Fix for #3610943 3610943.patch.txt

Original comment by: hemanthreddy

openhpi2 commented 11 years ago

Hi Michele,

Please find attached the patch for this issue. We have tested the patch, and it looks good. Could you please test it at your end?

Thanks, Hemantha Reddy

Original comment by: hemanthreddy

openhpi2 commented 11 years ago

Hi Hemantha,

I tested openhpi with the patch applied. Same config as in the case description:
ENABLE ENCLOSURE_IP_MODE
SET LLF INTERVAL 60

I forced an OA switchover and did not see any failure messages beyond the network-related ones during the switchover. The logs correctly show:
oa_soap: DBG: oa_soap_oa_event.c:564: Enclosure IP Mode is Enabled

And after the switchover the usual commands work correctly (hpi_shell, hpiinv). So ACK to the patch from my side.

regards, Michele

Original comment by: mbaldessari

openhpi2 commented 11 years ago

Thanks Michele, we will check in this patch soon and close this issue.

Thanks, Hemantha Reddy

Original comment by: hemanthreddy

openhpi2 commented 11 years ago

Fixed in trunk Rev #7537.

Original comment by: hemanthreddy

