Closed mohandev2 closed 6 years ago
Original comment by: mohandev2
Thank you for filing this bug Sathish. We need to fix this memory leak problem. Looks like there is some problem with the blade. More information on the bug could help. a. Which version of openhpi you are using? b. What is the version of OA firmware? c. If you know more details on the blade, please let's know. Looks like it has some unique problem that we are not handling well.
Original comment by: mohandev2
Dr Mohan,
OpenHPI version we are using is 3.6.1
Pasting the output from ./hpiinv SAF HPI Version B.03.02 (Is this the one?) Some more info: FieldId[1] Product Name : TEXT : BladeSystem c7000 Enclosure G3 FieldId[2] Manufacturer : TEXT : HP FieldId[3] Product Versn : TEXT : 1.00 FieldId[4] Custom Field : TEXT : Telco_Status: TRUE FieldId[5] Custom Field : TEXT : Power_Type: EXTERNAL_DC FieldId[6] Custom Field : TEXT : ENCLOSURE STATUS: OK
Along the memory we have seen errors like, a. Management processor status flipping (Screen shot attached for reference) - on one chassis b. Power related issue on another chassis, as in the below logs
Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5 Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5. Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:453: Failed to discover Server Blade Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:381: Discovery failed for active OA 10.193.69.5:443 Jun 8 10:25:34 cz2 openhpid: oasoap: oasoaputils.c:753: OA 10.193.69.5 has become Active Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5 Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.
Just thinking aloud: Do we delete the resource-id from the RPTable, when the management processor is unresponsive? If so, the re-discovery infinite loop might have been entered just then, and thus the log in loop.
"Jun 8 00:23:57 cz2 openhpid: oasoap: oasoaprediscover.c:923: resource RPT is NULL"
Shouldn't the loop be exited when the rpt is NULL?
Thanks, Sathish
Original comment by: satishlagi
Hi Sathish,
Thanks for the information provided.
From the logs, it looks like there is some problem with that particular blade. I need some more information regarding the server blade. 1) Which model of server blade is placed at bay 5 and what is the iLo Firmwrae version? 2) Did you do any iLo firmaware update on that server blade recently? 3) From when or any specific point, where this problem is seen?
I can suggest you to restart the openhpid and see.
Regards, Shyamala
Original comment by: ShyamalaH
Shyamala,
Copying the info from ./hpiinv FieldId[1] Product Name : TEXT : ProLiant BL460c Gen8 FieldId[2] Manufacturer : TEXT : HP FieldId[3] Product Versn : TEXT : 2.40 Dec 02 2015 FieldId[4] Custom Field : TEXT : MP Model name = iLO4
The last iLo FW update was on Dec 02, 2015
We are seeing this issue from April.
As a temperory recovery mechanism we are restarting openhpid process if it crosses some limit. But, the process keeps recurring.
Also, can you help me in compiling the code, I am getting the following error:
make[4]: Leaving directory '/home/platform/openhpi-3.6.1/src' /bin/bash: LIBTOOL@: command not found Makefile:618: recipe for target 'sahpi_enum_utils.lo' failed make[3]: [sahpi_enum_utils.lo] Error 127 make[3]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils' Makefile:658: recipe for target 'all-recursive' failed make[2]: [all-recursive] Error 1 make[2]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils' Makefile:603: recipe for target 'all-recursive' failed make[1]: [all-recursive] Error 1 make[1]: Leaving directory '/home/platform/openhpi-3.6.1/src' Makefile:496: recipe for target 'all' failed make: [all] Error 2
Thanks, Sathish
Original comment by: satishlagi
Hi Sathish,
Looks like this is the problem with the particular server blade. Just for pointing out exact reason, please help me with results by following below steps. 1) Remove server blade and run openhpid. See the results. 2) After some time, insert server blade back. Confirm blade is powered on & then see the problem still persists.
Coming to compilation part, from the error, looks like there is no libtool installed. Install libtool and then try to compile code. Which OS are you using?
Regards, Shyamala
Original comment by: ShyamalaH
Sathish,
What is the version of OA firmware? If possible attach hpiinv result.
Regards, Shyamala
Original comment by: ShyamalaH
Shyamala,
It will take some time for providing the information, by pulling out the blade and restarting openhpid due to various reasons.
For hpiinv result, it has got a lot of private info, kindly share me your email-id so that can send the info privately.
Thanks, Sathish
Original comment by: satishlagi
Sure.
My e-mail id is shyamala.hirepatt@hpe.com.
Regards, Shyamala
Original comment by: ShyamalaH
Shyamala,
Did you get a chance to go through the hpiinv output? You found anything interesting?
Thanks, Sathish
Original comment by: satishlagi
Hi Sathish,
I could not get much info from hpiinv log. For repeated re-discovery issue, you have to follow the steps mentioned above and get the results to figure out the exact reason.
For the memory leak issue, I have created & attached the patch. Please apply patch, test it and let me know the results.
Regards, Shyamala
Original comment by: ShyamalaH
SaErrorT oh_add_resource(RPTable table, SaHpiRptEntryT entry, void *data, int owndata){ .. if (rptentry->data && rptentry->data != data && !rptentry->owndata) g_free(rptentry->data); rptentry->data = data; .. } rv = oh_add_resource(oh_handler->rptcache, &rpt, hotswap_state, 0);
In oh_add_resource(), the hot_swap_state pointer is stored as the data, in the RPTable. Won't de-allocing on success cause issues?
Original comment by: satishlagi
From the logs provided, looks like an issue is caused by not freeing the memory in discover_server() in some error handling case. While going through code I found some more places where memory was not freed on success, not exactly causing this issue. So I covered those too.
Did you get chance to apply patch and see the results?
Original comment by: ShyamalaH
Shyamala,
First of all, thank you for your support.
I have tested the patch multiple time to be sure. Every time there is a switchover of OA, the memory usage of openhpid is approximately increased by 3 MB. Following logs are printed during the same.
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:370: could not find enum value "EVENT_NETWORK_LINK_STATUS" in "EVENT_HEARTBEAT, EVENT_ENC_STATUS, EVENT_ENC_UID, EVENT_ENC_SHUTDOWN, EVENT_ENC_INFO, EVENT_ENC_NAMES, EVENT_USER_PERMISSION, EVENT_ADMIN_RIGHTS_CHANGED, EVENT_ENC_SHUTDOWN_PENDING, EVENT_ENC_TOPOLOGY, EVENT_FAN_STATUS, EVENT_FAN_INSERTED, EVENT_FAN_REMOVED, EVENT_FAN_GROUP_STATUS, EVENT_THERMAL_STATUS, EVENT_COOLING_STATUS, EVENT_FAN_ZONE_STATUS, EVENT_PS_STATUS, EVENT_PS_INSERTED, EVENT_PS_REMOVED, EVENT_PS_REDUNDANT, EVENT_PS_OVERLOAD, EVENT_AC_FAILURE, EVENT_PS_INFO, EVENT_PS_SUBSYSTEM_STATUS, EVENT_SERVER_POWER_REDUCTION_STATUS, EVENT_INTERCONNECT_STATUS, EVENT_INTERCONNECT_RESET, EVENT_INTERCONNECT_UID, EVENT_INTERCONNECT_INSERTED, EVENT_INTERCONNECT_REMOVED, EVENT_INTERCONNECT_INFO, EVENT_INTERCONNECT_HEALTH_LED, EVENT_INTERCONNECT_THERMAL, EVENT_INTERCONNECT_CPUFAULT, EVENT_INTERCONNECT_POWER, EVENT_INTERCONNECT_PORTMAP, EVENT_BLADE_PORTMAP, EVENT_INTERCONNECT_VENDOR_BLOCK, EVENT_INTERCONNECT_HEALTH_STATE, EVENT_DEMO_MODE, EVENT_BLADE_STATUS, EVENT_BLADE_INSERTED, EVENT_BLADE_REMOVED, EVENT_BLADE_POWER_STATE, EVENT_BLADE_POWER_MGMT, EVENT_BLADE_UID, EVENT_BLADE_SHUTDOWN, EVENT_BLADE_FAULT, EVENT_BLADE_THERMAL, EVENT_BLADE_INFO, EVENT_BLADE_MP_INFO, EVENT_ILO_READY, EVENT_LCD_BUTTON, EVENT_KEYING_ERROR, EVENT_ILO_HAS_IPADDRESS, EVENT_POWER_INFO, EVENT_LCD_STATUS, EVENT_LCD_INFO, EVENT_REDUNDANCY, EVENT_ILO_DEAD, EVENT_RACK_SERVICE_STARTED, EVENT_LCD_SCREEN_REFRESH, EVENT_ILO_ALIVE, EVENT_PERSONALITY_CHECK, EVENT_BLADE_POST_COMPLETE, EVENT_BLADE_SIGNATURE_CHANGED, EVENT_BLADE_PERSONALITY_CHANGED, EVENT_BLADE_TOO_LOW_POWER, EVENT_VIRTUAL_MEDIA_STATUS, EVENT_MEDIA_DRIVE_INSERTED, EVENT_MEDIA_DRIVE_REMOVED, EVENT_MEDIA_INSERTED, EVENT_MEDIA_REMOVED, EVENT_OA_NAMES, EVENT_OA_STATUS, EVENT_OA_UID, EVENT_OA_INSERTED, EVENT_OA_REMOVED, EVENT_OA_INFO, EVENT_OA_FAILOVER, EVENT_OA_TRANSITION_COMPLETE, EVENT_OA_VCM, EVENT_NETWORK_INFO_CHANGED, EVENT_SNMP_INFO_CHANGED, EVENT_SYSLOGCLEARED, EVENT
Jun 21 01:45:22 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:771: No bytes read
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:708: oh_ssl_read() part 1 failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:549: SSL connection failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:26 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:45:33 cz1 openhpid: oa_soap: oa_soap_oa_event.c:187: OA switching started
Jun 21 01:47:35 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection timed out
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call()
Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed
Jun 21 01:48:21 cz1 openhpid: oa_soap: oa_soap_event.c:491: OA
Thanks, Sathish
Original comment by: satishlagi
Hi Sathish,
Form the logs, look like this is connection problem with OA. Is OA accessible? If you are able to ping OA IP, try to run openhpid without my patch too. Actually patch does not do anything with OA switchover.
Regards, Shyamala
Original comment by: ShyamalaH
Shyamala,
OA is accessible. OA calls are failing during switchover, which as per my understanding is expected. My point was memory taken by openhpid is still growing.
At the end of all the switchover we see the logs: Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed
Thanks, Sathish
Original comment by: satishlagi
Hi Sathish,
The thing is earlier whatever error messages were seen, now it is totally different, like you mentioned that every time switchover is happening. The patch given was based on earlier error messages. I need some more infomration. 1) Why OA switchover is happening every time now? Any changes in setup or connection? 2) Are you able to see the messages related to server blade RPT (like Failed to get server RPT or power state etc)? If yes, is there any difference in memory consumption with patch?
Regards, Shyamala
Original comment by: ShyamalaH
The reason we are not seeing the same errors is, the logs I initially shared with you were from the production and I am not allowed to test patches in production. I am patching your code changes in my local machine and checking if there is no memory leak.
I am observing memory usage increase of openhpid, when ever I force a switch over of OA.
Thanks, Sathish
Original comment by: satishlagi
Ok got it. Yes you are right, re-discovery happens when there is OA Switchover. I will look into this further and update you.
Regards, Shyamala
Original comment by: ShyamalaH
Hi Sathish,
We are still working on the issue. Will get back to you with updates soon.
Regards, Shyamala
Original comment by: ShyamalaH
Thank you Shyamala, we are waiting for the fix. We are observing this issue in all the openhpid processes whenever there is a switchover.
Thanks, Sathish
Original comment by: satishlagi
Hi Satish,
Sorry for the delay in providing patch. PFA the patch for memory leak issue. Please apply patch, test it on your machine and let me know your observations.
Regards, Shyamala
Original comment by: ShyamalaH
Hi Shyamala,
Thank you for the patch.
These are the steps I followed to test the patch.
However, the memory increase is observed and it is in the same range as 3MB every OA switchover.
Thanks, Sathish
Original comment by: satishlagi
Hi Shyamala,
With 1960.patch, we have tested and ran openhpid with valgrind.
Command used: valgrind --leak-check=full --show-reachable=yes --error-limit=no --undef-value-errors=no --log-file=/tmp/memcheck.log /opt/bbee/thirdparty/sbin/openhpid -v -d -c /etc/openhpi.conf -p 4743 -f /opt/bbee-5.0.0-5/thirdparty/var/run/openhpid.pid.4743
The valgrind logs and process trace (verbose mode) logs are already shared. Each OA switchover is increasing 3 MB memory usage even with this patch.
Thanks and Regards,
Original comment by: anirbankind
Original comment by: ShyamalaH
Committed revision 7717.
Original comment by: ShyamalaH
Hi,
The memory of openhpid process running on my machine is building up a lot of memory (more than 10G). Along with the memory consumption, the following log is observed and the process is never coming out of the re-discovery loop.
Jun 8 00:23:10 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55 Jun 8 00:24:02 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55
The configuration file contains the following: OPENHPI_LOG_ON_SEV = "DEBUG" OPENHPI_EVT_QUEUE_LIMIT = 10000 OPENHPI_DEL_SIZE_LIMIT = 10000 OPENHPI_DEL_SAVE = "NO" OPENHPI_DAT_SIZE_LIMIT = 0 OPENHPI_DAT_USER_LIMIT = 0 OPENHPI_DAT_SAVE = "NO" OPENHPI_PATH = "openhpi path" OPENHPI_VARPATH = "lib_path"
OPENHPI_AUTOINSERT_TIMEOUT = 0 OPENHPI_AUTOINSERT_TIMEOUT_READONLY = "NO"
Any pointers/help would be great.
Thanks, Sathish
Reported by: satishlagi
Original Ticket: openhpi/bugs/1960