open-hpi / openhpi

Other
6 stars 10 forks source link

openhpid process is consuming more than 10G #1960

Closed mohandev2 closed 6 years ago

mohandev2 commented 7 years ago

Hi,

The memory of openhpid process running on my machine is building up a lot of memory (more than 10G). Along with the memory consumption, the following log is observed and the process is never coming out of the re-discovery loop.

Jun 8 00:23:10 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed Jun 8 00:23:57 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55 Jun 8 00:24:02 cz2 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:923: resource RPT is NULL Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:809: Update server hot swap state failed Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_re_discover.c:176: Re-discovery of server blade failed Jun 8 00:24:11 cz2 openhpid: oa_soap: oa_soap_event.c:463: Re-discovery failed for OA 10.219.76.55

The configuration file contains the following: OPENHPI_LOG_ON_SEV = "DEBUG" OPENHPI_EVT_QUEUE_LIMIT = 10000 OPENHPI_DEL_SIZE_LIMIT = 10000 OPENHPI_DEL_SAVE = "NO" OPENHPI_DAT_SIZE_LIMIT = 0 OPENHPI_DAT_USER_LIMIT = 0 OPENHPI_DAT_SAVE = "NO" OPENHPI_PATH = "openhpi path" OPENHPI_VARPATH = "lib_path"

OPENHPI_AUTOINSERT_TIMEOUT = 0 OPENHPI_AUTOINSERT_TIMEOUT_READONLY = "NO"

Any pointers/help would be great.

Thanks, Sathish

Reported by: satishlagi

Original Ticket: openhpi/bugs/1960

mohandev2 commented 7 years ago

Original comment by: mohandev2

mohandev2 commented 7 years ago

Thank you for filing this bug Sathish. We need to fix this memory leak problem. Looks like there is some problem with the blade. More information on the bug could help. a. Which version of openhpi you are using? b. What is the version of OA firmware? c. If you know more details on the blade, please let's know. Looks like it has some unique problem that we are not handling well.

Original comment by: mohandev2

mohandev2 commented 7 years ago

Dr Mohan,

  1. OpenHPI version we are using is 3.6.1

  2. Pasting the output from ./hpiinv SAF HPI Version B.03.02 (Is this the one?) Some more info: FieldId[1] Product Name : TEXT : BladeSystem c7000 Enclosure G3 FieldId[2] Manufacturer : TEXT : HP FieldId[3] Product Versn : TEXT : 1.00 FieldId[4] Custom Field : TEXT : Telco_Status: TRUE FieldId[5] Custom Field : TEXT : Power_Type: EXTERNAL_DC FieldId[6] Custom Field : TEXT : ENCLOSURE STATUS: OK

  3. Along the memory we have seen errors like, a. Management processor status flipping (Screen shot attached for reference) - on one chassis b. Power related issue on another chassis, as in the below logs


Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5 Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5. Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:453: Failed to discover Server Blade Jun 8 10:22:13 cz2 openhpid: oasoap: oasoapdiscover.c:381: Discovery failed for active OA 10.193.69.5:443 Jun 8 10:25:34 cz2 openhpid: oasoap: oasoaputils.c:753: OA 10.193.69.5 has become Active Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:1616: Unknown Power State 1 detected for Blade at bay 5 Jun 8 10:26:29 cz2 openhpid: oasoap: oasoapdiscover.c:2325: Failed to get Server rpt for bay 5.


  1. Could it be same as "https://sourceforge.net/p/openhpi/bugs/1854/"

Just thinking aloud: Do we delete the resource-id from the RPTable, when the management processor is unresponsive? If so, the re-discovery infinite loop might have been entered just then, and thus the log in loop.

"Jun 8 00:23:57 cz2 openhpid: oasoap: oasoaprediscover.c:923: resource RPT is NULL"

Shouldn't the loop be exited when the rpt is NULL?

Thanks, Sathish

vista_oa_error1

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Sathish,

Thanks for the information provided.

From the logs, it looks like there is some problem with that particular blade. I need some more information regarding the server blade. 1) Which model of server blade is placed at bay 5 and what is the iLo Firmwrae version? 2) Did you do any iLo firmaware update on that server blade recently? 3) From when or any specific point, where this problem is seen?

I can suggest you to restart the openhpid and see.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Shyamala,

  1. Copying the info from ./hpiinv FieldId[1] Product Name : TEXT : ProLiant BL460c Gen8 FieldId[2] Manufacturer : TEXT : HP FieldId[3] Product Versn : TEXT : 2.40 Dec 02 2015 FieldId[4] Custom Field : TEXT : MP Model name = iLO4

  2. The last iLo FW update was on Dec 02, 2015

  3. We are seeing this issue from April.

As a temperory recovery mechanism we are restarting openhpid process if it crosses some limit. But, the process keeps recurring.

Also, can you help me in compiling the code, I am getting the following error:

make[4]: Leaving directory '/home/platform/openhpi-3.6.1/src' /bin/bash: LIBTOOL@: command not found Makefile:618: recipe for target 'sahpi_enum_utils.lo' failed make[3]: [sahpi_enum_utils.lo] Error 127 make[3]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils' Makefile:658: recipe for target 'all-recursive' failed make[2]: [all-recursive] Error 1 make[2]: Leaving directory '/home/platform/openhpi-3.6.1/src/utils' Makefile:603: recipe for target 'all-recursive' failed make[1]: [all-recursive] Error 1 make[1]: Leaving directory '/home/platform/openhpi-3.6.1/src' Makefile:496: recipe for target 'all' failed make: [all] Error 2

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Sathish,

Looks like this is the problem with the particular server blade. Just for pointing out exact reason, please help me with results by following below steps. 1) Remove server blade and run openhpid. See the results. 2) After some time, insert server blade back. Confirm blade is powered on & then see the problem still persists.

Coming to compilation part, from the error, looks like there is no libtool installed. Install libtool and then try to compile code. Which OS are you using?

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Sathish,

What is the version of OA firmware? If possible attach hpiinv result.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Shyamala,

It will take some time for providing the information, by pulling out the blade and restarting openhpid due to various reasons.

For hpiinv result, it has got a lot of private info, kindly share me your email-id so that can send the info privately.

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Sure.

My e-mail id is shyamala.hirepatt@hpe.com.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Shyamala,

Did you get a chance to go through the hpiinv output? You found anything interesting?

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Sathish,

I could not get much info from hpiinv log. For repeated re-discovery issue, you have to follow the steps mentioned above and get the results to figure out the exact reason.

For the memory leak issue, I have created & attached the patch. Please apply patch, test it and let me know the results.

Regards, Shyamala

1960.patch.txt

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

SaErrorT oh_add_resource(RPTable table, SaHpiRptEntryT entry, void *data, int owndata){ .. if (rptentry->data && rptentry->data != data && !rptentry->owndata) g_free(rptentry->data); rptentry->data = data; .. } rv = oh_add_resource(oh_handler->rptcache, &rpt, hotswap_state, 0);

In oh_add_resource(), the hot_swap_state pointer is stored as the data, in the RPTable. Won't de-allocing on success cause issues?

Original comment by: satishlagi

mohandev2 commented 7 years ago

From the logs provided, looks like an issue is caused by not freeing the memory in discover_server() in some error handling case. While going through code I found some more places where memory was not freed on success, not exactly causing this issue. So I covered those too.

Did you get chance to apply patch and see the results?

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Shyamala,

First of all, thank you for your support.

I have tested the patch multiple time to be sure. Every time there is a switchover of OA, the memory usage of openhpid is approximately increased by 3 MB. Following logs are printed during the same.

Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:370: could not find enum value "EVENT_NETWORK_LINK_STATUS" in "EVENT_HEARTBEAT, EVENT_ENC_STATUS, EVENT_ENC_UID, EVENT_ENC_SHUTDOWN, EVENT_ENC_INFO, EVENT_ENC_NAMES, EVENT_USER_PERMISSION, EVENT_ADMIN_RIGHTS_CHANGED, EVENT_ENC_SHUTDOWN_PENDING, EVENT_ENC_TOPOLOGY, EVENT_FAN_STATUS, EVENT_FAN_INSERTED, EVENT_FAN_REMOVED, EVENT_FAN_GROUP_STATUS, EVENT_THERMAL_STATUS, EVENT_COOLING_STATUS, EVENT_FAN_ZONE_STATUS, EVENT_PS_STATUS, EVENT_PS_INSERTED, EVENT_PS_REMOVED, EVENT_PS_REDUNDANT, EVENT_PS_OVERLOAD, EVENT_AC_FAILURE, EVENT_PS_INFO, EVENT_PS_SUBSYSTEM_STATUS, EVENT_SERVER_POWER_REDUCTION_STATUS, EVENT_INTERCONNECT_STATUS, EVENT_INTERCONNECT_RESET, EVENT_INTERCONNECT_UID, EVENT_INTERCONNECT_INSERTED, EVENT_INTERCONNECT_REMOVED, EVENT_INTERCONNECT_INFO, EVENT_INTERCONNECT_HEALTH_LED, EVENT_INTERCONNECT_THERMAL, EVENT_INTERCONNECT_CPUFAULT, EVENT_INTERCONNECT_POWER, EVENT_INTERCONNECT_PORTMAP, EVENT_BLADE_PORTMAP, EVENT_INTERCONNECT_VENDOR_BLOCK, EVENT_INTERCONNECT_HEALTH_STATE, EVENT_DEMO_MODE, EVENT_BLADE_STATUS, EVENT_BLADE_INSERTED, EVENT_BLADE_REMOVED, EVENT_BLADE_POWER_STATE, EVENT_BLADE_POWER_MGMT, EVENT_BLADE_UID, EVENT_BLADE_SHUTDOWN, EVENT_BLADE_FAULT, EVENT_BLADE_THERMAL, EVENT_BLADE_INFO, EVENT_BLADE_MP_INFO, EVENT_ILO_READY, EVENT_LCD_BUTTON, EVENT_KEYING_ERROR, EVENT_ILO_HAS_IPADDRESS, EVENT_POWER_INFO, EVENT_LCD_STATUS, EVENT_LCD_INFO, EVENT_REDUNDANCY, EVENT_ILO_DEAD, EVENT_RACK_SERVICE_STARTED, EVENT_LCD_SCREEN_REFRESH, EVENT_ILO_ALIVE, EVENT_PERSONALITY_CHECK, EVENT_BLADE_POST_COMPLETE, EVENT_BLADE_SIGNATURE_CHANGED, EVENT_BLADE_PERSONALITY_CHANGED, EVENT_BLADE_TOO_LOW_POWER, EVENT_VIRTUAL_MEDIA_STATUS, EVENT_MEDIA_DRIVE_INSERTED, EVENT_MEDIA_DRIVE_REMOVED, EVENT_MEDIA_INSERTED, EVENT_MEDIA_REMOVED, EVENT_OA_NAMES, EVENT_OA_STATUS, EVENT_OA_UID, EVENT_OA_INSERTED, EVENT_OA_REMOVED, EVENT_OA_INFO, EVENT_OA_FAILOVER, EVENT_OA_TRANSITION_COMPLETE, EVENT_OA_VCM, EVENT_NETWORK_INFO_CHANGED, EVENT_SNMP_INFO_CHANGED, EVENT_SYSLOGCLEARED, EVENT Jun 21 01:45:22 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed Jun 21 01:45:22 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:771: No bytes read Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:708: oh_ssl_read() part 1 failed Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:549: SSL connection failed Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed Jun 21 01:45:24 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:45:24 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed Jun 21 01:45:26 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection refused Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:45:26 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed Jun 21 01:45:33 cz1 openhpid: oa_soap: oa_soap_oa_event.c:187: OA switching started Jun 21 01:47:35 cz1 openhpid: ssl: oh_ssl.c:532: Socket connect failed with error: Connection timed out Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:653: oh_ssl_connect() failed Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_callsupport.c:1099: failed to communicate with OA during soap_call() Jun 21 01:47:35 cz1 openhpid: oa_soap: oa_soap_utils.c:706: Get OA status call failed Jun 21 01:48:21 cz1 openhpid: oa_soap: oa_soap_event.c:491: OA is accessible Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_oa_event.c:339: OA switching completed

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Sathish,

Form the logs, look like this is connection problem with OA. Is OA accessible? If you are able to ping OA IP, try to run openhpid without my patch too. Actually patch does not do anything with OA switchover.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Shyamala,

OA is accessible. OA calls are failing during switchover, which as per my understanding is expected. My point was memory taken by openhpid is still growing.

At the end of all the switchover we see the logs: Jun 21 01:49:08 cz1 openhpid: oa_soap: oa_soap_re_discover.c:164: Re-discovery started Jun 21 01:51:10 cz1 openhpid: oa_soap: oa_soap_re_discover.c:253: Re-discovery completed

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Sathish,

The thing is earlier whatever error messages were seen, now it is totally different, like you mentioned that every time switchover is happening. The patch given was based on earlier error messages. I need some more infomration. 1) Why OA switchover is happening every time now? Any changes in setup or connection? 2) Are you able to see the messages related to server blade RPT (like Failed to get server RPT or power state etc)? If yes, is there any difference in memory consumption with patch?

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

The reason we are not seeing the same errors is, the logs I initially shared with you were from the production and I am not allowed to test patches in production. I am patching your code changes in my local machine and checking if there is no memory leak.

  1. Switching over OA is my test case - to test if no memory leak is happening during switch over. Switchover of OA, is one of the cases, where re-discovery happens, if I am not mistaken.

I am observing memory usage increase of openhpid, when ever I force a switch over of OA.

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Ok got it. Yes you are right, re-discovery happens when there is OA Switchover. I will look into this further and update you.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Hi Sathish,

We are still working on the issue. Will get back to you with updates soon.

Regards, Shyamala

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Thank you Shyamala, we are waiting for the fix. We are observing this issue in all the openhpid processes whenever there is a switchover.

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Satish,

Sorry for the delay in providing patch. PFA the patch for memory leak issue. Please apply patch, test it on your machine and let me know your observations.

Regards, Shyamala

1960.patch.txt

Original comment by: ShyamalaH

mohandev2 commented 7 years ago

Hi Shyamala,

Thank you for the patch.

These are the steps I followed to test the patch.

  1. svn checkout http://svn.code.sf.net/p/openhpi/code/openhpi/trunk opehpid-test
  2. cd opehpid-test && ./bootstrap
  3. cd plugins && patch -p1 -i 1960.patch -- Patch applied without any errors
  4. ./configure --enable-oa_soap --enable-encryption --enable-watchdog && make
  5. Used the binary in openhpid-test/openhpid/.libs and replaced it on the destination machine. Confirmed with md5sums
  6. Restarted the openhpid
  7. Switched over OA multiple times.

However, the memory increase is observed and it is in the same range as 3MB every OA switchover.

Thanks, Sathish

Original comment by: satishlagi

mohandev2 commented 7 years ago

Hi Shyamala,

With 1960.patch, we have tested and ran openhpid with valgrind.

Command used: valgrind --leak-check=full --show-reachable=yes --error-limit=no --undef-value-errors=no --log-file=/tmp/memcheck.log /opt/bbee/thirdparty/sbin/openhpid -v -d -c /etc/openhpi.conf -p 4743 -f /opt/bbee-5.0.0-5/thirdparty/var/run/openhpid.pid.4743

The valgrind logs and process trace (verbose mode) logs are already shared. Each OA switchover is increasing 3 MB memory usage even with this patch.

Thanks and Regards,

Original comment by: anirbankind

mohandev2 commented 6 years ago

Original comment by: ShyamalaH

mohandev2 commented 6 years ago

Committed revision 7717.

Original comment by: ShyamalaH