Closed openhpi2 closed 8 years ago
The main thread creates three more threads to handle discovery, evtget and evtpop. These four threads run for the lifetime of the daemon. In addition, the main thread creates service threads to handle client connections; these are created as needed and exit once their job is done. When the user wants to stop the daemon, they send a stop signal. Only the main thread receives the signal, so it has to inform the other threads and make sure the program terminates properly. Ideally, the service threads close first, followed by the event threads, then the discovery thread, and finally the main thread.
The main thread opens the plugins and closes them. The discovery thread discovers the resources. The event threads manage the events. Open, close, discover and many other functions are defined at the plugin level. The oa_soap plugin does not use the infrastructure event threads; instead it creates two event threads of its own (created from the discovery thread). The creator thread (the infrastructure discovery thread) cannot wait for these two threads, so the main thread waits for them in the close call.
The discovery thread calls the discover function every three minutes, but only the close call in the plugin learns of the stop signal, through the main thread, so the stop request has to be communicated to the other threads in the plugin. The close call (main thread) frees all the resources, so it has to stop all the plugin threads before freeing them. It can effectively stop the threads created by the plugin, but not the infrastructure threads. So it needs a way to know that the other three infrastructure threads cannot enter the plugin before it frees the resources.
One option is an ABI function that communicates the stop signal to the plugins before close is called. The infrastructure could inform the plugins that a signal was received, go ahead and close its own threads, and then call close. The plugin could stop all its threads as soon as the infrastructure sends the signal, and free the resources when close is called, after verifying that all the threads have exited.
Another option is to ensure that no infrastructure thread is executing plugin code when the resources are freed. This could be done with variables. The discovery function could set a plugin-specific variable as soon as it enters discovery and unset it when discovery is done. oa_soap_close, after setting shutdown_event_thread, could wait for the event threads and then wait for the variable to be unset. Once the variable is unset, assume that discovery will not be entered again and free the plugin resources.
Please take a look at the code and see how we could be sure that we are out of discovery before freeing the resources.
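A rough sketch of the variable-based idea, not taken from the oa_soap code. The struct and function names (plugin_state, discovery_enter, discovery_leave, request_shutdown, safe_to_free) are hypothetical. It uses a counter instead of a plain flag, and discovery re-checks the shutdown request after announcing itself, so close and discovery cannot both miss each other.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-plugin state; NOT the real oa_soap handler layout. */
struct plugin_state {
    atomic_bool shutdown_requested; /* set once by the close call */
    atomic_int  in_discovery;       /* threads currently inside discovery */
};

/* Call at the top of the discover function.
 * Returns false if shutdown has started and discovery must not proceed. */
static bool discovery_enter(struct plugin_state *st)
{
    /* Announce ourselves first, then look at the shutdown flag. */
    atomic_fetch_add(&st->in_discovery, 1);
    if (atomic_load(&st->shutdown_requested)) {
        atomic_fetch_sub(&st->in_discovery, 1);
        return false;
    }
    return true;
}

/* Call on every exit path of the discover function. */
static void discovery_leave(struct plugin_state *st)
{
    atomic_fetch_sub(&st->in_discovery, 1);
}

/* Call from close before waiting for the plugin threads. */
static void request_shutdown(struct plugin_state *st)
{
    atomic_store(&st->shutdown_requested, true);
}

/* Close polls this (for example in a short sleep loop) before freeing. */
static bool safe_to_free(struct plugin_state *st)
{
    return atomic_load(&st->shutdown_requested) &&
           atomic_load(&st->in_discovery) == 0;
}
```

This only helps if the state itself outlives the resources being freed. If it lives inside the handler that close frees, a later discover call would still touch freed memory, so the flag and counter would have to be kept somewhere that stays valid until the infrastructure threads are gone.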
Original comment by: dr_mohan
In addition, discovery takes a long time, so the shutdown_event_thread variable could be checked between major component discoveries, for example after the blades are discovered, after the interconnects are discovered, after the power supplies, and so on. This would also shorten the shutdown time when a discovery is in progress.
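Something along these lines could work for the per-component check. This is only a sketch: the phase functions, demo_handler and run_discovery are made-up stand-ins, not the real oa_soap discovery routines, and shutdown_requested stands in for whatever flag the plugin ends up using.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the plugin's shutdown flag (assumption). */
static atomic_bool shutdown_requested = false;

/* Hypothetical handler that just counts discovered component classes. */
struct demo_handler {
    int components_discovered;
};

typedef int (*discover_phase_fn)(struct demo_handler *h);

/* Hypothetical stand-ins for the major component discoveries. */
static int discover_blades(struct demo_handler *h)         { h->components_discovered++; return 0; }
static int discover_interconnects(struct demo_handler *h)  { h->components_discovered++; return 0; }
static int discover_power_supplies(struct demo_handler *h) { h->components_discovered++; return 0; }

static const discover_phase_fn PHASES[] = {
    discover_blades,
    discover_interconnects,
    discover_power_supplies,
};

enum { DISCOVERY_DONE = 0, DISCOVERY_ABORTED = 1 };

/* Run the phases in order, checking the shutdown flag before each one.
 * Returns DISCOVERY_DONE, DISCOVERY_ABORTED, or the first phase error (< 0). */
static int run_discovery(struct demo_handler *h,
                         const discover_phase_fn *phases, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (atomic_load(&shutdown_requested))
            return DISCOVERY_ABORTED;
        int rv = phases[i](h);
        if (rv != 0)
            return rv;
    }
    return DISCOVERY_DONE;
}
```

With a check before each phase, the worst-case delay on shutdown is the duration of the single longest phase rather than the whole discovery.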
Original comment by: dr_mohan
Yet another thing: an OA switchover takes a long time to complete, so we have big blocks of sleep (OA_STABILIZE_MAX_TIME) before re-discovery. We need a loop that sleeps in small chunks and checks OA_SOAP_CHEK_SHUTDOWN_REQ between them. If a shutdown has been requested, we could return immediately. A chunk could be at most 10 seconds. All the other big sleep statements also need to be in a loop for the same reason.
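The chunked sleep could look roughly like this. It is a sketch under assumptions: sleep_unless_shutdown and shutdown_requested are hypothetical names, and the flag check stands in for the OA_SOAP_CHEK_SHUTDOWN_REQ test in the plugin.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for the plugin's shutdown flag (assumption). */
static atomic_bool shutdown_requested = false;

/* Sleep for up to total_sec seconds, in chunks of at most chunk_sec,
 * checking the shutdown flag before every chunk.
 * Returns true if the full time elapsed, false if shutdown was requested. */
static bool sleep_unless_shutdown(unsigned int total_sec, unsigned int chunk_sec)
{
    unsigned int remaining = total_sec;

    if (chunk_sec == 0)
        chunk_sec = 1; /* avoid a busy loop on a zero chunk size */

    while (remaining > 0) {
        if (atomic_load(&shutdown_requested))
            return false;
        unsigned int step = remaining < chunk_sec ? remaining : chunk_sec;
        /* sleep() returns the unslept seconds if a signal interrupted it;
         * only count the time actually slept, then re-check the flag. */
        unsigned int left = sleep(step);
        remaining -= step - left;
    }
    return !atomic_load(&shutdown_requested);
}
```

A call such as sleep_unless_shutdown(OA_STABILIZE_MAX_TIME, 10) would then replace the single long sleep, and the caller returns early when it gets false.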
Original comment by: dr_mohan
This patch consistently reproduces the problem
--- plugins/oa_soap/oa_soap_discover.c (revision 7624)
+++ plugins/oa_soap/oa_soap_discover.c (working copy)
@@ -333,6 +333,8 @@
 * If the thread_handler is not NULL, then the event threads are
 * already created and skip the event thread creation
 */
Original comment by: dr_mohan
Fixed with checkin #7629.
Original comment by: dr_mohan
kill openhpid_pid does kill the threads in 20 seconds or so. But sometimes it generates a core file. The core file generated has the following information:
Core was generated by `openhpid -c /etc/openhpi/openhpi.conf.oa_ccc2_10_decrypted'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f2b1c7261f7 in oa_soap_discover_resources (oh_handler=0x19ca080) at oa_soap_discover.c:212
212 g_mutex_lock(oa_handler->mutex);
Missing separate debuginfos, use: debuginfo-install openhpi-3.4.0-1.x86_64
(gdb) backtrace
#0 0x00007f2b1c7261f7 in oa_soap_discover_resources (oh_handler=0x19ca080) at oa_soap_discover.c:212
#1 0x00000000004150f8 in oh_discovery ()
#2 0x0000000000422e4c in discovery_func ()
#3 0x00000032f5e62074 in ?? () from /lib64/libglib-2.0.so.0
#4 0x00000032f4a077f1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000032f46e570d in clone () from /lib64/libc.so.6
(gdb) quit
Looks like oa_handler->mutex is NULL or a corrupt pointer. Assigning a lower priority as it happens rarely and is observed only when exiting.
Reported by: dr_mohan