openhpi2 / openhpi_bug_test

0 stars 0 forks source link

Polling SEL fails when reservation lost #1670

Open openhpi2 opened 13 years ago

openhpi2 commented 13 years ago

When polling the SEL, a check is made to see if the SEL Info data has changed. If not, no reading of the SEL is done. If it has changed, the SEL is read to capture the new events. Then, if during reading the SEL the reservation is lost, it goes back and reads the SEL Info again. At this point, it no longer sees a change - because it compares with the data just read. As a result, the new events on the SEL are not read and processed until the SEL Info record changes yet again (i.e., when the next event is generated).

In addition, the function that reads the SEL is very inefficient by re-reading the entire SEL every time it changes, instead of trying to read just the new entries.

I have prototyped changes that fix this problem, and avoids re-reading the entire SEL on each update, which can be committed after the 3.0 release is out.

Reported by: davidmckinley

openhpi2 commented 13 years ago

If I recall right the implementation clears SEL after reading. It makes the things little bit more efficient.

Original comment by: avpak

openhpi2 commented 13 years ago

Maybe it is supposed to do that, but it doesn't when I run it. In general, clearing the SEL after every event would not be a good idea, anyway; there may be other users of the IPMI data that would not want the SEL regularly cleared. It could also cause problems with wear-out of the flash memory.

Original comment by: davidmckinley

openhpi2 commented 13 years ago

Original comment by: davidmckinley

openhpi2 commented 9 years ago

Thanks to David for the following comments. This is the biggie. The logic in ipmi_sel.cpp has several problems, but the most severe is that it does not handle a reservation-lost error properly when reading the SEL. This is a perfectly normal condition, which can happen anytime you are reading the SEL. It happens quite often, in fact, with some Sun servers, which is how I came across it.

To explain the problem, I am referencing the OpenHPI 3.0 code in ipmi_sel.cpp - though I don't think it will have changed much, if any, in any recent versions.

Imagine that a new entry has been placed on the SEL, and then a call is made to cIpmiSel::ReadSel() to poll the SEL. Early in that function (around line #298 in ipmi_sel.cpp), there is a call to GetInfo(). GetInfo() will do an IPMI command, eIpmiCmdGetSelinfo, and compare data in the returned message for number of entries, last addition timestamp, and last erase timestamp, with values read on a previous call (lines #160-168). If they are unchanged, it returns -1, indicating that there are no new events to read. In our case, since there was a new event added, it will return SA_OK.

Note that GetInfo() also sets member variables equal to the values just read, so that on the next call, it will compare against these latest values (lines #147-155 and #170-171).

After the call to GetInfo(), the ReadSel() function will call Reserve() to reserve the SEL, and then start reading records from the SEL in the loop at lines #321-343. If, during this process, there is an error return, "eIpmiCcInvalidReservation", meaning the reservation was lost, then it breaks out of this loop (lines #334-335), and since it has not read to the end of the SEL, continues the outer loop which retries processing the whole SEL (that is, it does not break out of the outer loop at lines #345-346).

This results in a new call to GetInfo(), which this time very likely returns -1, because there has not been any change to the SEL since the last iteration. Just because someone took the reservation does not mean that the SEL has been changed, or will be changed again. Thus, ReadSel() returns reporting that there are no new SEL entries, and will continue to do so on all subsequent polls until another entry is added to the SEL.

The net result is that events can be delayed being reported for a long time, until some subsequent event is generated.

A possible fix for this problem is to create a number of local variables in the ReadSel() function which can be used to reset the member variables updated by GetInfo() to what they were when ReadSel() was first called in the case of a retry. This way, each retried call to GetInfo() will correctly compare the current state with what it was when the poll was initiated.

Specifically, this code at the beginning of ReadSel():

GList cIpmiSel::ReadSel( unsigned int &num, bool &uptodate ) { SaErrorT rv = SA_OK; GList new_events = 0; num = 0; int fetch_retry_count = 0; uptodate = false;

while( true ) { if ( fetch_retry_count >= dMaxSelFetchRetries ) { stdlog << "too many lost reservations in SEL fetch !\n"; return 0; }

   fetch_retry_count++;

Can be changed to this:

cIpmiSel::ReadSel( unsigned int &num, bool &uptodate ) { SaErrorT rv = SA_OK; GList *new_events = 0; num = 0; int fetch_retry_count = 0; uptodate = false;

// Save current sel object state, to restore if we need to retry bool overflow = m_overflow; unsigned short entries = m_entries; unsigned int last_addition_timestamp = m_last_addition_timestamp; unsigned int last_erase_timestamp = m_last_erase_timestamp; bool fetched = m_fetched; bool sels_changed = m_sels_changed;

while( true ) { if ( fetch_retry_count ) { m_overflow = overflow; m_entries = entries; m_last_addition_timestamp = last_addition_timestamp; m_last_erase_timestamp = last_erase_timestamp; m_fetched = fetched; m_sels_changed = sels_changed;

         if ( fetch_retry_count >= dMaxSelFetchRetries )
            {
               stdlog << "too many lost reservations in SEL fetch ! " << m_mc->Addr() << "\n";
               return 0;
            }
      }

   fetch_retry_count++;

I would say that this is the minimum that needs to be done to fix the problem. However, there are a couple of related issues that make this a less than satisfactory fix, in my opinion. The current implementation seems almost amazingly inefficient. A cache of the entire SEL is kept in a list (member variable m_sel). But, whenever a new entry is added (detected by GetInfo(), as described above), then the entire SEL is re-read from the device, and a record-by-record comparison is made with the cached copy to see if any of the newly read entries differ from the cached entries. This seems to be based on an idea that new records could be found anywhere within the SEL, rather than just at the end of it.

While I guess that is an extremely conservative approach, it overlooks the fact that reading the entire SEL can, in some systems, be a very lengthy operation - potentially taking tens of seconds if the SEL is large. In part to alleviate this situation, perhaps, if a system is an ATCA system, there is code that clears the SEL upon startup. This is not done for any system type other than ATCA, however, and even for ATCA, the SEL could potentially grow large if it ran for a long time.

Even worse, in the case of lost reservations, each retry attempts to re-read the entire SEL. If the system is active - this could lead to a situation where it is very difficult to read the SEL, because no reservation will last long enough.

ELIMINATING CONSTANT RE-READING OF ENTIRE SEL

To fully understand this issue, as well as a possible solution, it is useful to review a little bit more of the processing within ipmi_sel.cpp. The starting place for reading new events is the cIpmiSel::GetEvents() function. This is called at various times; most notably, it is called periodically (every few seconds) out of the cIpmiMcThread::ReadSel() function (located in ipmi_discover.cpp, lines #714-731).

When GetEvents() is called, it calls ReadSel() and expects to either have the 'uptodate' Boolean parameter set to 'true', indicating no new events were detected, or else a list returned that represents the entire list of events in the SEL. In other words, ReadSel() either returns nothing, or it reads and returns the entire SEL content. When a list is returned, it is compared item-by-item with the previously saved set of events in m_sel and m_async_events (which holds things like hotswap events that are normally not found in the SEL), and creates a new list with only those SEL events that were not found in these existing lists.

After this comparison, then the m_sel cached list is replaced by the list returned by ReadSel(), and the created list of new events is returned to the caller.

To avoid this repeated re-reading of the entire SEL, a possible solution involves changing the signature of the ReadSel() function to add an additional Boolean flag that lets ReadSel() signal back whether it is returning the entire SEL, or only newly added entries on the SEL. With the list returned from ReadSel(), the comparison process continues in GetEvents() just as it did before - although we would expect that if only new events are returned, no matches will be found. This seems to be a reasonable safety check to be sure that no events are doubly processed, and since only new events will normally be returned anyway, it will not be much overhead.

After this comparison process, then the 'full_list' flag set by ReadSel() is checked. If it is 'true', then just as before, the entire cached list is replaced. If it is 'false', though, then the newly read and returned events are concatenated onto the existing cached list.

With this improvement in the processing within GetEvents(), the processing in ReadSel() can be much improved.

First, after learning that the SEL has changed via GetInfo(), instead of unconditionally reading the entire SEL starting at record 0, ReadSel() can re-read whatever was the last record read. After this record is re-read, it is compared with the cached copy of that record, to see if it is what it was expected to be. Normally it should be, but if the SEL was cleared since the last poll, it might be that it will not match.

If the re-read of the previously last record read does match, then it is safe to assume that the entire SEL up to that point is unchanged, and only the newly added records after that point need to be read. Thus, that is what ReadSel() proceeds to do, and then it returns only the newly read records, along with the flag indicating that less than the full list is being returned.

If the re-read of the previously last record read does not match, or if any sort of error occurs (other than a lost reservation), then ReadSel() falls back to reading the entire SEL, and returns that entire list, with the flag set indicating that the full list is being returned. Thus, all special cases continue to be handled just as they always have. Only the 'normal' case of simply reading new entries added to an intact SEL are now handled in the new way. But, this eliminates virtually all of the unnecessary re-reading of the SEL, and vastly decreases overhead and latency associated with polling the SEL.

Furthermore, the most likely case of a lost reservation error is when a lot of events are being read from the SEL. In this case, similar processing can be done to pick up where we previously left off in reading the SEL when the reservation was lost. As with the case described above, the first thing that needs to be done is to re-read the last record read successfully before the lost-reservation error, and make sure that it is still there and unchanged. If it is, then ReadSel() should be able to continue reading with the next record. If not, then all of the records read before the reservation was lost should be discarded, and the entire SEL can be read.

As you might imagine from the descriptions above, this sort of update requires quite a few changes, especially in the 'ReadSel()' function. Rather than try to copy each one out, I'm attaching a reference copy of an ipmi_sel.cpp module, and a corresponding ipmi_sel.h module that include all of these described changes, along with changes that implement the previously described minimum fix for bug 1670, and the auto-clear function described below.

AUTO-CLEAR FUNCTION

A final problem with the SEL processing is that there is no generalized capability for clearing the SEL when it gets full. It is not 100% required that there be such a function; some people may think that clearing the SEL is something that ought to be a manual operation, or overseen by higher-level management software, rather than built-in to OpenHPI. But, event management will fail if the IPMI SEL becomes full, so it seems prudent to at least offer an option. It also turns out to be very simple to do so.

As noted above, this problem is partially handled for ATCA systems by some code in ipmi_mc.cpp (lines #272-276) which will clear the SEL of an ATCA system during discovery operations. This does not do anything for non-ATCA systems, however, and is not a complete solution even for ATCA, as a SEL could fill up after this initial clear operation.

A possible solution is to add a configuration option to the ipmidirect stanza in the openhpi.conf file: "selAutoClear = "XXX", where "XXX" is the maximum size to which the SEL should grow. That is, if there are more than this many records in the SEL, then it should be automatically cleared when it is polled (after reading in all the new records, of course). I would also suggest providing a mechanism for setting different values for specific SELs, if there are multiple SELs in different IPMI controllers in a system, for which different sizes are appropriate.

All of the new code required to handle this option can be included in ipmi_sel.cpp. Most of it is what is needed to read and store the new option, added to the cIpmiSel::cIpmiSel() constructor function:

cIpmiSel::cIpmiSel( cIpmiMc _mc, unsigned int lun ) : m_mc( mc ), m_lun( lun ), m_major_version( 0 ), m_minor_version( 0 ), m_entries( 0 ), m_last_addition_timestamp( 0 ), m_last_erase_timestamp( 0 ), m_overflow( false ), m_supports_delete_sel( false ), m_supports_partial_add_sel( false ), m_supports_reserve_sel( false ), m_supports_get_sel_allocation( false ), m_fetched( false ), m_reservation( 0 ), m_sels_changed( false ), m_sel( 0 ), m_sel_num( 0 ), m_async_events( 0 ), m_async_events_num( 0 ), m_autoclear( 0 ) { GHashTable config; if( (config = mc->Domain()->GetHandler()->config) != 0 ) { char keyword[20]; snprintf( keyword, sizeof(keyword), "selAutoClear0x%02x", mc->Addr().m_slaveaddr ); const char* value = (const char)g_hash_table_lookup( config, keyword ); if( value ) { m_auto_clear = strtol( value, 0, 0 ); stdlog << keyword << " = " << m_autoclear << "\n"; } else { value = (const char)g_hash_table_lookup( config, "selAutoClear" ); if( value ) m_auto_clear = strtol( value, 0, 0 ); stdlog << "selAutoClear = " << m_auto_clear << "\n"; } } }

Other than this code, all that is required to actually do the auto-clear is one line of code added very near the end of the ReadSel() function.

Where this test is used to determine that the last record of the SEL has been read:

   if ( next_rec_id == 0xffff )
        break;

Instead, insert:

   if ( next_rec_id == 0xffff )
   {
       if( m_auto_clear && m_entries >= m_auto_clear ) ClearSel();
        break;
   }

Original comment by: dr_mohan