kontron / python-ipmi

A pure python IPMI library
GNU Lesser General Public License v2.1
185 stars 74 forks source link

Transactions retry unassigned variable and infinite loop #159

Closed canteuni closed 4 months ago

canteuni commented 10 months ago

https://github.com/kontron/python-ipmi/blob/8e83f6f1de1c774cc9bdce90955b9917674351d4/pyipmi/interfaces/rmcp.py#L590-L618

I found 2 issues that concerns the function _send_and_receive in interfaces/rmcp.py.

rx_data could be not assigned

What's wrong

If an IPMI device timed out multiple times (its firmware has an issue, it becomes not available on the network for whatever reason ...), a socket.timeout exception will be raised during self._send_ipmi_msg(tx_data) and we increment the retry counter.

Now if we do this until retry > self.max_retries, we will get out of the while loop and the rx_data will not be set, resulting in an exception during the final return rx_data[6:-1].

Moreover, we will loose the information that the IPMI device was unreachable because it timed out multiple times

Potential fix proposition

When we get out of the loop, we can check if retry > self.max_retries and raise a sort of MaxRetryExceeded or Timeout exception that can be catch elsewhere if needed.

We can also add some details in an "error" variable if we get inside the except socket.timeout block, and include this error variable inside the MaxRetryExceeded details at the end if we raise it.

    errors = ""
    ...
    except socket.timeout as exc:
        retry += 1
        errors += "error in _send_and_receive: " + exc.__str__()

if retry > self.max_retries:
    raise MaxRetryExceeded("custom exception", errors)

return rx_data[6:-1]

The transaction can infinite loop

What's wrong

I currently have a weird bug during a heavy load on pyipmi library where some of my IPMI requests results in a NetFn mismatch, sequence number mismatch and command id mismatch at the same time. The issue is not about having these mismatch, but the behavior of the transaction retry while received is False: :

Potential fix proposition

This while loop with the Queue workflow will only work for bridged message, any other message will not work because rx_data will be the same at every iteration. If the check rx_filter() fail once, it will fail every time forever.

We can add a counter just like self.max_retries and raise an exception if we exceed it, this way we will stop at a given time. We can also say that a message that is not a bridged message should not fail the rx_filter() check, thus we can immediately raise an exception :

if received is False and array('B', rx_data)[5] != constants.CMDID_SEND_MESSAGE:
    raise Exception("custom exception here")

I'm totally willing to contribute to this if needed, just not 100% sure what's the best approach / how do you want to address these. I'm also aware that these are quite hard to reproduce but I hope my explanations were clear enough to "see" the bug without having to test it.

hthiery commented 4 months ago

fixed by #160