Node Stall Ready check does not perform search in stall queue as intended

BLuedtke commented 1 year ago

When sending a message via a node, first bidib determines if the node is ready to send.
This involves, among other things, checking whether the node is stalled, or, whether a supernode of the node is stalled. The stall check is performed in a function here.

If the node or a supernode is stalled, the linked function returns false. A node is stalled if the member "stall" is true and the address_stack (input param to linked fct) is NOT contained in the member stall_affected_nodes_queue.
Thus, the linked fct attempts to find the addr_stack in the queue stall_affected_nodes_queue by using g_queue_find, see line 114. As the addr_stack is an array of 4 uint8_t's, a custom comparison function is used, as the comparison shall be element-wise, not a pointer comparison. This comparator is called bidib_node_stall_queue_entry_equals (see line 99ff.).
The call to g_queue_find looks as follows: !g_queue_find(state->stall_affected_nodes_queue, bidib_node_stall_queue_entry_equals).
Note that the element to find is not actually given here! This code just always returns true, the comparator is never called. g_queue_find is supposed to be called with the queue and the element to find (see here). To use a custom comparator, we need to use g_queue_find_custom (see here). It shall be called like this: !g_queue_find_custom(state->stall_affected_nodes_queue, addr_stack, (GCompareFunc)bidib_node_stall_queue_entry_equals).

The unit tests related to this part (bidib_send_tests) pass as the "stall" member is apparently sufficient to indicate whether the node is stalled or not. However, there's probably a reason for the existence of the stall_affected_nodes_queue, so this comparison call should be fixed.

BLuedtke commented 1 year ago

Admittedly, I'm a bit confused as to why the check for presence of addr_stack in stall_affected_nodes_queue is negated.
With what happens in the if-body, aka the addr_stack being added to said queue, a second call to bidib_node_stall_ready will return true, even though the node is still stalled - since the address is now present in the queue.

Only one thing prevents bidib from (incorrectly) sending a message over a stalled node: All invocations of bidib_node_stall_ready also check that the node's message queue is empty. However, in function bidib_node_try_send, a message is added to this queue when the node is stalled (i.e., on the first call to bidib_node_stall_ready where it'll return false as expected). In the next attempt to send a message (via bidib_node_try_send), the node is recognized as "not ready" purely because the message queue is not empty.

I can only guess the original intention. One possibility is that this should've been two checks: One check for the "stall" member of the node_state. Then, if the node is stalled, but the address of the stalled node or supernode is not yet present in the stall_affected_nodes_queue, add the address to the queue.

eyip002 commented 1 year ago

Reference to MSG_STALL in the bidib protocol: http://bidib.org/protokoll/bidib_general_messages.html

This line looks suspicious, should probably be message[data_index]: https://github.com/uniba-swt/libbidib/blob/9f3942731501f0c70d433886098d5e0b0c9d4bab/src/transmission/bidib_transmission_receive.c#L353

The idea of stall_affected_nodes_queue was probably to implement a way to check whether a super-node is has been stalled when trying to send a command to one of its sub-nodes.

When a MSG_STALL is received, bidib_node_update_stall(addr_stack, stall_status) is called. The addr_stack contains the address of the node that sent MSG_STALL and also the addresses of its super-nodes. Because addr_stack is a fixed array of size 4, the unused elements above the top of the stack are represented by 0x00. If a node determines that it needs to stall to avoid overwhelming its sub-nodes, then we should pause any sending of commands to the node and its sub-nodes.

Take the function discussed above in https://github.com/uniba-swt/libbidib/issues/28#issue-1621654175: https://github.com/uniba-swt/libbidib/blob/9f3942731501f0c70d433886098d5e0b0c9d4bab/src/transmission/bidib_transmission_node_states.c#L108-L130

When a stalled node is found, the original addr_stack is pushed into the node's stall_affected_nodes_queue: https://github.com/uniba-swt/libbidib/blob/9f3942731501f0c70d433886098d5e0b0c9d4bab/src/transmission/bidib_transmission_node_states.c#L119

Otherwise, we reach the for-loop that zeros out the top of the addr_cpy. The copy of the address stack now addresses a super-node closer to the root node (addr_cpy[0] is the root node): https://github.com/uniba-swt/libbidib/blob/9f3942731501f0c70d433886098d5e0b0c9d4bab/src/transmission/bidib_transmission_node_states.c#L122-L127 The outer while-loop repeats until a stalled node is found or all nodes in addr_stack have been considered.

As a result, the stall_affected_nodes_queue of a node may contain an address to itself and/or addresses to its sub-nodes. When a node is no longer stalled, we should try and resume the sending of commands to its sub-nodes and itself: https://github.com/uniba-swt/libbidib/blob/9f3942731501f0c70d433886098d5e0b0c9d4bab/src/transmission/bidib_transmission_node_states.c#L232-L237

uniba-swt / libbidib

Node Stall Ready check does not perform search in stall queue as intended #28