ActionId of Commands with > 1 response not traced correctly

BLuedtke commented 1 year ago

When a node sends a command, and the recipient of the command is expected to reply with a message of a certain type ("response type"), we save that a reply of a certain type addressed to the sending node is to be expected. This "saving" happens by adding an entry into the so-called "response queue". A response queue entry holds the response-type to be expected, the time it was created, and the action id of the sent command.

An example for how this is shown in the log can be seen below:

libbidib: Send to: 0x00 0x00 0x00 0x00 seq: 47 type: MSG_CS_DRIVE (0x64) action id: 242 libbidib: Message bytes: 0x0c 0x00 0x2f 0x64 0x06 0x00 0x03 0x01 0x80 0x00 0x00 0x00 0x00 libbidib: Received packet libbidib: Received from: 0x00 0x00 0x00 0x00 seq: 104 type: MSG_CS_DRIVE_ACK (0xe2) action id: 242 libbidib: Message bytes: 0x06 0x00 0x68 0xe2 0x06 0x00 0x01 libbidib: Feedback for action id 242: Train: cargo_green acknowledgement level: 1

A message of type MSG_CS_DRIVE is sent, its action id is 242. Later, a message of type MSG_CS_DRIVE_ACK is received, which matches the expected response type. It is thus traced to be related to action id 242, which is logged accordingly. In the operation of matching the received response against the expected response, i.e. the head of the response queue, the first entry of the response queue is consumed.

The functionality/strategy outlined above works great for commands that result in at most one expected reply. However, there are commands which may result in one or two responses. For example: The command to set a point (an accessory) to a certain state may cause one or two replies of type MSG_ACCESSORY_STATE. This can be seen in the shortened log below:

libbidib: Switch point: point4 on board: onecontrol3 (0x0c 0x00 0x00 0x00) to aspect: normal (0x01) with action id: 243 libbidib: Send to: 0x0c 0x00 0x00 0x00 seq: 10 type: MSG_ACCESSORY_SET (0x38) action id: 243 [...] libbidib: Received packet libbidib: Received from: 0x0c 0x00 0x00 0x00 seq: 66 type: MSG_ACCESSORY_STATE (0xb8) action id: 243 libbidib: Message bytes: 0x09 0x0c 0x00 0x42 0xb8 0x01 0x01 0x02 0x01 0x03 libbidib: Feedback for action id 243: Point accessory: point4 execution: normal not reached verified with wait time: 0.3s [...] libbidib: Received packet libbidib: Received from: 0x0c 0x00 0x00 0x00 seq: 70 type: MSG_ACCESSORY_STATE (0xb8) action id: 0 libbidib: Message bytes: 0x09 0x0c 0x00 0x46 0xb8 0x01 0x01 0x02 0x00 0x00 libbidib: Feedback for action id 0: Point accessory: point4 execution: normal reached verified with wait time: 0.0s

The command (message type MSG_ACCESSORY_SET) is to set the accessory "point4" to the aspect "normal", its action id is 243. A response queue entry is created, the expected response type is MSG_ACCESSORY_STATE.
The first reply of type MSG_ACCESSORY_STATE matches the entry in the response queue and is thus logged to be related to action id 243. The message content tells us that point4 has not yet reached aspect "normal", and that the remaining time to wait for the accessory to reach the target aspect is 0.3 seconds. The corresponding response queue entry is removed.
0.3s later, the second message of type MSG_ACCESSORY_STATE is received, telling us that point4 has now reached the target aspect "normal". Receiving the first message of this type lead to the removal of the response queue entry. Therefore, this message does not match any entry in the response queue. It can't be related to a non-zero action id and is thus printed with action id 0.
Note that a MSG_ACCESSORY_SET can also result in just one reply. For example, when setting a signal to a state, only one response of type MSG_ACCESSORY_STATE will be received.

The functionality for relating responses to commands and their action id should be able to correctly relate action ids with more than one expected response.

Thoughts:
Part of what makes this tricky is that the current implementation works with a queue and only compares received messages against the head of said queue. This works based on the assumption that responses are received in the same order as the commands were sent.
As there may be a considerable delay between the first and second MSG_ACCESSORY_STATE, it is not an option to just put in two response queue entries in succession to catch these. A different command with its own expected response may be sent between the first and second response from the accessory, and receive its response before the second accessory response arrives. In that case, the response to the "new" command wouldn't match the head of the response queue and be discarded.

eyip002 commented 1 year ago

A perfect solution probably does not exist because the BiDiB boards themselves do not track requests with their respective feedback. We can only make an educated guess on what feedback we expect for each request type.

akuhtz commented 1 year ago

Maybe you can adjust the logic: You can evaluate the bit 7 and bit 0 of the EVALUATE parameter of MSG_ACCESSORY_STATE. If both bits are 0 the "last" message of normal handling was received and you can remove the entry. If bit 7 is 1 the "last" message of error handling is received and you can remove the entry.

BLuedtke commented 1 year ago

Yeah, that works for knowing when the "last" reply of type MSG_ACCESSORY_STATE is received. Thanks for the input!
However, this does not get around the queue problem: A response to a different command may be expected between the first and second MSG_ACCESSORY_STATE response. If we do not pop the head of the response queue after the first MSG_ACCESSORY_STATE, the response to the unrelated command would be compared to the "wrong" response queue entry. However, there might be some tricks to get around that.

Edit: One workaround just came to my mind. In the response queue entries we do save the "creation time". Now, when receiving the first message of MSG_ACCESSORY_STATE, this message tells us how long we'll have to wait. In the example, that was 0.3 seconds. If we could manipulate the queue, we could move the response queue entry for MSG_ACCESSORY_STATE backwards when receiving the first reply, moving it until its (creationtime + wait time) is later than the creationtime of the previous response queue entry. That would also have to be done when new messages are added to the response queue afterwards.

akuhtz commented 1 year ago

@BLuedtke I had just a quick look at the linked response queue and if I understand it correct, there is 1 queue per node.

Example: If you set an aspect on 2 accessories on the same node (almost simultaneously). The first aspect takes very long to complete, and the second is very short. The "final" execution state of the second aspect is delivered before the "final" execution state of the first aspect.

If I get it right then the evaluation is currently not correct, right? You should be able to test it with 2 macros (1st with multiple steps of delay, 2nd with only a short delay).

I'm not sure if moving the entry is the solution. If you move the entry (because you got the wait message) and then an error happens during execution (Murphy ...), the node will send an error response immediately, before the other entry is received. Now the entry is in the wrong position.

Maybe I didn't get the problem right, but maybe this examples help you.

BLuedtke commented 1 year ago

I'm not sure if moving the entry is the solution. If you move the entry (because you got the wait message) and then an error happens during execution (Murphy ...), the node will send an error response immediately, before the other entry is received. Now the entry is in the wrong position.

I'm not sure if the error response would be sent immediately. Based on the spec, I'd expect to receive a message with the appropriate error code after the wait time of the first message. Of course, if something such as a short circuit occurs, we can not rely on the rest of the messages being received as we expect. They would time out after roughly 3 seconds.

akuhtz commented 1 year ago

@BLuedtke The expected wait time is sometimes a guess. I have seen nodes that send weird wait time values ... ... and I'm pretty sure the error response is sent immediately :-)

BLuedtke commented 1 year ago

Interesting. With our use of the Fichtelbahn control boards, the wait time seems to be directly related to how we configure the servo motor PWM settings with the BiDiB Wizard. But definitely good to know about the error being sent immediately.
I think a bigger overhaul of the implementation of response tracing in our source is needed eventually. Both the current datatypes and functions for parsing are unsuitable for modification and/or rather hard to read.

uniba-swt / libbidib

ActionId of Commands with > 1 response not traced correctly #25