goToMain / libosdp

Implementation of IEC 60839-11-5 OSDP (Open Supervised Device Protocol); provides a C library with support for C++, Rust and Python3
https://libosdp.sidcha.dev
Apache License 2.0
130 stars 69 forks source link

Communications go offline and are not restored #110

Closed rm5248 closed 5 months ago

rm5248 commented 1 year ago

Describe the bug We have a situation where the CP and PD device will lose communications with each other, and the CP is unable to reconnect to the PD without being restarted. See additional context for a comprehensive explanation.

Expected behavior Once communications go offline, they should reconnect without having to restart the control panel.

Observed behavior Once the PD and CP communications stop working, the CP is unable to reconnect without being restarted.

Additional Context CP device: Linux laptop, running libosdp(current HEAD of master[883067caa8b11430d9b49393c12010a2016d3dc6]). PD device: custom hardware, running libosdp 2.2.0

On the PD device, I can induce a failure in libosdp by doing a physical action to the device. I believe that this causes some spurious bytes to be received on the serial port, but only on the PD. Once this happens, the CP will not regain communications with the PD at all and seems to eventually stop trying. The PD device will have the following in its log:

osdp: PD: DEBUG: CMD: ID(61) REPLY: PDID(45)
osdp: PD: DEBUG: CMD: CAP(62) REPLY: PDCAP(46)
osdp: PHY: EMERG: RX ring buffer overflow!
osdp: PHY: EMERG: RX ring buffer overflow!
osdp: PD: DEBUG: CMD: ID(61) REPLY: PDID(45)
osdp: PD: DEBUG: CMD: ID(61) REPLY: PDID(45)

Once the 'RX ring buffer overflow!' message appears, communications with the CP will not work until the CP restarts. Eventually the CP tries to re-establish communications, but it keeps complaining that the sequence number is bad. The CP indicates that a single command timedout, and then after that point the sequence numbers are bad:

osdp: CP: ERROR: Response timeout for CMD(60)
osdp: PHY: ERROR: packet seq mismatch 0/2
osdp: PHY: ERROR: packet seq mismatch 1/0
osdp: PHY: ERROR: packet seq mismatch 0/1
osdp: PHY: ERROR: packet seq mismatch 1/0

The following traces came from libosdp:

P_TRACE_SEND: CP->PD[0]: [10] =>
    0000  ff 53 00 09 00 05 61 00  f0 51                    |.S....a..Q      |
P_TRACE_RECV: PD[0]->CP: [8] =>
    0000  53 80 08 00 04 40 59 ac                           |S....@Y.        |
P_TRACE_SEND: CP->PD[0]: [10] =>
    0000  ff 53 00 09 00 05 61 00  f0 51                    |.S....a..Q      |
P_TRACE_RECV: PD[0]->CP: [8] =>
    0000  53 80 08 00 04 40 59 ac                           |S....@Y.        |

Any ideas as to what end this could be happening on and what to debug would be helpful.

rm5248 commented 1 year ago

Doing some further investigation, it appears as though the first time we connect to a device, we send a single packet with a zero sequence, and then communications work correctly. When we try to re-establish the connection, we send two packets with a sequence number of 0.

I added some debug statements, this is what happens on startup:

"osdp: CP: INFO : Setup complete; PDs:1 Channels:1 - libosdp-2.2.0 master (883067c+)"
"osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0xFFFFFFFF.  flags: 0x00002000"
"osdp: PHY: DEBUG: send packet with seq num 0"
"osdp: CP: DEBUG: check packet: -8"
"osdp: PHY: DEBUG: rx packet with seq num 0"
"osdp: PHY: DEBUG: cur: 0 comp: 0"
"osdp: CP: DEBUG: check packet: 0"
"osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0x0.  flags: 0x00002810"
"osdp: CP: DEBUG: PD 0 init.  After poll.  Seq num 0x0.  flags: 0x00002800"
"osdp: CP: DEBUG: Setting state to 1 for PD 0"
"osdp: PHY: DEBUG: send packet with seq num 1"
"osdp: CP: DEBUG: check packet: -8"
"osdp: CP: DEBUG: check packet: -2"
"osdp: PHY: DEBUG: rx packet with seq num 1"
"osdp: CP: DEBUG: check packet: 0"
"osdp: CP: DEBUG: CMD: ID(61) REPLY: PDID(45)"

When communications are lost and then we attempt to re-establish them, the following happens:

"osdp: CP: ERROR: Response timeout for CMD(60).  Seq num 2"
"osdp: CP: DEBUG: Setting state to 0 for PD 0"
"osdp: CP: DEBUG: PD 0 reset to init state.  Seq num 0xFFFFFFFF.  Flags: 0x00002801"
"osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0xFFFFFFFF.  flags: 0x00002801"
"osdp: PHY: DEBUG: send packet with seq num 0"
"osdp: PHY: DEBUG: rx packet with seq num 2"
"osdp: PHY: DEBUG: cur: 0 comp: 2"
"osdp: PHY: ERROR: packet seq mismatch 0/2"
"osdp: CP: DEBUG: check packet: -6"
"osdp: CP: DEBUG: Setting state to 0 for PD 0"
"osdp: CP: DEBUG: PD 0 reset to init state.  Seq num 0xFFFFFFFF.  Flags: 0x00002001"
"osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0xFFFFFFFF.  flags: 0x00002001"
"osdp: PHY: DEBUG: send packet with seq num 0"
"osdp: PHY: DEBUG: rx packet with seq num 0"
"osdp: PHY: DEBUG: cur: 0 comp: 0"
"osdp: CP: DEBUG: check packet: 0"
"osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0x0.  flags: 0x00002811"
"osdp: CP: DEBUG: PD 0 init.  After poll.  Seq num 0x0.  flags: 0x00002801"
"osdp: CP: DEBUG: Setting state to 1 for PD 0"
"osdp: PHY: DEBUG: send packet with seq num 1"
"osdp: PHY: DEBUG: rx packet with seq num 0"
"osdp: PHY: DEBUG: cur: 1 comp: 0"
"osdp: PHY: ERROR: packet seq mismatch 1/0"
"osdp: CP: DEBUG: check packet: -6"

I'm not sure which side is at fault here, or if they are both wrong in different ways.

rm5248 commented 1 year ago

More debugging notes: I've done a bit more hacking of libosdp, and it seems that what is happening is that things are getting buffered somewhere, but I'm not sure where.

On the PD side, we get the RX ring buffer overflow, but we do send out a packet with seq number 1 and a payload of 19 bytes:

[1680661074.141]osdp: PHY: DEBUG: rx packet with seq num 2
[1680661074.141]osdp: PHY: DEBUG: send packet with seq num 2
wrote 9 bytes
[1680661074.314]osdp: PHY: DEBUG: rx packet with seq num 3
[1680661074.314]osdp: PHY: DEBUG: send packet with seq num 3
wrote 9 bytes
[1680661075.230]osdp: PHY: EMERG: RX ring buffer overflow!
[1680661075.230]osdp: PHY: DEBUG: rx packet with seq num 1
[1680661075.230]osdp: PHY: DEBUG: send packet with seq num 1
wrote 19 bytes
[1680661075.291]osdp: PHY: EMERG: RX ring buffer overflow!
[1680661075.804]osdp: PHY: DEBUG: rx packet with seq num 0
[1680661075.804]osdp: PHY: DEBUG: CP restart comms.  Packet len: 8 packet buf_len: 8
[1680661075.804]osdp: PHY: DEBUG: cur: 0 comp: 0
[1680661075.804]osdp: PHY: DEBUG: send packet with seq num 0
wrote 9 bytes
[1680661077.876]osdp: PHY: DEBUG: rx packet with seq num 0
[1680661077.877]osdp: PHY: DEBUG: CP restart comms.  Packet len: 8 packet buf_len: 8
[1680661077.877]osdp: PHY: DEBUG: cur: 0 comp: 0
[1680661077.877]osdp: PHY: DEBUG: send packet with seq num 0
wrote 9 bytes
[1680661077.969]osdp: PHY: DEBUG: rx packet with seq num 1
[1680661077.969]osdp: PD: DEBUG: CMD: ID(61) REPLY: PDID(45)
[1680661077.969]osdp: PHY: DEBUG: send packet with seq num 1
wrote 20 bytes
[1680661082.052]osdp: PHY: DEBUG: rx packet with seq num 0
[1680661082.052]osdp: PHY: DEBUG: CP restart comms.  Packet len: 8 packet buf_len: 8
[1680661082.052]osdp: PHY: DEBUG: cur: 0 comp: 0
[1680661082.052]osdp: PHY: DEBUG: send packet with seq num 0
wrote 9 bytes

On the CP side, we process a timeout of the last command(with seq 1), but the response to packet 1 comes in after the timeout occurs:

"[1680650264.215]osdp: CP: DEBUG: check packet: -2"
have 3 bytes
"[1680650264.241]osdp: PHY: DEBUG: rx packet with seq num 3"
"[1680650264.241]osdp: CP: DEBUG: check packet: 0"
"[1680650264.341]osdp: PHY: DEBUG: send packet with seq num 1"
"[1680650264.365]osdp: CP: DEBUG: check packet: -8"
"[1680650264.391]osdp: CP: DEBUG: check packet: -8"
"[1680650264.416]osdp: CP: DEBUG: check packet: -8"
"[1680650264.440]osdp: CP: DEBUG: check packet: -8"
"[1680650264.466]osdp: CP: DEBUG: check packet: -8"
"[1680650264.490]osdp: CP: DEBUG: check packet: -8"
"[1680650264.516]osdp: CP: DEBUG: check packet: -8"
"[1680650264.541]osdp: CP: DEBUG: check packet: -8"
"[1680650264.566]osdp: CP: DEBUG: check packet: -8"
"[1680650264.566]osdp: CP: ERROR: Response timeout for CMD(60).  Seq num 1"
have 1 bytes
have 16 bytes
have 2 bytes
"[1680650265.590]osdp: CP: DEBUG: Setting state to 0 for PD 0"
"[1680650265.591]osdp: CP: DEBUG: PD 0 reset to init state.  Seq num 0xFFFFFFFF.  Flags: 0x00002801"
"[1680650265.616]osdp: CP: DEBUG: PD 0 init.  Poll.  Seq num 0xFFFFFFFF.  flags: 0x00002801"
"[1680650265.641]osdp: PHY: DEBUG: send packet with seq num 0"
"[1680650265.666]osdp: PHY: DEBUG: rx packet with seq num 1"
"[1680650265.667]osdp: PHY: DEBUG: cur: 0 comp: 1"
"[1680650265.667]osdp: PHY: ERROR: packet seq mismatch 0/1"
"[1680650265.667]osdp: CP: DEBUG: check packet: -6"

So the first packet that we are processing on the CP side is the packet with the sequence of 1(since it got held up someplace), apparently causing all of the packet processing to fail since it is processing the wrong packet.

Implementing the osdp_pd_info_t.channel.flush method at least now causes the connection to recover(by discarding all of the data), but it seems like there may still be an underlying bug where if a packet is received after the timeout has expired it is not dropped appropriately.

sidcha commented 1 year ago

@rm5248 Thanks for the detailed bug report. I am a bit occupied with other projects ATM. Will take a look at this ASAP.

rm5248 commented 1 year ago

Here is a minimum example(in C++ because it was easier to do the threading).

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <osdp.h>
#include <chrono>
#include <thread>

struct RxBuffer{
    uint8_t buffer[256];
    int buffer_len;
};

static struct RxBuffer cp_rx_buffer;
static struct RxBuffer pd_rx_buffer;

static int cp_send_func(void *data, uint8_t *buf, int len)
{
    (void)(data);

    memcpy( pd_rx_buffer.buffer + pd_rx_buffer.buffer_len, buf, len );
    pd_rx_buffer.buffer_len += len;

    return len;
}

static int cp_recv_func(void *data, uint8_t *buf, int len)
{
    (void)(data);

    if( cp_rx_buffer.buffer_len == 0 ){
        return 0;
    }

    int maxBytesToCopy = len < cp_rx_buffer.buffer_len ? len : cp_rx_buffer.buffer_len;

    memcpy( buf, cp_rx_buffer.buffer, maxBytesToCopy );

    if(maxBytesToCopy != cp_rx_buffer.buffer_len){
        printf("wrong bytes to copy\n");
    }

    // Clear the buffer
    cp_rx_buffer.buffer_len = 0;

    return maxBytesToCopy;
}

static int pd_send_func(void *data, uint8_t *buf, int len)
{
    (void)(data);

    memcpy( cp_rx_buffer.buffer + cp_rx_buffer.buffer_len, buf, len );
    cp_rx_buffer.buffer_len += len;

    return len;
}

static int pd_recv_func(void *data, uint8_t *buf, int len)
{
    (void)(data);

    if( pd_rx_buffer.buffer_len == 0 ){
        return 0;
    }

    int maxBytesToCopy = len < pd_rx_buffer.buffer_len ? len : pd_rx_buffer.buffer_len;

    memcpy( buf, pd_rx_buffer.buffer, maxBytesToCopy );

    if(maxBytesToCopy != pd_rx_buffer.buffer_len){
        printf("wrong bytes to copy\n");
    }

    // Clear the buffer
    pd_rx_buffer.buffer_len = 0;

    return maxBytesToCopy;
}

int pd_command_handler(void *arg, struct osdp_cmd *cmd)
{
    (void)(arg);

    printf("PD: CMD: %d\n", cmd->id);
    fflush(stdout);

    if(cmd->id == OSDP_CMD_MFG){
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }

    return 0;
}

osdp_pd_info_t pd_info[1];

osdp_pd_info_t info_pd;

void pd_thread(void* data){
    osdp_t *pd_ctx;

    pd_ctx = data;

    while (1) {
        // your application code.
        osdp_pd_refresh(pd_ctx);

        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }
}

int main()
{
    osdp_t *cp_ctx;
    osdp_t *pd_ctx;

    pd_info[0].baud_rate = 115200;
    pd_info[0].address = 101;
    pd_info[0].flags = 0;
    pd_info[0].channel.recv = cp_recv_func;
    pd_info[0].channel.send = cp_send_func;
    pd_info[0].scbk = NULL;

    struct osdp_pd_cap caps[3];
    caps[0].function_code = OSDP_PD_CAP_READER_LED_CONTROL;
    caps[0].compliance_level = 1;
    caps[0].num_items = 1;
    caps[1].function_code = OSDP_PD_CAP_READER_AUDIBLE_OUTPUT;
    caps[1].compliance_level = 1;
    caps[1].num_items = 1;
    caps[2].function_code = -1;
    caps[2].compliance_level = 0;
    caps[2].num_items = 0;
    info_pd.baud_rate = 115200;
    info_pd.address = 101;
    info_pd.flags = 0;
    info_pd.channel.send = pd_send_func;
    info_pd.channel.recv = pd_recv_func;
    info_pd.id = {
        .version = 1,
        .model = 153,
        .vendor_code = 31337,
        .serial_number = 0x01020304,
        .firmware_version = 0x0A0B0C0D,
    };
    info_pd.cap = caps;
    info_pd.scbk = NULL;

    osdp_logger_init(OSDP_LOG_DEBUG, NULL);

    memset(&cp_rx_buffer, 0, sizeof(cp_rx_buffer));
    memset(&pd_rx_buffer, 0, sizeof(pd_rx_buffer));

    cp_ctx = osdp_cp_setup2(1, pd_info);
    if (cp_ctx == NULL) {
        printf("cp init failed!\n");
        return -1;
    }

    pd_ctx = osdp_pd_setup(&info_pd);
    if (pd_ctx == NULL) {
        printf("pd init failed!\n");
        return -1;
    }

    osdp_pd_set_command_callback(pd_ctx, pd_command_handler, NULL);

    int x = 0;

    std::thread th(pd_thread, pd_ctx);

    while (1) {
        osdp_cp_refresh(cp_ctx);

        std::this_thread::sleep_for(std::chrono::milliseconds(20));

        if(x++ == 25){
            struct osdp_cmd cmd;
            cmd.id = OSDP_CMD_MFG;
            cmd.mfg.command = 0;
            cmd.mfg.length = 10;

            osdp_cp_send_command(cp_ctx, 0, &cmd);
        }
    }
    return 0;
}
marjakm commented 11 months ago

I also experienced this issue, but with an older libosdp version. I did not do very thourough tests, but it seems that libosdp does not handle cases where messages come too late well - when responses come after the next message has been sent out. This might not happen when directly connected to a serial line, but I'm buffering the messages and sending and receiving will not collide on the serial line. I had this seq number issue when only one pd was on the line, but also had a issue where libosdp expected a response to PD 2, but got response to timed out PD 1 packet. I have some hacks that seem to work for my case, that I can share, but a more thought-out solution would be better.

glownt commented 9 months ago

I'm having the same problem. I am running version 2.1.0, but I assume this is still not fixed in the current version. You can have a look to the log here:

[ ][16:11:33.710][1995509632]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: INFO : Setup complete; PDs:1 Channels:1 - libosdp-2.1.0 master (d292cc9+)
[ ][16:11:33.720][1863271360]-[static void OSDP_CP_Manager::worker(std::vector<OSDP_PD_Reader*>&)]:OSDP thread running
[ ][16:11:33.934][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: CMD: ID(61) REPLY: PDID(45)
[ ][16:11:34.057][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'ContactStatusMonitoring' (4/1)
[ ][16:11:34.057][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'OutputControl' (4/1)
[ ][16:11:34.057][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'CardDataFormat' (1/0)
[ ][16:11:34.058][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'LEDControl' (4/1)
[ ][16:11:34.058][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'AudibleControl' (2/1)
[ ][16:11:34.058][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'TextOutput' (0/0)
[ ][16:11:34.059][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'TimeKeeping' (0/0)
[ ][16:11:34.059][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: Reports capability 'CheckCharacter' (1/0)
[ ][16:11:34.059][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: CMD: CAP(62) REPLY: PDCAP(46)
[ ][16:20:39.078][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:OSDP EVENT!
[ ][16:20:39.079][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:type: 0
[ ][16:20:39.079][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:reader_no: 0
[ ][16:20:39.080][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:format: 1
[ ][16:20:39.081][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:length: 32
[d][16:20:39.081][1863271360]-[static int OSDP_CP_Manager::eventFunc(void*, int, osdp_event*)]:data:  CD 1F 66 34 00 00 00 00 00 00 00 00 58 3D 0F 6F C0 43 0F 6F 80 0F F1 76 80 3E 0F 6F 00 00 00 00
[ ][16:20:39.248][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: CMD: LED(69) REPLY: ACK(40)
[ ][16:20:39.340][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: DEBUG: PD[0]: CMD: BUZ(6a) REPLY: ACK(40)
[ ][18:10:20.024][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: CP: ERROR: PD[0]: Response timeout for CMD(60)
[ ][18:10:21.141][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 0/2
[ ][18:10:23.343][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 1/0
[ ][18:10:27.443][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 0/1
[ ][18:10:35.652][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 1/0
[ ][18:10:51.747][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 0/1
[ ][18:11:23.948][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 1/0
[ ][18:12:28.056][1863271360]-[static int OSDP_CP_Manager::logFunc(const char*)]:osdp: PHY: ERROR: PD[0]: packet seq mismatch 0/1
marjakm commented 9 months ago

These two patches have worked for me (based on v2.3.0): https://github.com/bislydev/libosdp/commit/7a283e486f492e04836d5378bd900ad39ccefac7 https://github.com/bislydev/libosdp/commit/43666cf4e37cde1e5522ab135b408e0c0b73e114

sidcha commented 5 months ago

@rm5248, et, al. Thanks for the minimal reproducer; after fixing a minor bug in the reproducer (it was not setting a flush callback hence some stale data was still buffered when LibOSDP expected there to be none), I can no longer reproduce this issue.

I guess one of the many dozens of patches to clean up the CP state machine has fixed this issue.

Closing this as complete. Feel free to reopen if that is not the case.