min-protocol / min

The MIN protocol specification and reference implementation

Min ID and other questions? #11

Open erakis opened 6 years ago

erakis commented 6 years ago

Hi @kentindell,

I was implementing the HDLC protocol and, while looking for more information about sliding windows, I finally came across your library. You did a really nice piece of work, I like it. Congratulations.

I need something similar, and I think that re-implementing HDLC from scratch would not be productive now that I've discovered MIN.

But I have a couple of questions to which I cannot find clear answers.

  1. What is the purpose of the min_id parameter? I dug into the examples and source code but I'm not sure I understand the purpose/reason. Is it a kind of abstraction around the sequence number of the protocol? Or is it more like an ID to recognize the MIN context associated with the operation?

  2. I saw some magic numbers, 0x3fU, 0x80U, 0x55U, in the source code. Would it be possible to declare those as defines with a _MASK suffix? E.g. #define IS_TRANSPORT_MASK 0x.... Also, could you give me more details about these 3 masks? Not about the arithmetic, but about their purpose?

  3. What is the purpose of adding the payload length? Can't you already deduce it, with (EOF - SOF - stuffing bytes - header length - CRC) = payload length?

  4. Why use the SOF (0xaa) 3 times instead of a single one? Since you are stuffing bytes (like HDLC framing), aren't you already preventing the flag sequence from occurring within the content of the frame?

  5. It would be nice to have an example that compiles on Linux (gcc) or Windows (VSxxxx). Are you going to add one? If I come to understand things better after receiving your answers, maybe I could do that 🍻

Sorry to create an issue; I would have liked to contact you by email, but I did not find it anywhere.

Best regards, Martin

kentindell commented 6 years ago
  1. A MIN ID is out-of-band information that allows a receiver to handle the contents differently. The idea is to have different IDs for different sensor data, another one for command messages, another for diagnostics, and so on. The min_id byte is also used to encode out-of-band signalling for the transport layer on top. So only application MIN IDs in the range 0-63 are permitted (hence the masking with 0x3f to ensure this): the top two bits are reserved for the transport layer. A sketch of this masking appears after this list.

  2. 0x55 and 0xaa are magic bytes that mark the start and end of a frame (so anything else can be detected as an error). I picked them because they are the bit patterns 01010101 and 10101010, which could be used to automatically detect the baud rate (because there's exactly 1 bit time between transitions).

  3. You can't deduce the payload length before you've received the whole frame, and you don't know where the EOF is without knowing the payload length.

  4. The SOF is 3 bytes because that's the stuff length: 0xaa 0xaa 0xaa is the way to synchronise on the SOF if the receiver comes in halfway through a frame. The reason the stuffing marker is 3 x 0xaa and not 2 is to keep the worst-case frame length (and hence static buffer allocation) moderate. And I didn't go with 4 because then the header would make a small MIN frame proportionally much bigger, and hence the overhead much higher.

  5. I designed this with the assumption that the embedded device runs C and the host is a PC-type device running something like Python. I should really do a host version in C too. But the goal was to get a protocol that scales down to very small 8-bit AVRs and PICs with tiny amounts of RAM that periodically send sensor data (e.g. a smart motor controller) to a PC host that's doing the UI and other control. I later added support for the transport layer, which would run on something with a bit more resources, like a Cortex-M ARM microcontroller.
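
To make the masks from question 2 concrete, here is a minimal sketch of how a receiver might pull the fields apart; the *_MASK names are hypothetical (the library itself uses the literals):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical names for the masks discussed above: the low six bits
       carry the application MIN ID (0-63); the top bits are reserved for
       transport-layer signalling, with 0x80 marking a transport frame. */
    #define MIN_ID_MASK        0x3fU
    #define TRANSPORT_BIT_MASK 0x80U

    int main(void)
    {
        uint8_t id_control   = 0x85U;                      /* example byte off the wire */
        uint8_t app_id       = id_control & MIN_ID_MASK;   /* application MIN ID: 5 */
        int     is_transport = (id_control & TRANSPORT_BIT_MASK) != 0U;

        printf("min_id=%u transport=%d\n", (unsigned)app_id, is_transport);
        return 0;
    }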

No problem with it being raised as an issue: it's good to have questions in the open for others to see.

erakis commented 6 years ago

Hi @kentindell,

First, thank you for taking the time to answer me.

  1. So if I understand correctly, I could always use 0 as the MIN ID for all frames with data and it should work properly? If I misunderstood, could you give me a concrete example of the min_id purpose? By the way, the only things I need to exchange are (data msg, ack, reset). Moreover, what is the purpose of the two other reserved bits? Is it to identify the frame type (data vs ack)?

  2. Thank you.

  3. I'm not sure I understand, and I'm still curious. If we were reading byte by byte from a serial port, we would discard the bytes for (escape/control chars, SOF, EOF, checksum) and only keep the frame's data bytes in a buffer. So once you detect the SOF, you start discarding or stacking bytes until you read the EOF. Thus, what's left in the stack buffer, I mean the number of bytes, is your payload length? So there is no need to exchange the payload length in the protocol header? In the HDLC protocol, the payload length is not exchanged. Is there an advantage to exchanging the payload length? Unless there is a subtlety that I do not understand with the MIN protocol?

  4. What do you mean by stuff length? If we compare with the HDLC protocol, the SOF and EOF are the same octet. But it uses only one 1-byte flag, and stuffing is only applied when the flag (SOF) appears in the content, up to the last SOF, which marks the EOF. If we started to read halfway through a frame, we would skip every octet until we read the SOF. Like this image

  5. I'm currently converting your work to C++. I will move the context into a class member, and a class will handle a single context. The only thing I plan to change right now is the global ring buffer: there will be one in each class (as a member) instead of a global one for all contexts/classes.

Sorry for these new questions. Best regards,

kentindell commented 6 years ago
  1. Yes, that would work fine.

  2. The EOF character isn't anything special except as a fixed value to do a cross-check that something hasn't gone wrong. It's not a framing character: the length field is used as a countdown to read the right number of bytes (see the receive sketch after this list). Technically you could skip the EOF character completely. It might be better to do that as part of a revision to the CRC mechanism (see below).

  3. Yes, that's effectively a stuff length of 1. But you can see that in the worst case the amount of data doubles. I wanted to ensure that the system can be analyzed for the worst case so that everything can be statically bounded (buffer sizes, compute times, transmission time). So I chose a stuff length of 3, meaning that the worst-case payload 'only' increases by 33% (see the stuffing sketch below).
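
A minimal sketch of the countdown idea in answer 2, assuming a byte-at-a-time receiver; the state and field names are illustrative, not the library's:

    #include <stdint.h>

    enum rx_state { RX_SEARCHING_SOF, RX_LEN, RX_PAYLOAD, RX_CRC, RX_EOF };

    struct rx {
        enum rx_state state;
        uint8_t remaining;      /* countdown taken from the length field */
        uint8_t buf[255];
        uint8_t n;
    };

    /* Feed one received byte into the state machine. */
    void rx_byte(struct rx *r, uint8_t b)
    {
        switch (r->state) {
        case RX_LEN:
            r->remaining = b;
            r->n = 0;
            r->state = (r->remaining > 0U) ? RX_PAYLOAD : RX_CRC;
            break;
        case RX_PAYLOAD:
            r->buf[r->n++] = b;
            if (--r->remaining == 0U)   /* length tells us when the payload ends */
                r->state = RX_CRC;      /* no need to scan for an EOF marker */
            break;
        default:
            /* SOF search, CRC check and EOF cross-check elided */
            break;
        }
    }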
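
And a sketch of the stuffing rule in answer 3, assuming the transmitter inserts a 0x55 stuff byte after every two consecutive 0xaa bytes so the three-byte SOF can never appear inside a frame (a sketch of the rule, not the library's actual transmit path):

    #include <stddef.h>
    #include <stdint.h>

    /* 'out' must have room for len + len/2 bytes: the worst case is a
       payload of nothing but 0xaa, where every pair gains a stuff byte. */
    size_t stuff_bytes(const uint8_t *in, size_t len, uint8_t *out)
    {
        size_t n = 0;
        int run = 0;                      /* consecutive 0xaa bytes seen so far */
        for (size_t i = 0; i < len; i++) {
            out[n++] = in[i];
            run = (in[i] == 0xaaU) ? run + 1 : 0;
            if (run == 2) {               /* break the run before it reaches 3 */
                out[n++] = 0x55U;
                run = 0;
            }
        }
        return n;
    }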

Since I designed this protocol I came across Phil Koopman's work on CRCs, and it turns out that a CRC at a variable offset in a frame is a bad idea: anything that corrupts the length can cause the Hamming Distance calculations of the CRC to be completely undermined. See slide 55 of:

Data Integrity Techniques: Aviation Best Practices for CRC & Checksum Error Detection

This problem applies to a lot of existing protocols (not just CAN). So I need to revise this protocol and put a small CRC on the header, including the length, just as FlexRay does, and then a second CRC at the end of the frame. And I need to choose a better polynomial (Phil Koopman's team did an amazing job with brute-force evaluation of all polynomials of all CRCs up to 32 bits and made a searchable database of the results).
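
As an illustration only (this layout is hypothetical, not the current MIN spec or the FlexRay format), a revised frame could carry one CRC over the header fields including the length and a second over the payload, so a corrupted length is caught before it misplaces the trailer CRC:

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (reflected, polynomial 0xedb88320), used for both
       fields purely for illustration; a real revision would pick
       polynomials from Koopman's tables. */
    static uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xffffffffU;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ ((crc & 1U) ? 0xedb88320U : 0U);
        }
        return ~crc;
    }

    /* Hypothetical layout: | id | len | header CRC | payload | payload CRC | */
    void protect_frame(uint8_t id, const uint8_t *payload, uint8_t len,
                       uint32_t *header_crc, uint32_t *payload_crc)
    {
        uint8_t header[2] = { id, len };
        *header_crc  = crc32(header, sizeof(header)); /* covers the length field */
        *payload_crc = crc32(payload, len);
    }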

erakis commented 6 years ago

Hi @kentindell,

Thank you so much for your explanations and your help.

I'm currently using your protocol and I'm wondering how to re-synchronize two devices (dev1, dev2) in these cases.

  1. dev1 is synchronized with dev2 and they have been exchanging for multiple hours. While dev1 is sending something every second, suddenly dev2 goes offline forever. Now dev1 will try to re-transmit the same frame until it gets the ACK for it. At the same time it will also accumulate a new frame to send every second until the window has no more space. Thus, to detect a communication failure, I could use a counter when re-transmitting, and if a timer elapses (e.g. 1 sec) I could push a message or use a callback to signal a failure, after which I reset dev1 and try to send a reset to the remote. Am I right?

  2. dev1 is synchronized with dev2 and they have been exchanging for multiple hours. While dev1 is sending something every second, suddenly dev2 gets disconnected for some seconds. Now dev1 will try to re-transmit the same frame until it gets the ACK for it. But dev1 does a local reset and fails transmitting the reset to dev2. Now when the connection is re-established, dev1 will try to send an initial message to dev2, but dev2 will not have reset its sequence number like dev1, so each time dev1 sends it a packet, dev2 will ignore it (sequence_mismatch_drop++ or spurious_acks++). So on the dev1 side, we will have to wait for the retransmission timer to elapse and finally send a transport reset command to dev2. Is that OK?

  3. I'm not sure I understand why no ACK is sent when we detect the line is idle? If the other side comes back and we are not sending ACKs, how can we detect that the remote came back?

    // Periodically transmit the ACK with the rn value, unless the line has gone idle
    if(now - self->transport_fifo.last_sent_ack_time_ms > TRANSPORT_ACK_RETRANSMIT_TIMEOUT_MS) {
        if(remote_active) {
            send_ack(self);
        }
    }
  4. When debugging intensively for disconnections I sometimes triggered the asserts, but I'm not really sure about them. Sometimes I get this one:

    #ifdef ASSERTION_CHECKS
    assert(window_size > 0);
    assert(window_size <= self->transport_fifo.nframes);  // <---- This one
    #endif
  5. If I'm wrong about 1) and 2), I'd like to know your strategy for re-synchronization.

  6. Have you used this protocol on existing products?

Best regards, Martin

kentindell commented 6 years ago

If the assertion checking is failing then something is definitely wrong: it should never trigger. But the transport protocol wasn't really designed to handle the higher-level network status of a device going offline and coming back. Are your devices running without a reset for this length of time? If they are, and it's only the communication that fails, then something is wrong. I put a soak test together using an Arduino board, but unfortunately the drivers for USB on Arduino are buggy and they crash, causing a loss of communication, so the soak test wouldn't run for the several days that I wanted.

Having said this, the transport protocol wasn't really designed to handle the network management part like TCP/IP does (where there is a complex state machine and timeouts for synchronising status in case a connection drops). So if this can happen, then the best way to handle it is to use a regular non-transport MIN frame in each direction to provide a heartbeat signal, where a timeout from loss of the heartbeat causes both sides to reset. A sketch of this is below.
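
A minimal sketch of that heartbeat, assuming the reference implementation's min_send_frame() for non-transport frames; HEARTBEAT_ID, now_ms() and reset_link() are hypothetical names supplied by the application:

    #include <stdint.h>
    #include "min.h"                       /* reference implementation's API */

    #define HEARTBEAT_ID         0x20U     /* any spare application MIN ID (0-63) */
    #define HEARTBEAT_PERIOD_MS  1000U
    #define HEARTBEAT_TIMEOUT_MS 3000U

    static uint32_t last_tx_ms, last_rx_ms;
    extern uint32_t now_ms(void);          /* platform millisecond tick */
    extern void     reset_link(struct min_context *ctx);  /* application reset */

    /* Call periodically from the main loop, alongside the normal MIN polling. */
    void heartbeat_poll(struct min_context *ctx)
    {
        uint8_t beat = 0U;
        if (now_ms() - last_tx_ms >= HEARTBEAT_PERIOD_MS) {
            min_send_frame(ctx, HEARTBEAT_ID, &beat, 1U);  /* non-transport frame */
            last_tx_ms = now_ms();
        }
        if (now_ms() - last_rx_ms >= HEARTBEAT_TIMEOUT_MS) {
            reset_link(ctx);               /* loss of heartbeat: both sides reset */
            last_rx_ms = now_ms();
        }
    }

    /* In the application's receive callback, record last_rx_ms = now_ms()
       whenever a frame with min_id == HEARTBEAT_ID arrives. */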

Finally: yes, I've used this inside a network bus analyzer tool. I embedded this MIN implementation inside the MicroPython firmware on a PyBoard and I have a small monitoring process running. I use the monitor and MIN transport to pass up to a host PC the CAN frames the device sees, and also for the host to pass down CAN frames it wants to send. The system is used as a test harness to run CAN network tests (some of these tests run for hours, sending thousands and thousands of CAN frames). In fact, I wrote the transport layer specifically for this system: the network analyzer cannot lose data, because that would cause my automated CAN network tests to fail (and of course a network analyzer that loses data is a useless tool).

erakis commented 6 years ago

Hi @kentindell,

I have thoroughly studied your protocol and I think I have found a solution for re-synchronization. The idea is to use the reset command as a re-synchronization facility. Please consider these two serial devices.

  1. dev1 is connected to dev2 and communication has been going on for many hours. Every second they send each other the time using a 64-bit variable.

  2. Suddenly the communication between them is lost, whatever the reason: power down, or a communication failure like too much noise, etc.

  3. dev1 no longer receives ACKs from dev2 and starts accumulating time changes until its sliding window runs out of space. It also starts a no-response timer (on the application side) that will be used to determine when dev1 considers communication with dev2 lost.

  4. Same behavior as 3) for dev2 (if it's still online).

  5. The no-response timer of dev1 elapses. dev1 does a transport reset and tries to send a reset to dev2. While doing the local transport reset, dev1 assigns itself an invalid sequence number, transport_fifo.rn = 0xff. This way, dev1 can no longer accept any transport frame except a RESET or a non-transport one.

  6. Now suppose dev2 comes back from a communication failure and tries to send using its old sequence number: dev1 will refuse the frame, as it is not possible to use 0xff as a sequence number. The only way to get dev1 to accept frames again is for it to receive a RESET from dev2.

So the main idea is that until a RESET command has been exchanged, the two parties cannot talk to each other with transport frames. A sketch of this guard is below.
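
A compact sketch of that guard, with illustrative names rather than the library's (0xff serves as an out-of-band "no session" value for rn):

    #include <stdbool.h>
    #include <stdint.h>

    #define RN_INVALID 0xffU   /* out-of-band: no valid receive sequence */

    struct transport_state { uint8_t rn; };

    /* Local reset: refuse all transport frames until a RESET arrives. */
    void local_reset(struct transport_state *t)
    {
        t->rn = RN_INVALID;
    }

    /* Accept a transport frame only if it is a RESET (which re-opens the
       session) or its sequence number matches an established rn. */
    bool accept_transport_frame(struct transport_state *t, uint8_t seq, bool is_reset)
    {
        if (is_reset) {
            t->rn = 0U;
            return true;
        }
        return (t->rn != RN_INVALID) && (seq == t->rn);
    }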

kentindell commented 6 years ago

Yes, that looks workable. You could also just exchange regular MIN messages as the timeout heartbeat, which would be lower overhead than using the transport layer.