Multinode connection issues

Doubletop12 commented 1 year ago

As requested, opening as a new issue.

If I am running more than 5 nodes in a mesh with an RPi as the master and Arduino as the slaves on restarting everything I will have one of the nodes struggling to connect. It may occasionally get in but soon drops out again and then can't reconnect. Deleting dhcplist.txt on the RPi fixes the problem. Subsequent restarts and the problem comes back, but maybe not immeadiatley.

I have seriously considered having the code delete dhcplist.txt on startup, but felt it was a retrograde step. The dhcplist is there for a reason.

Pete

2bndy5 commented 1 year ago

Again, I ask: Do you know if the node struggles on net level 1 more than a higher level?

I don't think the problem is with the dhcplist.txt because there is a known problem in the mesh handshake algorithm (see doc'd connecting process and it's points of failure).

Doubletop12 commented 1 year ago

To answer @2bndy5's questions in #137

Do you know if the node struggles on net level 1 more than a higher level?

I'm not sure if it is the lowever level node that struggles but have assumed that it could be the case I'll check

It might also help to know the exact hardware used (eg RPi variant, Arduino variant, if radio module is PA/LNA or not, etc)

Raspberry Pi 2 Model B Rev 1.1 Debian_version 11.6 python3 --version 3.9.2 RF24 et al installed 30/11/22 (I see there have been updates since)

Arduino dev environment VSC with platformio mix of ESP32 and AStar 32U4

[env:a-star32U4]
platform = atmelavr
board = a-star32U4
framework = arduino

[env:esp32dev]
platform = espressif32
board = esp32dev
framework = arduino

Both devices suffer the problem

RF24 are nRF24L01Plus clones with power supply capacitor mod

Doubletop12 commented 1 year ago

I will now investigate node adressess when they fail and then look at implemening any updates that have occured since November

2bndy5 commented 1 year ago

I suspect that your master node is running slower than the connecting nodes. So much that the connecting nodes stop listening for a response from master before master is able to respond, thus behaving like the handshake process failed. I'm not sure how deleting the dhcplist.txt helps this, but it may have something to do with the RPi read/write speed (file system throughput).

Doubletop12 commented 1 year ago

The autoretry is set to 1500us and 5 attempts, that is 7.5ms That is quite a large window.

I am having another look at this right now. Nothing better than writing down the problem and someone throwing questions at you to reset your thinking.

2bndy5 commented 1 year ago

AutoAck (and consequently AutoRetries) isn't used for the handshake with nodes on level 1.

Doubletop12 commented 1 year ago

I have been looking at this most of the day and following up on the timing suggesting reduced some of the load on the Rpi. Along with some other tweaks. It did appear to make some difference but the problem, or maybe something similar, resurfaced at times. I now have instrumented the code and got the logic analyser checking the timing. I am still trying to reconcile what I am seeing.

More tomorrow

Doubletop12 commented 1 year ago

AutoAck (and consequently AutoRetries) isn't used for the handshake with nodes on level 1.

Do you mean the master doesn't Autoretry or the remotes don't? I was referring to remotes. If the problem was the master being too slow and the remotes not getting an Ack then would they retry? I would have thought that with async remotes clashes were inevitable so autoretry would be necessary?

2bndy5 commented 1 year ago

If you want to disable dhcplist.txt usage all together, you could simply apply the following patch

--- a/RF24Mesh.cpp
+++ b/RF24Mesh.cpp
@@ -452,7 +452,7 @@ void RF24Mesh::setAddress(uint8_t nodeID, uint16_t address, bool searchBy)
 void RF24Mesh::loadDHCP()
 {

-    #if defined(__linux) && !defined(__ARDUINO_X86__)
+    #if defined(__linux) && !defined(__ARDUINO_X86__) && 0
     std::ifstream infile("dhcplist.txt", std::ifstream::binary);
     if (!infile) return;

@@ -474,7 +474,7 @@ void RF24Mesh::loadDHCP()

 void RF24Mesh::saveDHCP()
 {
-    #if defined(__linux) && !defined(__ARDUINO_X86__)
+    #if defined(__linux) && !defined(__ARDUINO_X86__) && 0
     std::ofstream outfile("dhcplist.txt", std::ofstream::binary | std::ofstream::trunc);

     for (int i = 0; i < addrListTop; i++) {

This might help isolate the cause a bit more.

Another interesting point is where the connecting node decides to stop listening for an assigned address: https://github.com/nRF24/RF24Mesh/blob/105386baa2a13e24e595c1bd4f173b353b6c9f26/RF24Mesh.cpp#L357-L359 and when the connecting node decides to stop polling a certain network level (then try the next higher level): https://github.com/nRF24/RF24Mesh/blob/105386baa2a13e24e595c1bd4f173b353b6c9f26/RF24Mesh.cpp#L328-L329

Do you mean the master doesn't Autoretry or the remotes don't?

Concerning net level 1 (nodes 01-05), any handshake is done without AutoACK packets. Note, AutoAck is a required feature for AutoRetries, so it doesn't matter if it is the remote or the master; AutoAck settings need to match on both ends for it to work. This is the crux of doc'd point of failure (bullet point 2).

I would have thought that with async remotes clashes were inevitable so autoretry would be necessary?

OTA collision is handled differently depending on the context. For normal network operations, AutoAck is enabled -- except for

multicasts because a single ACK packet isn't indicative at all and causes OTA collision with others attempting to acknowledge (or forward the message to the appropriate level).
handshakes on net level 1; I think I'm repeating my self now.
explicitly network.write()ing to a node that is 1 hop away.

Keep in mind that the AutoRetries setting is re-configured on every call to network.begin(): https://github.com/nRF24/RF24Network/blob/2a7c942a66e10e3b7d0b09cd58c7b1a6427ff1b1/RF24Network.cpp#L107-L111 So, whatever you're telling it to do with the RF24 radio object, it is getting overridden when the mesh node is assigned a network address.

2bndy5 commented 1 year ago

Just to be painfully clear...

handshakes with AutoAck:

master <--> intermediate_node [<--> ...] <--> connecting_node

handshakes without AutoAck:

master <--> connecting_node

TMRh20 commented 1 year ago

If it is just one or two nodes having problems, and the hardware is roughly the same on other nodes, I would be looking specifically at the nodes with the problems, ie: potential hardware/power supply issue

What if you add radio.setPALevel(RF24_PA_MIN,0); after any calls to mesh.begin(); ? Is there any difference there?

It could also have something to do with the code running on the devices, but I think that is less likely.

Doubletop12 commented 1 year ago

Thank you both, a lot to look at here and consider where I need to start.

After thinking about this overnight I am going to do some more work with the logic analyser just to confirm that my code is doing what I intended it to do and then overlay your deeper level thinking to further qualify what is happening. I'll try to avoid copying code for review, this isn't a forum. I will see if there is a repeatable way to demonstrate what is happening.

Doubletop12 commented 1 year ago

Background

This application receives data from 7 remote nodes, it could be more, and processes the data and sends commands back to the nodes based on the received data. Data payload is 3 bytes each way.

Each remote node sends its data asynchronously every 0.5secs.

The Rpi loops listening on “network.available” it then loops receiving the data from the masters buffer through “network.read” The Rpi processes the data and every 1 second loops through the active node list sending the commands to the nodes through “mesh.write”

The application display has a panel that shows the state of the nodes. Greyed out if inactive, green if have been seen in the last 2 seconds, orange if last seen between 2 and 5 seconds, red if last seen between 5 and 10 seconds and greyed out if last seen over 10 seconds ago.

The problem has been nodes going missing for extended periods. See images Screen 1 and Screen 2.

Screen 1

Screen 2

The data in the log window Top row, the payload going to the node Second row Node ID, Address, seen or not seen, Payload(3 bytes status, error, check), command to the remote, Flag for Tkinter, Last ID

Investigation

A logic analyser was connected to 4 I/O pins which change state by code switching the pin with on/off calls placed either side of the code being investigated.

Yellow trace - is the top of the main loop toggling on each pass. Green trace – used for various samples, set high on entry and low on exit Blue Trace – Rx around “network.read” and the initial post processing. Set high on entry and low on exit Purple Trace - Tx around “mesh. write”. Set high on entry and low on exit

Example 1

Yellow the main loop. What is noticeable is the big gaps in the loop. At first this was thought to be Rpi OS overhead, but more on that later

X Loop is the code around the code executed every second that sends the commands to the active remotes. Incorporated in this the Tx “mesh.write” trace and shows that the application code provides little additional overhead to “mesh.write”

Blue Rx “network.read” in this case single entries

Purple Tx writes and showing the 6 sends to the 6 active nodes. Each positive transition is a write.

Example 2

Yellow the main loop.

X Loop this time the ‘wrapper’ around the code “while network.available”. which includes “network.read” indicating that, as expected, once the Rx buffers had all been read the code dropped out of the “network.available” loop.

Blue “network.read”, the broader timeframe shows the asynchronous nature of the time 6 remotes send their data.

Purple Tx writes again showing the 6 sends to the 6 active nodes. Each positive transition is a write. The varying duration of each frame is of interest? Again, there is a direct correlation between the “mesh write” timing and the gaps in the main loop. Is this OS overhead or not?

Example 3

A close up example of three cases being “network.read” in one “network.available” cycle

Example 4

The green trace is a indicates the point in time that a remote had not been seen for more than 2 seconds. (The pulse width is irrelevant; it has been set long enough to be seen). Again, this occurs during a sequence of “mesh.write”s. The overall interval of this sequence of writes is about 100ms and the longer pulse is in the order of 75ms. When the code is

        #send commands to remote units  
        tx_pin.on()                                                    
        ok = mesh.write(Unit[node][status.ADDRESS],bytearray(Command),0x7F) 
        tx_pin.off()                
        if ok:

It made me wonder what is happening in mesh.write?

Example 5

A wider view of the “mesh.write” issue.

Example 6

To confirm my suspicions about “mesh.write” I decreased the interval of the sending of the commands to the remotes from 1 second to 0.3 seconds. This had the effect of drastically increasing the number of missing remotes (green traces) and consequently decreasing the number of sends in each command cycle (purple). (this example is how I achieved the screen 2 image)

It is apparent that “mesh.write” is the culprit but now writing this I realise behind “mesh.write” is the code for the SPI interface and any impact the OS may have on that. However, there isn’t a similar impact to “network.read” and that uses the same interface.

So is the problem with “mesh.write”?

Doubletop12 commented 1 year ago

If it is just one or two nodes having problems, and the hardware is roughly the same on other nodes, I would be looking specifically at the nodes with the problems, ie: potential hardware/power supply issue

What if you add radio.setPALevel(RF24_PA_MIN,0); after any calls to mesh.begin(); ? Is there any difference there?

It could also have something to do with the code running on the devices, but I think that is less likely.

I have just about given up with the A-Star based units. They have proven unreliable. However, the power supply observation may be relevant. The ES32's use the on board 3.3V but the A-Star use a seperate regulator. There was probably a reason for that maybe I felt the on board 3.3v couldn't cope.

My code for both Rpi and Arduino have radio.setPALevel(RF24_PA_MIN,True), rather than radio.setPALevel(RF24_PA_MIN,0) after "mesh.begin" (which I acknowledge is different) . But I did find an instance in the Arduino versions where it was missing where the "mesh.begin" is used for recovery.

2bndy5 commented 1 year ago

Each remote node sends its data asynchronously every 0.5secs.

As soon as I saw this, I thought that interval was a bit tight for every mesh node to be transmitting, even more detrimental if your network is going to be rather large.

mesh.write() and network.read() aren't directly using SPI. Rather, mesh.write() does extra operation around calls to radio.write() (which is where SPI is used). network.read() doesn't use SPI at all because the RF24Network layer caches the received messages during network.available(); mesh.available() has a little overhead wrapped around network.available() that checks for system messages (messages used by the RF24Network/Mesh layers to implement network behavior). System messages include handshake operation and network ACKs (which are a software implemented ACK packets for network transactions).

So, the main blocking methods are mesh.write() and mesh.available(). If you need better speed, then RF24Network by itself (no mesh) is your answer. It has considerably less overhead than the mesh layer, but maybe the idea of static addresses is unappealing.

2bndy5 commented 1 year ago

So is the problem with “mesh.write”?

Using mesh.write() from a master requires network address lookup from memory. Using mesh.write() from a sibling node has to lookup the network address from master first (as long as the queried mesh id is not reserved for master), then uses the network address in a network.write().

Using the overloaded mesh.write() with a known network address (instead of a mesh id) does not have this extra lookup overhead.

Doubletop12 commented 1 year ago

From the analyser timings the receive period is on around 300us so plenty of time to get the message across in a 500ms cycle.

I was referring to SPI as the lowest level before hitting the hardware. “mesh.write()” may not do it directly but it would be dependant on it.

I am using “mesh.write()” because the eventual implementation will potentially require the element of resilience the mesh can provide. Distances could be a couple of hundred metres.

The 100ms for the 6 “mesh.writes” just doesn't seem right when compared to the other timings. it is orders of magnitude greater than other timings and extreme when a single mesh write takes about 1ms. Something must be wrong there to be holding the system up for that length of time.

2bndy5 commented 1 year ago

Like I said, the only overhead (other than that present in lower RF24* layers) is the network address lookup. It may also be competing with your sibling nodes that are transmitting every 0.5 secs.

Doubletop12 commented 1 year ago

I am using mesh.write(address...)

Like I said, the only overhead (other than that present in lower RF24* layers) is the network address lookup. It may also be competing with your sibling nodes that are transmitting every 0.5 secs.

Given the data rate of the 6 nodes sending 12 bytes at 1Mbs that's around 100us each or 600us in every 500000us time period. So I really can't see that clashing with the equivalent 12bytes going back in the same timeframes. It leaves 0.499s ( 49880us) for other stuff to happen. OK that is an oversimplification of the time available but serves to demonstrate that there is plenty of time available but mesh.write() is using up to100ms of every second thats 10% of the total time. There has to be something wrong with that?

TMRh20 commented 1 year ago

Sounds about right... There is overhead with RF24Mesh and address lookups if you are not just sending to the master.

#ifndef MESH_LOOKUP_TIMEOUT
    #define MESH_LOOKUP_TIMEOUT 135
#endif // MESH_LOOKUP_TIMEOUT

/** @brief How long RF24Mesh::write() retries address lookups before timing out. Allows multiple attempts */
#ifndef MESH_WRITE_TIMEOUT
    #define MESH_WRITE_TIMEOUT 115
#endif // MESH_WRITE_TIMEOUT

bool RF24Mesh::write(const void* data, uint8_t msg_type, size_t size, uint8_t nodeID)
{
    if (mesh_address == MESH_DEFAULT_ADDRESS) return 0;

    int16_t toNode = 0;
    uint32_t lookupTimeout = millis() + MESH_WRITE_TIMEOUT;
    uint32_t retryDelay = 5;

    if (nodeID) {
        while ((toNode = getAddress(nodeID)) < 0) {
            if (millis() > lookupTimeout || toNode == -2) {
                return 0;
            }
            retryDelay += 10;
            delay(retryDelay);
        }
    }
    return write(toNode, data, msg_type, size);
}

2bndy5 commented 1 year ago

12 bytes at 1Mbs that's around 100us each or 600us

This is a calculation that only makes sense in a perfect world. In my experience, it is rather impossible to accurately predict how long transmission time well be; there are other factors besides programming (like the physics of electromagnetic radiation).

In the end, I think your network would benefit from some clock syncing (to prevent siblings from TXing when master does) and a slightly more conservative bandwidth usage (siblings could dump 24 bytes every second instead of 12 bytes every half second).

Doubletop12 commented 1 year ago

Yes, I do understand. In a past life, I worked on real time software for military radar systems.

What it serves to demonstrate is that the perfect world scenario only requires 1.2ms and it has 500ms to get it right. That is over 400 attempts. Net syncing isn’t really necessary, there is plenty of time to get the data back and forth plus plenty of other 2.4Ghz sources.

After reviewing what is going on I have made a few tweaks to the code and here is another logic analyser trace.

DSView-230512-165017

Yellow - toggled on each main loop pass

Grey - “while network.available()” loop

Red - “net.read()”

Brown processing the data and making the decisions on the commands to go back to the remote. This only fires if the status message from the remote has changed from the last time.

Blue - 1sec update loop in which the:-

Purple - commands are sent to the remotes, Orange - the log files are written and the display updated. Green - the display updates

If you look at the distribution of the “network.read()” in most cases only a single read is being done when the “network.available() loop is running. Past examples have shown up to 3 reads in one window. Plenty of spare slots in the Rx buffer for more remotes and plenty of passes in the main loop to pick them up. Signals are only missed when the data is being processed or the 1 sec update loop is running.

The mesh.write()’s are now behaving themselves. There are still a number of longer writes but a lot are only 1ms. As the commands are sent to all the remote every second if one gets lost occasionally it does not matter, the same data will be resent within a second.

However, I have never been a fan of polling and feel that interrupts would be the answer. The real world (physics and hardware) and virtual world (software) can be left to get on with their own tasks. The software does not need to wait for the reaI world to deliver to it. I realise that the nRFL01 interrupts can only be used by the nRF24 code, are they? Using timer interrupts could give back those long pauses when waiting for timeouts. At the application level C++ or Python signals may provide an answer for the read/write methods with appropriate handlers passing the data back and forth between the application and the nRF24code.

I haven’t looked at the nRF24 source but guess you guys will have been through all this and there is good reasons for the way things have been done?

I’m going to leave this at this point, Thanks for your help, making me think has resulted in improvements that are acceptable for me. I haven’t made any major code changes just juggled things around a bit so the right things are done at the right time. I still get radios going missing temporarily but the come back pretty quickly.

I’ll wait for any feedback and then close the case.

2bndy5 commented 1 year ago

I personally haven't tried using interrupts with the network layers, but I guess you could track the IRQ pin with volatile bool(s) and call mesh.update() when it is asserted. Be advised that mesh.update() (which can call radio.write()) should not be called during an ISR. For more info on RF24 layer with IRQ, have a look at the RF24::maskIRQ() and RF24::whatHappened().

TMRh20 commented 1 year ago

However, I have never been a fan of polling and feel that interrupts would be the answer. I haven’t looked at the nRF24 source but guess you guys will have been through all this and there is good reasons for the way things have been done?

The history of RF24 begins with @Maniacbug who originally coded the RF24 & RF24Network libraries. After he 'disappeared' from GitHub, I came along and improved things, but left the functionality mostly the same. Then came RF24Mesh, which was intended as a simple overlay for RF24Network, to simplify things for users. Then came RF24Ethernet/Gateway, a TCP/IP layer for RF24 Networks.

In all that development, I thought the same thing, that interrupts would be the way to go optimally, but the code would require quite a lot of changes. In the end, I left things as-is, tried to implement interrupts at the higher layers like RF24Ethernet, but that never really worked out well. All in all you are probably correct, but the libraries basically need to be re-written to properly support interrupts. Feel free to submit a PR LOL.

When it comes to the new NRF52840 devices for example, which are completely software driven, we would have to get interrupts working on those as well to support RF24 libs, instead of just using polling.

In any case that's where we are at and where things have come from.

Doubletop12 commented 1 year ago

@2bndy5, Brendan @TMRh20; Again, thanks again for your help on this. To show you the problem I thought I had required me to get the logic analyser properly wired into the Rpi and between explaining what was going on and your feedback I found a lot of improvements to make my code slicker and ensure the necessary priority was given to the Tx/Rx. The nice to have stuff being pushed into the background. Even since my last post I have it running better.

And yes, the units that were giving the most problems are hardware issues.

Given the background to RF24 I can see a case for "if it ain't broke don’t fix it" so congratulations on the work you have been doing for us all.

I'll close the case now. Pete

Doubletop12 commented 1 year ago

I'll just add another observation. I had the remotes sending status every 1/2 second and the master sending commands to all the remotes every 1 second. Although none of the remotes are communicating with each, other than being in a mesh. These situations can result in everything eventually syncing up or at least slowly going in and out of phase with each other. I’ve seen this happen with radar systems. This could result in dropped messages as they clash.

The simple fix I have used is adding a jitter to the timing of the remotes sending their status. Rather than 500ms they are now 490ms +/- 10ms. I’ve decreased the master sends from 1sec to 900ms. This pseudo randomness should help with the potential synchronisation which could be the cause of remote going missing for extended periods.

No doubt there is a nice mathematical approach to determining the optimal settings but I think what I have done will help.

2bndy5 commented 1 year ago

Yeah we use a similar staggering idea when setting the AutoRetries (based on the assigned network address).

nRF24 / RF24Mesh

Multinode connection issues #222

handshakes with AutoAck:

handshakes without AutoAck: