mysensors / MySensors

MySensors library and examples
https://www.mysensors.org
1.3k stars 891 forks source link

[Bug] ESP8266MQTT-Gateway is too fast for 1Mhz battery nodes #578

Closed cimba007 closed 7 years ago

cimba007 commented 7 years ago

As discussed here I propse an delay(2) for the gateway upon responding to an I_PING request.

This might not yet be a whole solution as it would possible affect other message-response operations too.

tekka007 commented 7 years ago

Should probably add a delay of ~300ms to be on the safe side (applicable to all HWs)

cimba007 commented 7 years ago

Beware that a hard delay might block the gateway for this time .. no new messages could be proccessed.

As I am currently the only one facing the problem (at least a read nothing else) I would suggest just to keep the issue in mind when developing an RPI Version.

As my only problem was with the I_PING request it might actually effect software-ack and other messages as well.

It would be best if some1 else could confirm this issue. If I have the time today or tomorrow I might hook up my logic analyzer and use output-pins instead of serial messages to verify the timing.

cimba007 commented 7 years ago

I think I ran into the same problem with the software autoack feature .. can only receive the ack sporadically .. Seems to be time for some precise measurements.

cimba007 commented 7 years ago

So .. now some real field-data:

D5-node = low while the sensor-node is not able to receive messages. It goes high after stopListen and low after startListen.

LOCAL bool RF24_sendMessage( uint8_t recipient, const void* buf, uint8_t len ) {
    uint8_t status;

    RF24_stopListening();
    digitalWrite(5,LOW);
    RF24_openWritingPipe( recipient );      
    RF24_DEBUG(PSTR("RF24:send message to %d, len=%d\n"),recipient,len);
    // flush TX FIFO
    RF24_flushTX();
    // this command is affected in clones (e.g. Si24R1):  flipped NoACK bit when using W_TX_PAYLOAD_NO_ACK / W_TX_PAYLOAD
    // RF24_spiMultiByteTransfer( recipient==BROADCAST_ADDRESS ? W_TX_PAYLOAD_NO_ACK : W_TX_PAYLOAD, (uint8_t*)buf, len, false );
    // we are safe by disabling AutoACK on the broadcasting pipe
    RF24_spiMultiByteTransfer( W_TX_PAYLOAD, (uint8_t*)buf, len, false );
    // go, TX starts after ~10us
    RF24_ce(HIGH);
    // timeout counter to detect HW issues
    uint16_t timeout = 0xFFFF;
    do {
        status = RF24_getStatus();
    } while  (!(status & ( _BV(MAX_RT) | _BV(TX_DS) )) && timeout--);
    // timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles
    RF24_ce(LOW);
    // reset interrupts
    RF24_setStatus(_BV(TX_DS) | _BV(MAX_RT) );
    // Max retries exceeded
    if( status & _BV(MAX_RT)){
        // flush packet
        RF24_DEBUG(PSTR("RF24:MAX_RT\n"));
        RF24_flushTX();
    }

    RF24_startListening();
    digitalWrite(5,HIGH);
    // true if message sent and not timeout
    return (status & _BV(TX_DS) && timeout);
}

D4-node= HIGH right before the I_PING is sent by the sensor node and low after the ping is sent and just before the wait function is called

        digitalWrite(4,HIGH);
        transportRouteMessage(build(_msgTmp, _nc.nodeId, targetId, NODE_SENSOR_ID, C_INTERNAL, I_PING, false).set((uint8_t)0x01));
        digitalWrite(4,LOW);
        // Wait for ping reply or timeout
        //Serial.print("WAITING");
        transportWait(2000, C_INTERNAL, I_PONG);

D5-gateway = the same as D0 but on the gateway. This means if D1 = high the sendig process is finished and the gateway is listening again.

So .. what can we see?

image

The gateway is starting a send operation at 4ms .. that is 1ms before the node has finished the preparation to go to listen mode again.

ZOOOM

image

Or exaclty 888µS before the sensor-node is listening .. ofc. the message is not yet send by the gateway but assuming it is definitely send when D5-gateway goes high this indicates there is a highly likely change for a missed I_PONG by the node.

With this finding I propse to introduce an delay of 5ms. That should be plenty but it would be needed everytime the gateway is sending a response to an request from the node.

This is not needed if the gateway sends a normal message from the controller as the node should already be in listen mode.

To make this option flexible I propose to add a switch like

GATEWAY_ADDITIONAL_DELAY 5

For now this should add an delay at the following points:

Before an I_PONG message is sent:

if (type == I_PING) {
    //delay(5);
    debug(PSTR("TSP:MSG:PINGED (ID=%d, hops=%d)\n"), sender, _msg.getByte());
    transportRouteMessage(build(_msgTmp, _nc.nodeId, sender, NODE_SENSOR_ID, C_INTERNAL, I_PONG, false).set((uint8_t)0x01));
    return; // no further processing required
}

Before an ack-response is send

// Check if sender requests an ack back.
if (mGetRequestAck(_msg)) {
    _msgTmp = _msg; // Copy message 
    mSetRequestAck(_msgTmp, false); // Reply without ack flag (otherwise we would end up in an eternal loop)
    mSetAck(_msgTmp, true); // set ACK flag
    _msgTmp.sender = _nc.nodeId;
    _msgTmp.destination = sender;
    // send ACK
    debug(PSTR("TSP:MSG:ACK msg\n"));
    //delay(5)
    // use transportSendRoute since ACK reply is not internal, i.e. if !transportOK do not reply
    transportSendRoute(_msgTmp);
} 

Definitiv on the registration request too

// sender is a node
if (type == I_REGISTRATION_REQUEST) {
    #if defined(MY_GATEWAY_FEATURE)
        // register request are exclusively handled by GW/Controller
        // !!! eventually define if AUTO ACK or register request forwarded to controller
        #if !defined(MY_REGISTRATION_CONTROLLER) 
            // auto register if version compatible
            bool approveRegistration = true;

            #if defined(MY_CORE_COMPATIBILITY_CHECK)
                    approveRegistration = (_msg.getByte() >= MY_CORE_MIN_VERSION);
            #endif
            #ifdef MY_GATEWAY_ADDITIONAL_DELAY 5
                delay(MY_GATEWAY_ADDITIONAL_DELAY);
            #endif
            _sendRoute(build(_msgTmp, _nc.nodeId, _msg.sender, NODE_SENSOR_ID, C_INTERNAL, I_REGISTRATION_RESPONSE, false).set(approveRegistration));
        #else
            return false;   // processing of this request via controller
        #endif
    #endif  
}

I don't know yet if there are other situations where a delay might be needed. Assuming the node requests an sensor value via "request" from the congtroller the delay should be high enough as the controller needs some time for processing too.

Regarding an raspberry pi implementation further testing would be needed. The RF24 datasheed could be taken into account to calculate the minimum "ON-AIR" time of the shortest possible package with the slowest data-rate.

Also note that this might affect the RF*69 modueles too so the delay might be different there.

As this is a very special case the implementation of an "variable" delay might be on optimal solution.

cimba007 commented 7 years ago

From my recent experiemtns regarding the needed delay I can say that

1ms is not enough 2ms is ok 3ms is ok too (no big difference)

To take the measurements (previous post) I used

Hardware: http://www.ebay.de/itm/CY7C68013A-56-EZ-USB-FX2LP-USB-Develope-Board-Module-Logic-Analyzer-A8-/131721242177?hash=item1eab32a641:g:BWAAAOSwpzdWtWe4

Software: https://sigrok.org/

Total cost like 5$, open source and working on windows ;-)

tekka007 commented 7 years ago

@cimba007 Very good analysis, thanks for your effort. Actually, the issue seems a combination of different factors, and the culprit seems to be the timeout counter in the polling loop:

int16_t timeout = 0xFFFF;
    do {
        status = RF24_getStatus();
    } while  (!(status & ( _BV(MAX_RT) | _BV(TX_DS) )) && timeout--);
    // timeout value after successful TX on 16Mhz AVR ~ 65500, i.e. msg is transmitted after ~36 loop cycles

This timeout counter has been introduced to prevent a deadlock in case of HW issues. Now, like in your combination with a fast GW and a very slow node, the timeout is reached after GW tx but before node rx (as you've shown). Also, the timeout prevents the full execution of the auto re-transmit feature on RF24 nodes with 15 retries and 1500us retry-delay (last tx from the GW would be after >22.5ms)

I suggest you try these these two changes and if possible, upload the analyzer graph:

  1. set timeout variable to unit32_t timeout = 0xFFFFFFFF; // this is only an intermediate test as RPI are even faster
  2. remove timeout completely
cimba007 commented 7 years ago

Like I said I don't use the hardware autoack at all as it is not working - even with different librarys.

LOCAL void RF24_setNodeAddress(uint8_t address) {
    if(address!=AUTO){
        MY_RF24_NODE_ADDRESS = address;
        // enable node pipe
        RF24_setPipe(_BV(ERX_P0 + BROADCAST_PIPE) | _BV(ERX_P0) );
        // enable autoACK on pipe 0
        //RF24_setAutoACK(_BV(ENAA_P0) );
    }
}

I have not checked yet but my nodes should spend minimal time in the timeout-loop as the only condition they should wait for is _BV(TX_DS).

I reported my issue here https://github.com/TMRh20/RF24/issues/264 but it was called off as being counterfeight related. My simple testsketch with TMRh20's library indicates that even with 2x 8MHz nodes and 250kbps the performance with hardware autoack was terrible and with 1mbps it was fine but with lower range.

I will try to change the timeout-value or remove it completely but I don't expect an noticable impact.

First I will try to output the timeout value on the serial to get an idea if it might even be linked to the problem. .. standby ..

cimba007 commented 7 years ago

timeout = 65533 .. so only 2 cycles passed .. this is pretty fast.

I even tried to lower the SPI.clock of the esp8266 but this seems to either not work, or bugged or .. whatever

void before()
{
  SPI.setFrequency(100000);
}

Just for a visual comparison of the difference in speed

image (top = node, buttom = gateway)

Hell even the digitalWrite functions take lots of time :D

Option 1: Make the slow node faster (use direct pin manipulation and not (slow) digitalWrite) Option 2: Make the gateway slower (spi freq setting didn't work?!) Option 3: Add selective delay to the gateway .. Option 4: https://code.google.com/archive/p/digitalwritefast/

More to Option4:

the nrf24 library now got A LOT faster

Before: image

After: image

Remember the 5ms from before? Speeding up CSN and CE ..

Now I "might" not even need to add the extra delay anymore. Faster execution time of a send will even benefit batterytime (in theory).

Note: I have not yet tested long term effects of digitalWriteFast

Note2: If the raspi is using full speed even this optimization might not be enough and sticking to option3 (delays) might be needed.

tekka007 commented 7 years ago

timeout = 65533 .. so only 2 cycles passed .. this is pretty fast.

This is the case for AutoACK disabled - TX_DS is immediately asserted. Can you run the same test with AutoACK enabled? No clue why your AutoACK is not working - maybe you really have bad counterfeits...

cimba007 commented 7 years ago

Have a look at this thread: https://github.com/TMRh20/RF24/issues/264#issuecomment-224708822

What you can't see in the post .. everytime RPD is 0 the retrycounter is somehwere between 0 and 15 .. this was not acceptible for a distance of only a few cm.

I have wasted hours and hours, weekends and weekends with this damn hardware autoack but I am no longer willing to even check it. I disabled it and now I am running with software auto-ack .. sry for this attitude but I already wasted too much time with testing and checking the autoack.

Even if the autoack is working and without my fix the the first few packets might be lost cause the timing problem.

PS: The autoack not working is most likeley due to the counterfeits.

It is not working with

tekka007 commented 7 years ago

@cimba007 I tested your configuration using a ESP8266 GW (MQTT, no debug, at 80 and 160Mhz) and AVR@1Mhz node, all nRF24L01+ clones, MySensors 2.0.0 vanilla (i.e. AutoACK enabled). - I cannot reproduce the communication issue you observed (for now), will run additional tests.

Nevertheless, implementing digitalwritefast (if portable) is something to consider

Do you have different batches of nRF24L01+ modules to test?

cimba007 commented 7 years ago

I suspect that the autoack-feature is masking the first few lost packages.

You can try to use RF24_readByteRegister(OBSERVE_TX) & 0x0F; (upper byte should be only the total "failed" count since last channel switch.

It should be ideally called just after your write request of I_PONG on your gateway.

uint8_t transportPingNode(uint8_t targetId) {
    if(!_transportSM.pingActive){
        if(targetId == _nc.nodeId) {
            // ping to ourself, pingActive remains false
            return 0;
        }
        _transportSM.pingActive = true;
        _transportSM.pingResponse = INVALID_HOPS;
        debug(PSTR("TSP:PING:SEND (dest=%d)\n"), targetId);
        //digitalWrite(4,HIGH);
        transportRouteMessage(build(_msgTmp, _nc.nodeId, targetId, NODE_SENSOR_ID, C_INTERNAL, I_PING, false).set((uint8_t)0x01));
        //digitalWrite(4,LOW);

        // Wait for ping reply or timeout
        //Serial.print("WAITING");
        transportWait(2000, C_INTERNAL, I_PONG);
Serial.print("OBSERVE_TX: "); Serial.println(RF24_readByteRegister(OBSERVE_TX) & 0x0F);

        // make sure missing I_PONG msg does not block pinging function by leaving pignActive=true
        _transportSM.pingActive = false;
        return _transportSM.pingResponse;
    }
    else {
        return INVALID_HOPS;
    }
}

The code is untested but should give you some idea about the number of retries the gateway used on this particular message.

The counter (lower nibble) will only be reset on start of next transmission.

Protip: This could be used to derive some kind of RSSI equivalent ;-)

RalfJL commented 7 years ago

Yep, same problem here, very different setup. All nodes are running with RFM69HW radio's Gateway: Arduino Nano 16 MHz serial gateway; debug mode; 115200 baud. Controller is of no importance here but for completeness I use a pimped mqttGateway2.pl that can also accept controller responses. Node: 8Mhz no bootloader sensor on LiPo; debug mode 38400 baud-> NO problem Mode: 1Mhz no bootloader sensor on 2 AA batteries; debug 9600 baud -> a lot of timing problems

Almost every "immediate" answer to the 1Mhz node is too early. e.g. I_FIND_PARENT -> I_FIND_PARENT_RESPONSE I_PING -> I_PONG even the response to I_TIME is too fast A delay of 300ms in the according section solves that and the 1Mhz node works. Please do not understand me wrong. I am NOT suggesting to add 300ms delay to all theses sections. There should be something like an adaptive wait to cover other timings as well. Currently I have no good suggestion.

Please also note that, for me, it is very important to run on 1MHz. With 1MHz the voltage can go down as deep as 1.8V and that makes very good use of 2 AA batteries. I plan to build a bunch of water sensors and spread them across my house to be warned if there is any leakage (had already 2 incidents of that kind). This makes most sense, if you can run them for 2 or 3 years on one set of batteries. 4MHz could go down to 1.8V as well, but that means adding 1 crystal and 3 capacitors and currently it is not for sure that 4MHz would work

cimba007 commented 7 years ago

I didn't read your whole post but will do this tomorrow. Here my current working solution:

I use esp8266 as mqtt-client-gateway

On the gateway:

define MY_GATEWAY_ADDITIONAL_DELAY 3

Before the relevant _sendRoute (response) in the mysensor library (e.g. if (type == I_REGISTRATION_REQUEST) { or I_PING request)

                #ifdef MY_GATEWAY_ESP8266
                    #ifdef MY_GATEWAY_ADDITIONAL_DELAY
                        delay(MY_GATEWAY_ADDITIONAL_DELAY);
                    #else
                        #ERROR PLEASE DEFINE MY_GATEWAY_ADDITIONAL_DELAY
                    #endif
                #endif

This code will not break or alter any sketches that don't use the MY_GATEWAY_ESP8266 so it should be safe to add.

My experiment show that using digitalWriteFast for CE and CSN AND using 3ms delay is good enough in all cases. Even going as low as 1ms delay with digitalWriteFast solved all my issues.

RalfJL commented 7 years ago

P.S. removing DEBUG on the 1MHz node allows the FIND_PARENT conversation but is still to slow for PING/PONG

@cimba007 you saw that I am using a Arduino nano and a RFM69 radio. Did not dive to deep now, but digitalwritefast is for SPI, so RFM69 would profit as well, right?

cimba007 commented 7 years ago

digitalWriteFast is only for normal digitalWrite

It makes digitalWrite a lot fastr but the pin you want to change must be known on compile time

´´´c LOCAL void RF24_csn(bool level) { //digitalWriteFast(MY_RF24_CS_PIN, level);
digitalWrite(MY_RF24_CS_PIN, level); }

LOCAL void RF24_ce(bool level) { //digitalWriteFast(MY_RF24_CE_PIN, level); digitalWrite(MY_RF24_CE_PIN, level); } ´´´

I replaced the digitalWrite with the digitalWriteFast. The SPI communication can't be changed and depends on the CPU speed (1mhz).

the *fast function is only commented out for testing purpose

RFM69 would only profit everywhere "digitalWrite" is used on the RFM69.cpp.

However .. in the first place I would recommend adding delays .. you would have to experiment as the delay on the RFM69 might be different.

To further "optimize" .. let the node stay awake less time I tried the digitalWriteFast

RalfJL commented 7 years ago

What do you think about a smart_wait()? The idea is to increase a wait value if a transmission failed and reduce it over the time again. This would also help to resolve collisions that occur when two nodes wait the same time to retransmit.

So the ping code would look like this (the FPAR code similar):

 smart_wait();
 if ( !transportRouteMessage(build(_msgTmp, _nc.nodeId, sender, NODE_SENSOR_ID, C_INTERNAL, I_PONG, false).set((uint8_t)0x01))){
      smart_wait(MY_SMART_WAIT_INC);
  }

And the 2 smart_wait functions like this:

static int smart_wait_time = 0;
void smart_wait(){
        // add some randomness to the delay to resolve collisions
        delay(smart_wait_time + (hwMillis() & 0xff));
#if defined(MY_GATEWAY_FEATURE)
        // reduce slowly
        if ( smart_wait_time > 0 ) smart_wait_time--;
#endif
}

void smart_wait(int increment){
// on gateways we increment the delay because of slow sensor nodes
// nodes can use simple random wait times
#if defined(MY_GATEWAY_FEATURE)
        if ( smart_wait_time < MY_SMART_WAIT_MAX){
                smart_wait_time += increment;
                debug(PSTR("TSP:MSG:SMART_WAIT INC (VALUE=%d)\n"), smart_wait_time);
        }
#endif
}

And the definitions:

/**
 * @def MY_SMART_WAIT_INC
 * @brief number of milli seconds the delay time between reception and sending will be increased on unsuccesfull reply 
 */
#ifndef MY_SMART_WAIT_INC
#define MY_SMART_WAIT_INC 20
#endif

/**
 * @def MY_SMART_WAIT_MAX
 * @brief max number of milli seconds the delay time between reception and sending will be increased on unsuccesfull reply 
 */
#ifndef MY_SMART_WAIT_MAX
#define MY_SMART_WAIT_MAX 1500
#endif
RalfJL commented 7 years ago

But smart_wait() did not help alone in case of the RFM69. I found that the acknowledge to any package was sent too fast. So even a regular message like the update of a voltage was hard to receive.

The Solution: In myTransmissionRFM69,cpp I had to add a delay between reception and sending the ack. 1 milli was not enough, 2 millis hardly, 3 millis did the trick with my 16MHz Arduino nano. So the code now looks like this:

 uint8_t transportReceive(void* data) {
        memcpy(data,(const void *)_radio.DATA, _radio.DATALEN);
        // Send ack back if this message wasn't a broadcast
        if (_radio.TARGETID != RF69_BROADCAST_ADDR)
                _radio.ACKRequested();
                delay(3);
                _radio.sendACK();
        return _radio.DATALEN;
}    
cimba007 commented 7 years ago

I thought about the idea of a function like smart_wait() too! In the end I dropped the idea. Having a dynamic delay would make the code work "on chance". Reproducing and debugging might become much harder as it will "sometimes" work (when the delay is high enough) and sometimes it would not work.

I think the hwMillis() & 0xFF is too much! 0x0F might be acceptable but why should we add a delay when it is not needed? Maybe use something like this:

#define MY_GATEWAY_ADDITIONAL_DELAY 4
...
delay( smart_wait_time > 0 ? smart_wait_time + (hwMillis() & 0x0f) : MY_GATEWAY_ADDITIONAL_DELAY)

instad of

delay(smart_wait_time + (hwMillis() & 0xff));

This would add a random delay only if the smart_delay was engaged for any reason. Thus the smart_delay would deal with random collisions and with "slow_nodes".

The code in your latest post looks wrong to me.

  1. you forgot the curly brakes .. the if will only work on the NEXT line .. so the ack_requested .. and thus be pretty useless as even ackrequested would do nothing but return a true/false
        if (_radio.TARGETID != RF69_BROADCAST_ADDR)
                _radio.ACKRequested();
                delay(3);
                _radio.sendACK();

Maybe you wanted something like this?

        if (_radio.TARGETID != RF69_BROADCAST_ADDR)
        {
               // The targetid is not broadcast .. so replay with an ack if needed
                if(_radio.ACKRequested())
                {
                     // Wait before sending ACK
                     delay(3);
                     _radio.sendACK();
               }
        }

This code might still be wrong as I currently have no RF69 in use.

tekka007 commented 7 years ago

A dynamic/adaptive wait() is certainly interesting; what about node-type specific delays (e.g. using a RAM/E2P table to keep track of battery-powered nodes), i.e. if destination is a battery-powered node execute delay before replying to FPAR_REQ, ID_REQ, PING, etc. This is beneficial for large setups with a variety of nodes.

RalfJL commented 7 years ago

I thought about some kind of memory too. And best would be, if the node could tell it's speed to use some default parameters with "slow" nodes. But is the memory large enough to hold up to 255 nodes? And than these parameters are different for every node-gateway-transmitter combination and I think that matrix is a pain. In my opinion it would be perfect, if the gateway and maybe the nodes can dynamically adjust to any situation. The question that comes to me is, if smart_wait() can handle every situation. My 1Mhz node (debugging enabled) pushes the limit easily to the max (1500 millis) limit because it is trapped in some kind of loop, when no one is listening on the serial line. When I start the IDE on the serial line the node works great and the delay is not increased.

cimba007 commented 7 years ago

Is your node by change a leonardo or other kind of atmega32u4? Most default sketches use something like this:

  while (!Serial) {
    ; // wait for serial port to connect. Needed for native USB port only
  }

Remove this code if you don't intend to connect the usb-cable.

Regarding the smart_wait() .. maybe the solution should not bee too complicated. One idea is that the every node passed some more parameters at registration on the controller/gateway. Like designeted SPI-bus speed or some number signaling the required delay of the node.

This information is then stored in some kind of binary-structure.

32x8bytes is enough space to store a flag for every possbile node: "slow-mode" .. if a finer solution is needed maybe 2x32*8bytes are enough .. = 64byte with 2 bits per node.

This would be enough to distinguish 4 different speed levels.

In addition to that .. the combination of nodes should be not too big.

Assume a gateway is either normal arduino @ 16mhz (no need to clock down), maybe some kind of esp8266 or an raspberry pi .. but in the end the SPI-Bus speed which is mostly responsible for the problems is in most cases between 500khz (1mhz node) and maybe 2-4mhz (esp8266). I think even an raspberry pi should stay somewhere near the default arduino SPI-bus speed for the sake of compability.

Having the smart_wait() amount fixed is inevitable in my oppinion. For the gateway there is little to no way to know if a packetloss is caused by the speed difference or by the bad reception if he doesn't have the information about the other node in some lookup table.