gioblu / PJON

PJON (Padded Jittering Operative Network) is an experimental, arduino-compatible, multi-master, multi-media network protocol.
Other
2.73k stars 239 forks source link

GlobalUDP on ESP8266 floods the network #308

Closed ulno closed 4 years ago

ulno commented 4 years ago

Just trying to send some packages from my Wemos D1 Mini (ESP8266, very similar to NodeMCU) to a small PJON python receiver on my desktop. I can send the package, however looking at the Network device with tcpdump shows a very weird behavior:

00:57:52.028368 IP 192.168.12.9.7000 > 192.168.12.1.7000:  [|rx] (0)
    0x0000:  f8d1 110c 9230 a020 a60b a237 0800 4500
    0x0010:  001c d98f 0000 ff11 48e6 c0a8 0c09 c0a8
    0x0020:  0c01 1b58 1b58 0008 2fd3
00:57:52.028657 IP 192.168.12.9.7000 > 192.168.12.1.7000:  [|rx] (0)
    0x0000:  f8d1 110c 9230 a020 a60b a237 0800 4500
    0x0010:  001c d990 0000 ff11 48e5 c0a8 0c09 c0a8
    0x0020:  0c01 1b58 1b58 0008 2fd3
00:57:52.028991 IP 192.168.12.9.7000 > 192.168.12.1.7000:  [|rx] (0)
    0x0000:  f8d1 110c 9230 a020 a60b a237 0800 4500
    0x0010:  001c d991 0000 ff11 48e4 c0a8 0c09 c0a8
    0x0020:  0c01 1b58 1b58 0008 2fd3
00:57:52.029459 IP 192.168.12.9.7000 > 192.168.12.1.7000:  [|rx] (0)
    0x0000:  f8d1 110c 9230 a020 a60b a237 0800 4500
    0x0010:  001c d992 0000 ff11 48e3 c0a8 0c09 c0a8
    0x0020:  0c01 1b58 1b58 0008 2fd3

I see UDP packages coming in every 0.3 ms, meaning about 30000 messages per second. I highly doubt that this is very efficient.

This problem also occurs using the Arduino esp8266 Transmitter example.

Is this expected, or is there a way to prevent this?

ulno commented 4 years ago

It seems to be limited to the esp8266, running the same example on an esp32 only shows a boost of 3-4 packages when data is sent. The described problem starts on the esp8266 before sending actual packages. The bus.begin() call is basically open the flood gates.

ulno commented 4 years ago

That's not true - begin just enables it. The actual package seems to be generated in bus.update()

gioblu commented 4 years ago

Ciao @ulno thank you very much for reporting this and running an extended analysis. Can you check if the PJON_ESP constant is defined here: https://github.com/gioblu/PJON/blob/master/src/strategies/GlobalUDP/GlobalUDP.h#L125 ? It may help to rule out that the problem is not be related to the back-off.

ulno commented 4 years ago

I added #pragma message "**** iotempower check PJON_ESP defined ****" after #ifdef PJON_ESP and indeed this is defined as the message shows up during compilation.

So it might be not a problem in back_off?

gioblu commented 4 years ago

Thank you for your feedback @ulno . Would you share the program you use on transmitter side?

ulno commented 4 years ago

Sure, it's basically the normal example compiled with Lolin(Wemos) D1 R2 & mini mini board settings in Arduino IDE.

transmitter:

#define PJON_ESP
#include <PJON.h>

// Address of remote device
uint8_t remote_ip[] = { 192, 168, 12, 1 };

// <Strategy name> bus(selected device id)
PJON<GlobalUDP> bus(45);

const char* ssid     = "secret";
const char* password = "secret";

uint32_t cnt = 0;
uint32_t start = millis();

void receiver_function(uint8_t *payload, uint16_t length, const PJON_Packet_Info &packet_info) {
  if (payload[0] == 'u') cnt++;
}

void setup() {
  Serial.begin(115200);
  Serial.println("Transmitter started.");
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.printf("Now listening at IP %s\n", WiFi.localIP().toString().c_str());

  bus.strategy.add_node(1, remote_ip);
  bus.set_receiver(receiver_function);
  bus.begin();
  bus.send_repeatedly(1, "ulno", 4, 2000000); // Send P to device 44 repeatedly
}

void loop() {
  bus.update();
  bus.receive();

  if (millis() - start > 1000) {
    Serial.print("PONG/s: "); Serial.println(1000.0f*float(cnt)/float((uint32_t)(millis()-start)));
    start = millis();
    cnt = 0;
  }
};

receiver:

import pjon_cython as PJON

class GlobalUDP(PJON.GlobalUDP):
    # you can overload __init__ if you want
    def __init__(self, device_id):
        PJON.GlobalUDP.__init__(self, device_id)
        self.packets_received = 0
    def receive(self, data, length, packet_info):
        print()
        print ("Recv ({}): {}".format(length, data))
        print (packet_info)
        self.packets_received += 1
        self.reply(b'P')

g = GlobalUDP(1)

#g.loop()

#g.send(1,b"localtest")

while True:
    packets_to_send, receive_status = g.loop()
    print("\rPackets received:", g.packets_received, end="");
ulno commented 4 years ago

This is my tcpdump command on the linux accesspoint started with hostapd and dnsmasq through create_ap: tcpdump -xx -n -i ap0

fredilarsen commented 4 years ago

I have not had the time to look into this yet. @ulno Can you try letting GlobalUDP.get_max_attempts() return 1 instead of 10 and see if that changes anything?

fredilarsen commented 4 years ago

As a starting point I used tcpdump on my Windows server running my ModuleInterface master and web server for my home automation system, filtering out only the relevant port number. I did see only single packets, no duplications in either direction.

There are obvious differences in the setups, like DualUDP instead of GlobalUDP (but these are nearly identical except for the startup autolocation part of DualUDP). Also it is not WiFi but cabled Ethernet, and my (router) device is an Arduino Nano and not an ESP8266.

But as a starting point I see no problems for the Windows-Arduino communication. I must set up a ESP8266 module to see if that introduces problems. Next step is to use GlobalUDP instead of DualUDP (that will be a step back, have to assign IP addresses manually again...). Last step is to use PJON-cython instead of my master program compiled from C++ in Visual Studio.

@ulno On your side, could you try to build the C++ program instead of using PJON-cython, just to eliminate that as a problem source? I did read your statement that ESP32 worked better, but still...

ulno commented 4 years ago

Kind of at a loss here:

  1. I changed GlobalUDP.get_max_attempts() to return 1 -> re-flashed, no change
  2. I ran the c-receiver (i cloned PJON repo, adjusted settings of GlobalUDP receiver example, and run make in GlobalUDP example, and then run the receiver) 2 did though not show any received pings (the python program did). Is this due to running 11.1 and head mixed?

The network flooding also still occurs. What is actually this package, I see on tcpdump? As you see it sends now the package always "just" twice (is this because I changed the attempt count?):

11:52:22.713921 IP 192.168.12.9.7000 > 0.0.0.0.0:  [|rx] (0)
    0x0000:  ffff ffff ffff a020 a60b a237 0800 4500
    0x0010:  001c 6516 0000 ff11 8a09 c0a8 0c09 0000
    0x0020:  0000 1b58 0000 0008 17d5
11:52:22.713935 IP 192.168.12.9.7000 > 0.0.0.0.0:  [|rx] (0)
    0x0000:  ffff ffff ffff a020 a60b a237 0800 4500
    0x0010:  001c 6516 0000 ff11 8a09 c0a8 0c09 0000
    0x0020:  0000 1b58 0000 0008 17d5
11:52:22.715499 IP 192.168.12.9.7000 > 0.0.0.0.0:  [|rx] (0)
    0x0000:  ffff ffff ffff a020 a60b a237 0800 4500
    0x0010:  001c 651e 0000 ff11 8a01 c0a8 0c09 0000
    0x0020:  0000 1b58 0000 0008 17d5
11:52:22.715511 IP 192.168.12.9.7000 > 0.0.0.0.0:  [|rx] (0)
    0x0000:  ffff ffff ffff a020 a60b a237 0800 4500
    0x0010:  001c 651e 0000 ff11 8a01 c0a8 0c09 0000
    0x0020:  0000 1b58 0000 0008 17d5

I see these packages independently if I run a receiver (c or python) or not.

fredilarsen commented 4 years ago

@ulno Master is at this point not 11.2 but 12.0 soon to be released so I am not sure of compatibility as this is a major version step. I would definitively try the same version in all devices.

What is strange here is that the recipient is 0.0.0.0 (broadcast) while it is a specific IP in your first capture.

Also, it seems that get_max_attempts() returns the number of retries, not including the first send. Do you get the duplicates (the ones that are very close in time) eliminated by setting it to 0?

"ulno" is 75 6c 6e 6f in hex. It is strange that I do not see this in the packet capture. Can you make it print the length of each packet as well?

ulno commented 4 years ago

I have been using mixed versions for my latest tests (but not for the start as cython version is 11.2). Which version are you using?

Agreeing regarding the broadcast, I thought that was strange too.

I might be able to remove the duplicates, however they are not the issues as also the follow-up package after 2 microseconds, so the duplicates are not really the issue here.

You don't see "ulno" as the problem occurs also without ever sending a package.

I can leave out bus.send_repeatedly(1, "ulno", 4, 2000000); and see messages sent every time I call bus.update() - I assume it sends out here 1+[number of retry attempts] broadcast packages.

You see the full content of the package. So it's 42 bytes (is that a coincidence?!).

fredilarsen commented 4 years ago

Hmm, I am thinking of another experience I had when working on the ESP8266 with these strategies. I was expecting the LocalUDP to run fine as it does on Arduino, Windows and Linux, but it nearly does not work at all on ESP8266. Abysmal performance. It is based on UDP broadcasts. My conclusion/theory was that I could not get it to work properly because UDP broadcasting must be broken for the ESP8266 either in the library or in some firmware. But it could have been a packet flood of this kind that made my LocalUDP broadcasts disappear.

With GlobalUDP which does not use broadcasts but directed packets, it worked much better at that time. But I did not capture packets so something could have happened without me seeing it.

I think the next step must be to create a non-PJON related sketch sending UDP packets and see if that is working as expected without the extra packets.

But the fact that calling update sends a packet even if there are none in the queue requires investigation, as it should not be possible because update() is written to do nothing if there are no queued packets. Can you confirm that when not calling update, none of these packets are captured, and when calling update they are?

gioblu commented 4 years ago

I am sorry for the late response but I got struck by a strong flu, Ciao @ulno @fredilarsen reading your feedback makes me suspect the problem is present in GlobalUDP strategy or in the UDPHelper or something within the ESP8266. I have never seen packets sent by update with an empty buffer and reading the code that must be impossible thanks to this conditional here: https://github.com/gioblu/PJON/blob/master/src/PJON.h#L981 and the fact that buffer data is correctly initialized at startup here: https://github.com/gioblu/PJON/blob/master/src/PJON.h#L878 and it is correctly cleared in case of deletion here: https://github.com/gioblu/PJON/blob/master/src/PJON.h#L514

gioblu commented 4 years ago

I have tried to compile this example:

#include <PJON.h>

// Address of remote device
uint8_t remote_ip[] = { 192, 1, 1, 163 };

// <Strategy name> bus(selected device id)
PJON<GlobalUDP> bus(45);

const char* ssid     = "secret";
const char* password = "secret";

uint32_t cnt = 0;
uint32_t start = millis();

void receiver_function(uint8_t *payload, uint16_t length, const PJON_Packet_Info &packet_info) {
  if (payload[0] == 'P') cnt++;
}

void setup() {
  Serial.begin(115200);
  Serial.println("Transmitter started.");
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.printf("Now listening at IP %s\n", WiFi.localIP().toString().c_str());

  bus.strategy.add_node(44, remote_ip);
  bus.set_receiver(receiver_function);
  bus.begin();
  //bus.send_repeatedly(44, "P", 1, 20000); // Send P to device 44 repeatedly
}

void loop() {
  bus.update();
  bus.receive();

  if (millis() - start > 1000) {
    Serial.print("PONG/s: "); Serial.println(1000.0f*float(cnt)/float((uint32_t)(millis()-start)));
    start = millis();
    cnt = 0;
  }
};

Commenting the send_repeatedly function as @ulno suggested. To see if an update call goes for some reason through also with an empty buffer I have added a couple of prints after line 981: https://github.com/gioblu/PJON/blob/master/src/PJON.h#L981

Serial.print("Packet ");
Serial.print(i);
Serial.print(" attempts ");
Serial.println(packets[i].attempts);

Result:

........Now listening at IP 192.168.1.106
PONG/s: 0.00
PONG/s: 0.00
PONG/s: 0.00
PONG/s: 0.00

As you can see I cannot replicate what @ulno reports, no packets are sent by update if no packets are scheduled to be sent, or inserted in the bufer.

ulno commented 4 years ago

Thanks @gioblu and @fredilarsen for taking a look. I also had the flu and then started a bit late at looking at PJON for my IoT classes, so I got stuck at the here reported problem. However, I am now on a week vacation with very little technology access and will then fly to Austria for a full IoT introduction class. I am not sure if I cover PJON this time as I mainly have esp8266 devices and Raspberry Pis there and I have to dig deeper to pro-actively solve problems.

As I am the only one who can reproduce the problem, please feel free to delay your efforts until December,when I am back and can take a closer look again. I still need to test if it has to do with the hostapd based accesspoint, from linux to linux, or some other ideas why the package flood is triggered.

ulno commented 4 years ago

Good, I am back and have a bit time for testing, I am working with the latest example by @fredilarsen.

I am also seeing the network storm on a "real" wifi router from ubiquiti and an attached network switch to which my network is connected.

I made though an interesting new observation: When both the following lines in the example are disabled/uncommented, the network storm stops:

bus.send_repeatedly(44, "P", 1, 20000); // Send P to device 44 repeatedly // line 1 in setup
bus.receive(); // line 2 in loop

If I enable any of them the network storm starts. So, the storm does NOT happen in update, when send_repeatedly is disabled.

I am trying now to track down what is going on in receive (without send_repeatedly enabled), but I am getting very lost in receive. Any hints what and where I should debug?

ulno commented 4 years ago

More testing shows, the storm (or at least one package is triggered each time) is happening in the call udp.receive_frame(data, max_length); when called from receive_frame in GlobalUDP.h. I will continue to look...

ulno commented 4 years ago

Seems like udp.flush(); in UDPHelper_Arduino sends out a udp package each time it is called??! This is from WifiUDP class of Espressif.

gioblu commented 4 years ago

Ciao @ulno happy to hear from you again, consider that calling bus.send_repeatedly(44, "P", 1, 20000) will make PJON send a packet containing P every 20 MILLIseconds.

ulno commented 4 years ago

I am aware of that, however it's commented out, so why is a send of a package triggered in udp.flush()?

gioblu commented 4 years ago

Ciao @ulno thank you for your fast reply. If I understood correctly, that happens if you call just bus.receive() without ever calling bus.send or bus.send_repeatedly? That is quite strange.

What you mean with

Seems like udp.flush(); in UDPHelper_Arduino sends out a udp package each time it is called??! This is from WifiUDP class of Espressif.

gioblu commented 4 years ago

Ciao @ulno I see what you mean, it seems ESP8266's flush may behave differently and this could explain why this issue does not appear when using ESP32. Thank you I will dig further and try to replicate the issue.

fredilarsen commented 4 years ago

I see that strangely there is a flush call at the start of the receive call in UDPHelper for Arduino. That seems stupid, but I have written a comment about it:

#ifdef PJON_ESP
udp.flush(); // Empty receive buffer so it is prepared for new packet
#endif

So this must have been to solve an observed problem.

But obviously it seems to not only "Empty receive buffer" but probably also send a dummy packet, so it should be removed. Then I guess some problem may arise again, unless fixed in ESP libraries. But potentially the receive buffer can be emptied in some other way if still a problem.

I see there has been multiple fixes related to this in the ESP WiFiUdp.cpp file by looking at the history. Probably my fix was a circumvention that is no longer needed, also causing a problem.

fredilarsen commented 4 years ago

@ulno Can you help me to test if removing the flush() call in interfaces/ARDUINO/UDPHelper_ARDUINO.h fixes it all when combined with a recent ESP WiFi library? (I see relevant library fixes done in 2018 but none in 2019)

ulno commented 4 years ago

Ok, I will comment it out and test.

ulno commented 4 years ago

Looks good so far, but need to test some actual examples.

fredilarsen commented 4 years ago

Seems there is a big difference from ESP32 to ESP8266. ESP32 required every byte of each message to be read, otherwise it will not clear the read buffer and that will block any more incoming messages. flush() did just clear the read buffer, so I did use that to clear the buffer explicitly to avoid a packet larger then the receive buffer to stop the data flow. But that obviously does not flow well on ES8266 where flush does send a packet even if there is no unsent data.

gioblu commented 4 years ago

Ciao @ulno thank you very much for your detailed report. Reading @fredilarsen's code I see that the flush call on ESP8266 calling endPacket would send a packet if ctx is not empty. That would explain also the malformed/partial packets you have shared above.

ulno commented 4 years ago

Good news, I tested the the example with the Linux example as a receiver and all seems to work. So removing the flush seems to make UDP on the ESP8266 work.

One confusing note though: the example for Linux defines a port, the example for Arduino doesn't - just had to take a close that you cannot "just" mix platforms for testing. Maybe the examples should be more unified?

fredilarsen commented 4 years ago

I think the setting of the port in one example was to demonstrate the possibility. If you set up two groups of devices, and each group has a common port different from the other group, the two groups will be completely isolated from each other even if running on the same network.

But you are probably right, it is best to keep it minimalistic and working across all platforms without changes to the examples.

gioblu commented 4 years ago

Ciao @ulno I hope you are fine! I will close the issue here, please let us know how your experiments ended up. Thank you for your support.