glenn20 / micropython-espnow-utils

Some utilities for use with espnow and wifi on micropython
MIT License

scan sets channel 14 - invalid except in Japan #2

Open peterhinch opened 1 year ago

peterhinch commented 1 year ago

On the ESP32 reference board, scan fails with

RuntimeError: Wifi Unknown Error 0x0102

This is fixed by changing the channel range to (1, 14), which covers channels 1 to 13. Note that channels 12 and 13 are deprecated in the US (ref) - this is probably not an issue.

peterhinch commented 1 year ago

A further observation. On ESP32 scan finds the correct channel initially, but after a WiFi outage subsequent invocations sometimes return an incorrect channel (out by 1). When this occurs ESPNow transmissions become unreliable with send occasionally reporting success when it has in fact failed. I have never observed this behaviour when the channel is correct.

glenn20 commented 1 year ago

Thanks for the reports Peter (and apologies for the slow response - I wasn't getting alerts on issues on this repo).

Re. the channel issue, that's odd - it works fine for me here to include channel 14 and as far as I can tell, micropython does not implement the country restrictions on esp32. I'll keep an eye on that.

Re. the out-by-1, I can understand that might happen. I use a really simple strategy to identify the central channel. Generally, the e.send() in espnow_scan.scan() will succeed on the real channel and also on the channels either side. I find it common to get success on one channel or three, but if there is success on two or four channels, I need to guess which one to use. I'll add some more robust checks (repeated tests on those channels for which the send() is successful). Wifi relies on the AP broadcasting the channel information, so the receiver can switch to that channel with certainty in spite of the presence of cross-talk. We can't do that in general for espnow without imposing a comm protocol onto the espnow traffic (which is what I do for my own usage).
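A minimal sketch of the "pick the central channel" idea described above. It is not the code in espnow_scan.py: the function name and the dict-of-booleans input format are assumptions made for illustration. It returns the middle channel when one or three adjacent channels succeed, and flags the result as ambiguous otherwise.

```python
def pick_centre_channel(successes):
    """Pick the middle of the channels on which send() succeeded.

    successes: dict mapping channel number -> bool (hypothetical format).
    Returns (channel, ambiguous); channel is None if nothing responded.
    """
    hits = sorted(ch for ch, ok in successes.items() if ok)
    if not hits:
        return None, True
    contiguous = hits[-1] - hits[0] == len(hits) - 1
    if contiguous and len(hits) % 2 == 1:
        # 1 or 3 adjacent hits: the centre is unambiguous
        return hits[len(hits) // 2], False
    # Even number of hits (or a gap): two candidates are equally likely,
    # so report the lower-middle one and flag that a re-test is needed.
    return hits[(len(hits) - 1) // 2], True
```

A caller could re-test only the flagged candidates with repeated sends before committing to a channel.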

peterhinch commented 1 year ago

as far as I can tell, micropython does not implement the country restrictions on esp32.

Is it possible that the underlying RTOS has country variants? In any event the error could be trapped.

peterhinch commented 1 year ago

I'm posting the following as a general observation which might be of interest in the context of scan algorithms.

Seon suggested running with a high channel number and reduced TX power, which does seem to have greatly improved reliability. He's also sent me a couple of samples of the latest PCB revision (P8) which (on his advice) I'm testing at full power on a low channel. These also seem fine. I gather the S3 chip has problems with mutual interference between its radio and the on-chip USB.

I've also been investigating how to cope with channel changes. In situations where the channel number may change dynamically, on power up a node issues station interface connect(). The node doesn't access any resources on the net, using ESPNow from then onwards. If the AP channel changes I typically see about five instances where an ESPNow send returns False, before settling down to return True. The odd thing is that the transmissions which return False have actually succeeded. This behaviour occurs even if there is a large change in channel, from 3 to 9. (My test script sends a message every 3s).

I find this degree of cross-channel response surprising. Especially given that transmitter and receiver are in different rooms and on different floors.

If the send return value is True, in my testing there is about a 1% chance that the transmission actually failed. A False return value is meaningful if the channel is correct otherwise it's a complete lottery. I guess I can work round this: if a micropower node wakes up, sends an ESPNow message and gets False it assumes the channel has changed and does a WiFi connect (guzzling power). It saves the channel to a file (or NVS on ESP32) for subsequent wakes.
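The wake-up strategy described above could be sketched as follows. This is an illustration under stated assumptions, not tested on hardware: send, set_channel and connect_and_get_channel are hypothetical callables the application would supply (for example wrappers round ESPNow.send() and a WiFi connect), and the cache file name is invented.

```python
def send_on_wake(msg, send, set_channel, connect_and_get_channel,
                 cache_path="channel.txt"):
    """Try the cached channel first; fall back to a WiFi connect on failure.

    Returns (channel, ok) where ok is the result of the final send.
    """
    try:
        with open(cache_path) as f:
            channel = int(f.read())
    except (OSError, ValueError):
        channel = None  # No usable cache: first boot or corrupt file

    if channel is not None:
        set_channel(channel)
        if send(msg):
            return channel, True  # Cheap path: no WiFi connect needed

    # send() returned False (or there was no cache): assume the AP has
    # changed channel and pay for a full WiFi connect to rediscover it.
    channel = connect_and_get_channel()
    with open(cache_path, "w") as f:
        f.write(str(channel))  # Remember the channel for subsequent wakes
    return channel, bool(send(msg))
```

On ESP32 the file could be swapped for NVS storage; the control flow stays the same.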

Any thoughts prompted by these ramblings would be welcome :)

glenn20 commented 1 year ago

@peterhinch : I have pushed an update to espnow_scan.py with a new scan methodology.

In all my tests of the new methodology there has always been only one candidate with a response rate close to 100%.

Please try the new method and see if this works more reliably for you. Thanks.

glenn20 commented 1 year ago

I recommend you set verbose=True to see the details of the channel scanning results. You can set num_pings larger if you want to increase statistical confidence in each channel scan.

import espnow_scan as scan

scan.scan(b'abcdef1234', num_pings=10, verbose=True)

glenn20 commented 1 year ago

I'm posting the following as a general observation which might be of interest in the context of scan algorithms.

Thanks for taking the time to post these - these are very useful.

Seon suggested running with a high channel number and reduced TX power, which does seem to have greatly improved reliability. He's also sent me a couple of samples of the latest PCB revision (P8) which (on his advice) I'm testing at full power on a low channel. These also seem fine. I gather the S3 chip has problems with mutual interference between its radio and the on-chip USB.

I've also been investigating how to cope with channel changes. In situations where the channel number may change dynamically, on power up a node issues station interface connect(). The node doesn't access any resources on the net, using ESPNow from then onwards. If the AP channel changes I typically see about five instances where an ESPNow send returns False, before settling down to return True. The odd thing is that the transmissions which return False have actually succeeded. This behaviour occurs even if there is a large change in channel, from 3 to 9. (My test script sends a message every 3s).

I find this degree of cross-channel response surprising. Especially given that transmitter and receiver are in different rooms and on different floors.

In my tests, I have found that cross-channel response over many channels does occur occasionally, and will often be in a burst of several packets. I haven't isolated a cause as it is sporadic and I'm not sure this is the same phenomenon as you are seeing.

It is quite possible for a transmission to be successfully received by the target device, but for the ACK packet to be lost before it reaches the sending device. This can be due to any of the usual causes of wifi packet loss affecting the ack packet, but not the sent packet. However, this will happen at a high rate if the two devices are operating on different (perhaps adjacent) channels: sent packets may be received at, say, 30% efficiency and the acks will also be received at 30% efficiency of that. Again, I don't know if that is related to your scenario. I generally find that if I am sure the devices are operating on the same channel and there is not some obvious source of interference (eg. microwave oven nearby) then the rate of successfully delivered messages for which the send does not receive an ack is "very low", ie. in a clean environment, thousands of messages successfully received and acked over many hours with no dropped acks.

But environments are not always so clean, and the application should be robust to lost messages (ie. either the sender ignores infrequent missed acks from the receiver, or the receiver application is resilient to occasional re-transmissions from the sender).
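One way a receiver might be made resilient to re-transmissions is to drop duplicates by sequence number. The sketch below assumes, hypothetically, that the sender prefixes each payload with a one-byte sequence number and re-sends with the same number after a missed ack; neither the class nor that framing comes from this repo.

```python
class Deduplicator:
    """Drop messages already seen, keyed on sender MAC and sequence number."""

    def __init__(self):
        self._last_seq = {}  # sender MAC -> last sequence number accepted

    def accept(self, mac, payload):
        """Return the message body, or None if this is a re-transmission."""
        if not payload:
            return None  # Nothing to parse
        seq, body = payload[0], payload[1:]
        if self._last_seq.get(mac) == seq:
            return None  # Same sequence number as last time: duplicate
        self._last_seq[mac] = seq
        return body
```

The receiver would pass each (mac, msg) pair it reads from ESPNow through accept() and act only on non-None results.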

I've also found that the amount of cross-talk remains high at greater distances between sender and receiver.

glenn20 commented 1 year ago

If the send return value is True, in my testing there is about a 1% chance that the transmission actually failed.

This scenario is difficult to explain (short of bugs in my or the underlying espressif ESP-NOW code) - if send() returns True (and sync is not set to False), then the ESP32 is reporting that an ack packet was received from the target device. The only cause I am aware of is that the packet has been dropped at the application layer because the rx_buffer on the receiver has overflowed. In this case ESPNow.stats() on the receiver will return a non-zero value for the last number in the tuple.

You might find ESPNow.stats() useful for tracking the number of acks, dropped packets, etc., and it may be helpful in diagnosing issues. These are updated from the ISR.
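A small helper can make the stats tuple easier to read. The field names below follow the order given in the MicroPython espnow documentation as I understand it (the last entry being the dropped-packet count mentioned above); treat the names as an assumption and check them against your firmware. The helper itself is hypothetical.

```python
# Assumed field order of the tuple returned by ESPNow.stats()
STATS_FIELDS = ("tx_pkts", "tx_responses", "tx_failures",
                "rx_packets", "rx_dropped_packets")


def summarise_stats(stats):
    """Label the stats tuple and derive a transmit failure rate."""
    named = dict(zip(STATS_FIELDS, stats))
    responses = named["tx_responses"]
    # Fraction of send responses that reported failure (0.0 if none yet)
    named["tx_failure_rate"] = (
        named["tx_failures"] / responses if responses else 0.0)
    return named
```

On a device this would be called as summarise_stats(e.stats()), where e is the ESPNow instance.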

A False return value is meaningful if the channel is correct otherwise it's a complete lottery.

Yes - this is to be expected. If the packet transmit success rate for cross-channel comms is say 30%, then we also have only a 30% chance of success of the ack packet from the receiver.
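The arithmetic behind this can be written out. The sketch assumes the outgoing packet and the returning ack are lost independently and ignores any retransmissions by the radio, so it is only a rough model; the function name is invented.

```python
def cross_channel_outcomes(p_packet, p_ack):
    """Outcome probabilities for one send, assuming independent losses."""
    delivered_and_acked = p_packet * p_ack        # send() would report True
    delivered_not_acked = p_packet * (1 - p_ack)  # received, yet reports False
    not_delivered = 1 - p_packet                  # reports False, correctly
    return delivered_and_acked, delivered_not_acked, not_delivered
```

With both rates at 0.3 this gives roughly 9% acked, 21% delivered but reported as failed, and 70% not delivered, which is why a False return says little when the channel is wrong.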

You really want to be operating on the same channel!

My espnow_scan.py is intended to be a way to find a device assuming no specific communication protocol with the target device (ie. no specific response is required from the device, other than the automatic ESP-NOW ack). My recent push should make this much more reliable, but it is much better to have the target respond to a "ping" with a message containing its current operating channel.

After all, a wifi AP continually broadcasts its operating channel. A device that wants to pair with it may receive the message on a different channel, but knows what channel it should select for communication with the AP.

I guess I can work round this: if a micropower node wakes up, sends an ESPNow message and gets False it assumes the channel has changed and does a WiFi connect (guzzling power). It saves the channel to a file (or NVS on ESP32) for subsequent wakes.

For my apps:

This requires an agreed communication protocol between the devices.

I actually have another layer of fallback beyond that, where the sensor device will initiate a discovery request using broadcasts across all channels to identify which devices provide the named "service" (so deployed devices can continue to operate if I need to change devices). In this case, the sender will authenticate the new "service" node before accepting it as the new target.

Running your mqtt gateway on an ethernet connected esp32 device makes all this a LOT simpler :wink:.

peterhinch commented 1 year ago

Thanks for that - I hadn't thought of the ping-pongX approach - send ping on each channel until you get a pongX and use channel X. This should save power compared to doing a WiFi connect.
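The ping-pongX idea can be sketched as a simple loop. The three callables are hypothetical hooks the application would provide (wrappers round the channel setting, ESPNow.send() and a timed receive); the message format of the pong is not specified here.

```python
def find_channel(set_channel, send_ping, wait_pong, channels=range(1, 14)):
    """Ping on each channel until a pong reports the peer's real channel.

    wait_pong() returns the channel number carried in the reply, or None
    on timeout.
    """
    for ch in channels:
        try:
            set_channel(ch)
        except RuntimeError:
            continue  # Channel rejected by the driver: skip it
        send_ping()
        reported = wait_pong()
        if reported is not None:
            # The pong may have been heard via cross-talk on a neighbouring
            # channel, so trust the number in the reply rather than ch.
            set_channel(reported)
            return reported
    return None  # No peer answered on any channel
```

Because the reply names the channel, a pong heard on a side channel still leads to the correct setting.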

My gateway remains a WIP partly because I've spent time chasing hardware issues with the FeatherS3. The P8 release boards seem good but the earlier releases are picky about channels and txpower.

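gateway.py is reasonably stable but it's taking me a while to figure out the best way to support the nodes. There are a number of variables:

  • Node apps may run continuously or be micropower.
  • Continuously running apps may be synchronous or asynchronous.
  • Need to support ESP8266 (no NVS) and ESP32.
  • Channel may be fixed or variable.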

glenn20 commented 1 year ago

gateway.py is reasonably stable but it's taking me a while to figure out the best way to support the nodes. There are a number of variables:

  • Node apps may run continuously or be micropower.
  • Continuously running apps may be synchronous or asynchronous.
  • Need to support ESP8266 (no NVS) and ESP32.
  • Channel may be fixed or variable.

Another option for saving data across deepsleep is rtc memory - I've found that quite useful for saving data really fast. Especially with wake stubs.
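As an illustration of the RTC memory option, the sketch below stores the channel behind a marker byte. It assumes the port provides RTC.memory() (as I believe the ESP32 and ESP8266 ports do); the rtc object is passed in, and the marker value and function names are invented for the example.

```python
MAGIC = 0xC5  # Arbitrary marker byte (an assumption) to detect stale data


def save_channel(rtc, channel):
    """Store the channel in RTC memory, which survives deep sleep."""
    rtc.memory(bytes([MAGIC, channel]))


def load_channel(rtc):
    """Return the saved channel, or None if RTC memory holds nothing valid."""
    data = rtc.memory()
    if len(data) >= 2 and data[0] == MAGIC and 1 <= data[1] <= 14:
        return data[1]
    return None
```

On a device, rtc would be machine.RTC(); the contents are lost on a full power cycle, so a None result should trigger a fresh channel search.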

I really hope to get some time soon...it's been a bit exasperating not having the time I expected to finish up some of this work.

peterhinch commented 1 year ago

If the send return value is True, in my testing there is about a 1% chance that the transmission actually failed.

In the light of your comments I reviewed the code used in this test. The 1% figure included (and probably consisted entirely of) simple transmission failures. Apologies for the bum steer.

Thanks for the pointer re rtc memory :)

glenn20 commented 1 year ago

Just a heads-up.

I thought I had managed to come up with a reliable method for determining the channel (without requiring channel information to be transmitted back from the peer) in the last commit to espnow_scan.py, but that turned out to be optimistic. It is much more reliable than the previous methodology and worked reliably for hours of tests, but further testing revealed that I sometimes get 100% response rates on the correct channel and the side channels as well. The behaviour is hard to predict - it will occur repeatedly for long periods and then stop. The method will still work fine in that case as it picks the middle channel, but occasionally I get 0-20% on one side channel and 100% on the other side channel and then it becomes a dice throw.

I think an approach that adapts based on level of certainty from the initial scan would be the best approach.
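One possible shape for such an adaptive approach is sketched below. It is speculative and not what espnow_scan.py does: ping_rate is a hypothetical callable returning the fraction of pings acknowledged on a channel, and the threshold and doubling schedule are illustrative values only.

```python
def adaptive_scan(ping_rate, channels=range(1, 14), num_pings=10,
                  max_pings=80, threshold=0.9):
    """Re-scan with more pings until one or three adjacent channels stand out.

    ping_rate(channel, n) returns the fraction of n pings acknowledged.
    Returns the chosen channel, or None if the result stays ambiguous.
    """
    while True:
        strong = [ch for ch in channels
                  if ping_rate(ch, num_pings) >= threshold]
        adjacent = bool(strong) and strong[-1] - strong[0] == len(strong) - 1
        if adjacent and len(strong) in (1, 3):
            return strong[len(strong) // 2]  # Confident: take the centre
        if num_pings >= max_pings:
            return None  # Still ambiguous: fall back to asking the peer
        num_pings *= 2  # Uncertain: gather more evidence and try again
```

Returning None rather than guessing lets the caller switch to a protocol-based method, such as having the peer report its channel.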

glenn20 commented 1 year ago

as far as I can tell, micropython does not implement the country restrictions on esp32.

Is it possible that the underlying RTOS has country variants?

Speculating - maybe espressif burns this information into the fuses of some devices sold in various jurisdictions.

In any event the error could be trapped.

Oh, BTW, I put in a trap for RuntimeError around sta.config(channel=X), which should help if the error is being thrown when setting the channel. Do you know if that is where the error is thrown?

peterhinch commented 1 year ago

[EDIT] I've now found the board which exhibits the fault. Here is the traceback

>>> espnow_scan.scan(b'$b\xab\xe6\xb0\xb5')
Found peer b'$b\xab\xe6\xb0\xb5' on channel 3.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "espnow_scan.py", line 50, in scan
  File "espnow_scan.py", line 13, in set_channel
RuntimeError: Wifi Unknown Error 0x0102

Trapping the exception enables it to run to completion:

    if sta.active() and sys.platform != "esp8266":
        try:
            sta.config(channel=channel)  # On ESP32 use STA interface
        except RuntimeError:
            pass
        return sta.config("channel")

I think your idea about fuses is probably correct.

I have adopted your suggestion of determining the channel by querying the gateway which works fine.

I've ordered a Nordic power profiler which should be an improvement over my attempts with a 1Ω resistor, DSO and visual integration. Any observations or tips?

glenn20 commented 1 year ago

Thanks Peter. That's very helpful.

glenn20 commented 1 year ago

Trapping the exception enables it to run to completion:

    if sta.active() and sys.platform != "esp8266":
        try:
            sta.config(channel=channel)  # On ESP32 use STA interface
        except RuntimeError:
            pass
        return sta.config("channel")

Thanks, I trap the error in my most recent update (however, I just realised I hadn't pushed it up to the repo - sorry about that - including the new methodology for scanning). VS Code has changed something and it wasn't as obvious that I hadn't pushed to the repo.

I have adopted your suggestion of determining the channel by querying the gateway which works fine.

I've ordered a Nordic power profiler which should be an improvement over my attempts with a 1Ω resistor, DSO and visual integration. Any observations or tips?

Off the top of my head:

I use it in two modes:

Oh - and it's such a good value device for these measurements. Have fun with it.

glenn20 commented 1 year ago

Oh - and the logic inputs are useful for timing as well (poor man's logic analyser).

peterhinch commented 1 year ago

There was a bit of a learning curve with the software, but it's a nice piece of kit with very neat minimalist physical design. The digital inputs could prove useful for identifying specific points in the power up sequence. I have a Saleae LA for more general use. I started out with a £6 Chinese Saleae clone which worked well until I needed pre-trigger capture. The Saleae is quite superb but not cheap.

I have one purely academic query. When I first tried to use it, the s/w directed me to download and run nrf-udev_1.0.1-all.deb to install a udev rule. This worked, but oddly I can't find the rule, either in /etc/udev/rules.d or in /usr/lib/udev/rules.d. Do you know where it is?

glenn20 commented 1 year ago

There was a bit of a learning curve with the software, but it's a nice piece of kit with very neat minimalist physical design. The digital inputs could prove useful for identifying specific points in the power up sequence. I have a Saleae LA for more general use. I started out with a £6 Chinese Saleae clone which worked well until I needed pre-trigger capture. The Saleae is quite superb but not cheap.

Yes, it looks like a nice bit of kit. I used to have access to LAs at work, but not any more. I've managed to get by so far.

I have one purely academic query. When I first tried to use it, the s/w directed me to download and run nrf-udev_1.0.1-all.deb to install a udev rule. This worked, but oddly I can't find the rule, either in /etc/udev/rules.d or in /usr/lib/udev/rules.d. Do you know where it is?

dpkg -L nrf-udev responds with: /lib/udev/rules.d/99-mm-nrf-blacklist.rules and /lib/udev/rules.d/71-nrf.rules. All it does is to make /dev/ttyACM* world readable and writable. A dubious choice.

zcattacz commented 9 months ago

I thought I had managed to come up with a reliable method for determining the channel (without requiring channel information to be transmitted back from the peer) in the last commit to espnow_scan.py, but that turned out to be optimistic. It is much more reliable than the previous methodology and worked reliably for hours of tests, but further testing revealed that I sometimes get 100% response rates on the correct channel and the side channels as well. The behaviour is hard to predict - it will occur repeatedly for long periods and then stop. The method will still work fine in that case as it picks the middle channel, but occasionally I get 0-20% on one side channel and 100% on the other side channel and then it becomes a dice throw.

In access point controllers, channels are usually grouped. Only 3-4 central frequencies out of the 14 channels are configured for neighboring APs to hop around. Sounds like the side channel issue and scanning speed could benefit from this strategy.
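A scan could apply this by trying the grouped channels first. The sketch assumes the common 1, 6, 11 plan for the primary group, which is a widespread convention rather than something stated in this thread; the function name is invented.

```python
def grouped_scan_order(primary=(1, 6, 11), channels=range(1, 14)):
    """Return channels with the commonly configured ones first."""
    first = [ch for ch in primary if ch in channels]
    rest = [ch for ch in channels if ch not in primary]
    return first + rest


print(grouped_scan_order())
# [1, 6, 11, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13]
```

If the AP is on a primary channel, the scan can stop after at most three probes instead of working through all thirteen.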