micropython / micropython

MicroPython - a lean and efficient Python implementation for microcontrollers and constrained systems
https://micropython.org

Pico W - Multicast receive stops working until after `machine.reset()` #11294

Open t8y8 opened 1 year ago

t8y8 commented 1 year ago

Issue Description

When sending multicast packets back and forth between my Pi Pico W and my Windows 10 machine, the Pico initially works and receives packets sent to the multicast group, until you press CTRL-C, hit Stop, or leave it idle for a while. Eventually it gets into a state where it no longer receives the packets, even though it is still listening and presumably still a member of the multicast group.

After this point, the only way to get things working again is to machine.reset() or physically power cycle the device. Once you do, things will work again. I've confirmed the packets are all sending via wireshark even after the Pico stops receiving things.

I found https://github.com/micropython/micropython/issues/10812, which looks similar, but that issue was resolved by using SO_REUSEADDR. I am already doing that and still have the issue.

The most reliable way to reproduce:

  1. In Thonny, run the pico code
  2. In Windows Terminal, run the sender
  3. See it receive on the pico side. hit stop in Thonny.
  4. Hit start again (the sender script should still be going)
  5. It may work for 5ish more packets and then stop. If not, stop and start it another time or two
  6. Eventually you will be unable to receive the multicast packets, even if you call sock.close() in the repl and make a new socket
  7. Curiously, if I send a packet to the pico directly, or to the broadcast address (192.168.1.255) it will be received, even when in this state

Pico W Code (Receiver)

import network
import socket
from time import sleep
import machine

ssid = 'SSID'
password = 'PASSWORD'

def connect():
    #Connect to WLAN
    wlan = network.WLAN(network.STA_IF)
    wlan.active(True)
    wlan.connect(ssid, password)
    while not wlan.isconnected():
        print('Waiting for connection...')
        sleep(1)
    ip = wlan.ifconfig()[0]
    print(f'Connected on {ip}')
    return ip

def inet_aton(addr):
    return bytes(map(int, addr.split(".")))

###

ip = connect()

MCAST_GRP = '224.1.5.16'
MCAST_PORT = 9242

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Join the group: 4-byte group address followed by the 4-byte local interface address
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, inet_aton(MCAST_GRP) + inet_aton("192.168.1.72"))
sock.bind(('', MCAST_PORT))

try:
    while True:
        print('waiting to receive message...')
        data, address = sock.recvfrom(16)

        print(f"received {len(data)} bytes from {address}")
        print(f"{data}")
finally:
    sock.close()

Windows Code (Sender)

import time
import socket

MCAST_GRP = '224.1.5.16'
MCAST_PORT = 9242

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton("192.168.1.206"))

for i in range(100):
    s.sendto(b'some data1 %i' % i, (MCAST_GRP, MCAST_PORT))
    time.sleep(0.33)
s.close()

Hardware

dpgeorge commented 1 year ago

Thanks for the detailed report. I tried to reproduce the issue but could not:

t8y8 commented 1 year ago

Thanks for the prompt reply!

I tweaked the receiver.py (I call it multi.py) to match, but I can't use 0.0.0.0 on Windows or it gets cranky.

I repeated the experiment using mpremote and recorded the results. They are the same sequence of working > receive 5ish > no longer working.

https://user-images.githubusercontent.com/4370533/233274683-956f2405-c145-4475-a6b4-c3887c77eb3a.mp4

I am also attaching a packet capture, filtered to packets from the pi or from windows to the multicast group. I'm not an expert here but it seems to work fine. (Note the video and the packet capture are different attempts, so times won't match.) Attachment: capture.zip

Given that I can receive direct packets to the pi on that port, it feels like the issue is somewhere in the multicast part of lwip or similar. On a power cycle when everything works, it takes about 4 seconds to connect to wifi and sends a group join IGMP packet. On a restart of the script without a power cycle it connects to wifi instantly and sends no group-join packet, despite in theory being a new socket with the setsockopt calls.
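For reference, the value passed to IP_ADD_MEMBERSHIP in the receiver script is an 8-byte membership request: the group address followed by the local interface address. A minimal sketch of how that value is built is below; `make_mreq` is a hypothetical helper name, not something from the receiver script or the MicroPython API.

```python
def inet_aton(addr):
    # Same helper as in the receiver script: dotted quad -> 4 raw bytes.
    return bytes(map(int, addr.split(".")))


def make_mreq(group, iface):
    # Hypothetical helper: 4-byte group address + 4-byte interface address.
    return inet_aton(group) + inet_aton(iface)


mreq = make_mreq("224.1.5.16", "192.168.1.72")
print(len(mreq), list(mreq))
```

Joining with this value is what should prompt the network stack to announce group membership, which is the IGMP packet that goes missing on a restart without a power cycle.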

I also tried a few things to narrow down the environment:

Please let me know what other data I can provide. I can pull any files or logs if given instructions, since I can reproduce this at will.

dpgeorge commented 1 year ago

One thing I didn't mention which may be important is a change to the receiver script in the connect() function. I have separate scripts that connect a board to the local WiFi and use that once at the start of my session, so the test scripts themselves don't need to connect (or know ssid/password).

The connect function for the above receiver test looks like this in my case:

def connect():
    #Connect to WLAN
    wlan = network.WLAN(network.STA_IF)
    ip = wlan.ifconfig()[0]
    print(f'Connected on {ip}')
    return ip

Note that the WiFi connection will be retained over a soft reset, so you can connect manually at the REPL (or with a separate mpremote run connect.py script, for example) and then run the test after connecting to the WiFi.

Can you please try that and see if the problem persists?

t8y8 commented 1 year ago

Woohoo! That appears to be the problem!

I created a connect.py file that connects to wifi and moved it out of the script I start and stop. Running multi.py from Thonny repeatedly with CTRL-C or Stop and Start all worked fine, with all packets received.

I then minimally added bits of wifi initialization back in:

So something happens when calling connect if there's already an active wifi connection.

As a workaround I can split connect out, but I do think the behavior is unexpected and is either a subtle bug or deserves supporting documentation. I can see lots of folks just connecting to wifi at the beginning of a main.py and relying on that after resets or coming out of sleep or something.

dpgeorge commented 1 year ago

Thanks for testing and confirming where the issue lies.

This may take some time to fix properly, so for now please use the workaround of only connecting once.
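One way the "connect only once" workaround could be kept inside a single script is sketched below. It assumes the interface object behaves like `network.WLAN` (the `isconnected()`, `active()`, `connect()` and `ifconfig()` methods used in the receiver script). The name `ensure_connected` and the `timeout_s` and `sleep` parameters are hypothetical, added only for illustration.

```python
import time


def ensure_connected(wlan, ssid, password, timeout_s=20, sleep=time.sleep):
    # If the link survived a soft reset, reuse it and do not call connect() again.
    if wlan.isconnected():
        return wlan.ifconfig()[0]

    wlan.active(True)
    wlan.connect(ssid, password)

    waited = 0
    while not wlan.isconnected():
        if waited >= timeout_s:
            raise OSError("WiFi connection timed out")
        sleep(1)
        waited += 1
    return wlan.ifconfig()[0]


# On the Pico W (MicroPython), usage might look like:
#   import network
#   ip = ensure_connected(network.WLAN(network.STA_IF), "SSID", "PASSWORD")
```

Passing the interface in as an argument keeps the function testable off-device. It only avoids the repeated `connect()` call identified in this thread and does not address the underlying cause.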

peterhinch commented 1 year ago

@t8y8 This may be relevant. The module mqtt-as is designed to recover from outages to any of WiFi, broker, and internet connectivity. We found that the most reliable way to resume after any outage was explicitly to disconnect from WiFi and then reconnect.
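The disconnect-then-reconnect approach described above could look roughly like the following. This is a sketch under assumptions, not code taken from mqtt-as: `reconnect`, `timeout_s` and `sleep` are hypothetical names, and the interface object is assumed to offer the `network.WLAN` methods `disconnect()`, `active()`, `connect()`, `isconnected()` and `ifconfig()`. Whether it clears the multicast problem in this issue has not been confirmed in the thread.

```python
import time


def reconnect(wlan, ssid, password, timeout_s=20, sleep=time.sleep):
    # Drop any existing association first so the join starts from a clean state.
    if wlan.isconnected():
        wlan.disconnect()

    wlan.active(True)
    wlan.connect(ssid, password)

    # Poll roughly once per second until associated or the timeout expires.
    for _ in range(timeout_s):
        if wlan.isconnected():
            return wlan.ifconfig()[0]
        sleep(1)
    if wlan.isconnected():
        return wlan.ifconfig()[0]
    raise OSError("WiFi reconnection timed out")


# On the Pico W (MicroPython), usage might look like:
#   import network
#   ip = reconnect(network.WLAN(network.STA_IF), "SSID", "PASSWORD")
```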