mikerenfro / iot-office-door-sign

IOT Office Door Sign using PyPortal
MIT License
8 stars 1 forks source link

Intermittent PyPortal crashes #1

Open mikerenfro opened 5 years ago

mikerenfro commented 5 years ago

At times, the PyPortal will crash with part of a background displayed. Currently stress-testing the existing code with more frequent wget and fetch calls to see what exceptions are thrown and how best to handle them (including soft reboots if things go completely sideways).

mikerenfro commented 5 years ago

Had one hang inside adafruit_esp32spi/adafruit_esp32spi.py", line 184, in _send_command that was finally interrupted by Ctrl-C.

  File "code.py", line 90, in <module>
  File "code.py", line 86, in <module>
  File "adafruit_pyportal.py", line 740, in fetch
  File "adafruit_pyportal.py", line 737, in fetch
  File "adafruit_pyportal.py", line 733, in fetch
  File "adafruit_pyportal.py", line 730, in fetch
  File "adafruit_pyportal.py", line 582, in wget
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 235, in get
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 221, in request
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 194, in request
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 90, in readline
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 545, in socket_available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 298, in _send_command_get_response
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 184, in _send_command
KeyboardInterrupt:

PyPortal had been up for over 18612 seconds (a bit over 5 hours) at the time of the hang, with 10 second refreshes on the status text, and 120 second refreshes on the image. Had already added garbage collection calls after each status refresh, which had no effect on this hang.

Hang occurred while communicating with Adafruit.io image conversion service. Did not see evidence of any response in the debugging logs. Normally, I get {'access-control-max-age': '1728000', ... 'x-permitted-cross-domain-policies': 'none'} quickly after making the request. None of that showed when the hang occurred.

I assume that there would be a socket timeout due to line 160 of adafruit_esp32spi_requests.py, but that was never triggered.

mikerenfro commented 5 years ago

Had a second issue where the adafruit.io service correctly calculates the BMP size, but the fetch() call doesn't receive all the data:

Fetching stream from https://io.adafruit.com/api/v2/...
{'access-control-max-age': '1728000', ... 'content-length': '72850', ...}
Saving data to  /sd/cache.bmp
Read 512 bytes, 72338 remaining
Read 1024 bytes, 71826 remaining
...
Read 40960 bytes, 31890 remaining
Read 41472 bytes, 31378 remaining
Read 41559 bytes, 31291 remaining
Created file of 41559 bytes in 5.9 seconds
Set background to  /sd/cache.bmp

This one occurred with after roughly 7000 seconds of uptime. The previous fetch() call was 128 seconds earlier, and worked fine.

ladyada commented 5 years ago

ok so the second thing is fixed (well, at least it gives you an error) for the first thing, i don't see anything in _send_command that could hang :/ do you mind tryign with debug=True or debug=2 passed into pyportal obj creation?

mikerenfro commented 5 years ago

Further requests that failed:

Further requests that succeeded:

mikerenfro commented 5 years ago

Haven't seen a complete hangup of ESP32 today.

Around 2019-04-13 19:19:07.745 -0500, did see no response from Adafruit.io, but got an 'ESP32 not responding' with a timeout instead of an infinite hang.

The exception raised by PyPortal was handled correctly, and after a total of 7 'ESP32 not responding' messages (both for Adafruit.io and my own server), things went back to normal.

Around 2019-04-13 21:33:48.848 -0500, got no response from Adafruit.io, no timeouts, and an infinite hang (up until 2019-04-14 08:11 -0500):

Traceback (most recent call last):
  File "code.py", line 103, in <module>
  File "code.py", line 97, in <module>
  File "/lib/adafruit_pyportal.py", line 745, in fetch
  File "/lib/adafruit_pyportal.py", line 742, in fetch
  File "/lib/adafruit_pyportal.py", line 738, in fetch
  File "/lib/adafruit_pyportal.py", line 732, in fetch
  File "/lib/adafruit_pyportal.py", line 582, in wget
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 235, in get
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 221, in request
  File "adafruit_esp32spi/adafruit_esp32spi_requests.py", line 194, in request
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 90, in readline
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 545, in socket_available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 299, in _send_command_get_response
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 288, in _wait_response_cmd
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 265, in _wait_response_cmd
ReloadException:
soft reboot

Looks like the Adafruit.io service did make a request of my web server, and got a 200 response and the entire image, so whatever failed had to have been after the image download, and before the converted image would have been returned.

52.91.214.175 - - [13/Apr/2019:21:33:56 -0500] "GET /path/to/load.png HTTP/1.1" 200 10661 "-" "Ruby"

As of 2019-04-14 09:49:22.174 -0500, moved my original image from a work web server to an offsite VPS. Will leave the stress test running the rest of the weekend to see if hangs are limited to the work web server or any security hardware at the campus perimeter.

mikerenfro commented 5 years ago

Still having intermittent hangs that appear to be independent of the backend server, and regardless of if adafruit.io is involved.

With PyPortal debug=True, working:

Fetching stream from https://host/path/file.bmp
{'server': 'nginx/1.10.3 (Ubuntu)', 'content-type': 'image/x-ms-bmp', 'date': 'Tue, 30 Apr 2019 19:05:34 GMT', 'last-modified': 'Tue, 30 Apr 2019 19:00:02 GMT', 'connection': 'close', 'content-length': '108962', 'etag': '"5cc89b32-1a9a2"','accept-ranges': 'bytes'}
Saving data to  /sd/cache.bmp

With PyPortal debug=True, failing:

Fetching stream from https://host/path/file.bmp
(nothing)

So response headers never get printed. Implies failure in adafruit_pyportal.py, line 587:

r = requests.get(url, stream=True)

requests.get() calls request("GET", url, **kw). In adafruit_esp32_requests.py, request() function, added four print statements:

Waiting to see which of these print on a hang.

ladyada commented 5 years ago

dya know what is the lowest level line it hangs at?

mikerenfro commented 5 years ago

Not yet. Haven't seen it hang after adding the debugging lines. I end up leaving it plugged in on one or another Mac at home or at work and leaving screen -L /dev/tty.usbmodem* running for extended periods instead of another serial monitor.

mikerenfro commented 5 years ago

11.3 hours of uninterrupted use at home, going to try again at work.

mikerenfro commented 5 years ago

4.5 hours uninterrupted at work (missed carrying it in yesterday morning), >14 hours uninterrupted at home. Either the bug was external to the PyPortal, unexpectedly fixed in a dependency, or doesn't show up on the original source .py files instead of the .mpy versions.

mikerenfro commented 5 years ago

Finally got a hang about an hour ago. Got the Data written to socket message, but not the Reading lines from socket one. Relevant lines:

print("Data written to socket")
line = sock.readline()
#print(line)
line = line.split(None, 2)
status = int(line[1])
reason = ""
if len(line) > 2:
    reason = line[2].rstrip()
print("Reading lines from socket")
mikerenfro commented 5 years ago

After adding additional debugging print statements, finally got another hang last night:

Failing logs:

Fetching stream from https://host/path/file.bmp
Timeout set
Data written to socket
Socket readline

From adafruit_esp32spi_requests.py:

        print("Data written to socket")
        line = sock.readline()
        #print(line)
        line = line.split(None, 2)
        status = int(line[1])
        reason = ""
        if len(line) > 2:
            reason = line[2].rstrip()
        print("Reading lines from socket")

From adafruit_esp32spi_socket.py:

    def readline(self):
        """Attempt to return as many bytes as we can up to but not including '\r\n'"""
        print("Socket readline")
        while b'\r\n' not in self._buffer:
            # there's no line already in there, read some more
            avail = min(_the_interface.socket_available(self._socknum), MAX_PACKET)
            if avail:
                self._buffer += _the_interface.socket_read(self._socknum, avail)
        firstline, self._buffer = self._buffer.split(b'\r\n', 1)
        gc.collect()
        return firstline

Only lines possibly causing an infinite loop/hang are the while loop in readline.

mikerenfro commented 5 years ago

No further hangs yet, but I've replaced my ad hoc debugging statements with passing debug=2 to the ESP_SPIcontrol creation in adafruit_pyportal.py. Hopefully that'll make it easier for others.

mikerenfro commented 5 years ago

Got my hang a bit ago. With all the ESP debugging enabled, a working request:

    Fetching stream from https://REDACTED/images/load.bmp
    *** Get host by name
            Sending param #0 is 16 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Parameter #0 length is 4
    Read 4:  [bytearray(b'DB\xcdx')]
    *** Get socket
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x00')]
    Allocated socket #0
    Timeout set
    *** Socket connect mode 2
    *** Open socket
            Sending param #0 is 16 bytes long
            Sending param #1 is 4 bytes long
            Sending param #2 is 2 bytes long
            Sending param #3 is 1 bytes long
            Sending param #4 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x04')]
    Writing: b'GET /images/load.bmp HTTP/1.0\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 31 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x1f\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'Host: REDACTED\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 24 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x18\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'User-Agent: Adafruit CircuitPython\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 36 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'$\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 2 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x02\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x00\x00')]
    ESPSocket: 0 bytes available
    Sending param #0 is 1 bytes long
    Parameter #0 length is 2
    ...
    ESPSocket: 0 bytes available
            Sending param #0 is 1 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x00@')]
    ESPSocket: 16384 bytes available
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x04')]
    Reading 4000 bytes from ESP socket with status 4
            Sending param #0 is 1 bytes long
            Sending param #1 is 2 bytes long
            Parameter #0 length is 4000
    Read 4000:  [bytearray(b'HTTP/1.1 200 OK\r\nServer: nginx/1.10.3 (Ubuntu)\r\nDate: Mon, 06 May 2019 13:32:28 GMT\r\nContent-Type: image/x-ms-bmp\r\nContent-Length: 108962\r\nLast-Modified: Mon, 06 May 2019 13:00:01 GMT\r\nConnection: close\r\nETag: "5cd02fd1-1a9a2"\r\nAccept-Ranges: bytes...\xf3\xf3')]
    ...

And a failing request:

    Fetching stream from https://REDACTED/images/load.bmp
    *** Get host by name
            Sending param #0 is 16 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Parameter #0 length is 4
    Read 4:  [bytearray(b'DB\xcdx')]
    *** Get socket
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x00')]
    Allocated socket #0
    Timeout set
    *** Socket connect mode 2
    *** Open socket
            Sending param #0 is 16 bytes long
            Sending param #1 is 4 bytes long
            Sending param #2 is 2 bytes long
            Sending param #3 is 1 bytes long
            Sending param #4 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x04')]
    Writing: b'GET /images/load.bmp HTTP/1.0\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 31 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x1f\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'Host: REDACTED\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 24 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x18\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'User-Agent: Adafruit CircuitPython\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 36 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'$\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
    Writing: b'\r\n'
            Sending param #0 is 1 bytes long
            Sending param #1 is 2 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x02\x00')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 1
    Read 1:  [bytearray(b'\x01')]
            Sending param #0 is 1 bytes long
            Parameter #0 length is 2
    Read 2:  [bytearray(b'\x00\x00')]
    ESPSocket: 0 bytes available
            Sending param #0 is 1 bytes long
            Parameter #0 length is 2

where the Read 2: [bytearray(b'\x00\x00')] through Parameter #0 length is 2 keep repeating.

mikerenfro commented 5 years ago

Updated to ESP32 firmware 1.3.0 to see if that makes any difference.

ladyada commented 5 years ago

yea looks like it just 'hangs' - we should detect this and do a hard reset

mikerenfro commented 5 years ago

I'm assuming this is the same root cause as in Fixed infinite loop when socket readline fails.