thegridelectric / gw-scada-spaceheat-python

GridWorks SCADA for space heating
MIT License
5 stars 2 forks source link

Intermittent failure in test_basic_atn_comm_loss #149

Open anschweitzer opened 1 year ago

anschweitzer commented 1 year ago

MQTT didn't reconnect within 3 seconds?

Ran with:

pytest -k test_basic_atn_comm_loss --count 100 -x

It looks like forced a disconnect, see the disconnect, but don't get a connect within allotted 3 seconds:

2022-12-08 20:38:17.630 
2022-12-08 20:38:17.630 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2022-12-08 20:38:17.630 ++Proactor.process_message mqtt_clients/mqtt_disconnected
2022-12-08 20:38:17.630   IN  internal     a.s                                          <-  [mqtt_clients/mqtt_disconnected]                                                            MQTTDisconnectPayload
2022-12-08 20:38:17.631 gridworks:  active -- mqtt_disconnected --> connecting
2022-12-08 20:38:17.631 --Proactor.process_message  path:0x00000009
2022-12-08 20:38:17.631 ----------------------------------------------------------------------------------------
2022-12-08 20:38:23.608 CommTestHelper caught error ERROR. [ERROR waiting link to resubscribe after comm loss] wait_for() timed out after 3.053584098815918 seconds
  [ERROR waiting link to resubscribe after comm loss]
  From test_proactor_comm.py:323
  wait function: <function test_basic_atn_comm_loss.<locals>.<lambda> at 0x10b59ba60>
  ScadaRecorder2 Stats

  Global received by message_type:
        3: [mqtt_connected]
        2: [mqtt_disconnected]
       22: [mqtt_message]
        2: [mqtt_suback]
  LinkStats [gridworks]  num_received: 22  timeouts: 0  comm events: 8
    Received by message_type:
       22: [gw]
       22: [mqtt_message]
    Received by topic:
       20: [gw/dw1-isone-ct-newhaven-orange1/gridworks-ack]
        2: [gw/dw1-isone-ct-newhaven-orange1/gridworks-ping]
    Comm event counts:
        2: [gridworks.event.comm.mqtt.connect]
        2: [gridworks.event.comm.mqtt.fully_subscribed]
        2: [gridworks.event.comm.peer_active]
        2: [gridworks.event.comm.mqtt.disconnect]

Failure was at 61st test.

out.testalot.txt

anschweitzer commented 1 year ago

A related failure might be here.

Relevant output is:

    File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/paho/mqtt/client.py", line 1556, in loop_read
      rc = self._packet_read()
    File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/paho/mqtt/client.py", line 2389, in _packet_read
      byte = self._sock_recv(1)
    File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/paho/mqtt/client.py", line 640, in _sock_recv
      return self._sock.recv(bufsize)
  AttributeError: 'NoneType' object has no attribute 'recv'

The paho thread is crashing because it's socket is None. this could be an artifact of the way we force comm failure. At the same time, we should be able to recover from mqtt crash one way or another.

anschweitzer commented 1 year ago

This failure in test_basic_parent_comm_loss might also be relevant