greghesp / ha-bambulab

A Home Assistant Integration for Bambu Lab Printers

[Bug] Home Assistant general connection issue when this integration is installed #462

Closed jjvelar closed 2 months ago

jjvelar commented 5 months ago

Describe the bug

I have Home Assistant running on a dedicated NUC with HAOS, and a Bambu Lab A1 with AMS Lite. After installing the Bambu Lab integration, I get Home Assistant connection issues randomly after 6 to 24 hours: I couldn't reach HA from my local network, neither via the HA IP address nor via .local, yet Observer showed "healthy" and "connected" status. All green. When I restarted HA in safe mode, there were no connection issues for a couple of days. I then restarted in normal mode and started disabling integrations, but the connection issue showed up again. Only when I got to the Bambu Lab integration and disabled it did the connection issues stop, while all the other custom integrations were up and running.

To Reproduce

Just install the Bambu Lab custom integration, connect your printer, and wait.

Expected Behaviour

No HA connection issues.

What device are you using?

A1

Diagnostic Output

{
  "home_assistant": {
    "installation_type": "Home Assistant OS",
    "version": "2024.2.0",
    "dev": false,
    "hassio": true,
    "virtualenv": false,
    "python_version": "3.12.1",
    "docker": true,
    "arch": "x86_64",
    "timezone": "Europe/Madrid",
    "os_name": "Linux",
    "os_version": "6.1.74-haos",
    "supervisor": "2024.01.1",
    "host_os": "Home Assistant OS 11.5",
    "docker_version": "24.0.7",
    "chassis": "embedded",
    "run_as_root": true
  },
  "custom_components": {
    "edata": {
      "version": "2023.06.3",
      "requirements": [
        "e-data==1.1.5",
        "python-dateutil>=2.8.2"
      ]
    },
    "anniversary": {
      "version": "0.3.0",
      "requirements": []
    },
    "cryptoinfo": {
      "version": "0.1.7",
      "requirements": []
    },
    "alexa_media": {
      "version": "4.9.0",
      "requirements": [
        "alexapy==1.27.10",
        "packaging>=20.3",
        "wrapt>=1.14.0"
      ]
    },
    "teamtracker": {
      "version": "0.1",
      "requirements": [
        "arrow",
        "aiofiles"
      ]
    },
    "personalcapital": {
      "version": "0.1.2",
      "requirements": [
        "personalcapital==1.0.1"
      ]
    },
    "openmediavault": {
      "version": "0.0.0",
      "requirements": []
    },
    "var": {
      "version": "0.15.0",
      "requirements": []
    },
    "kontomierz_sensor": {
      "version": "0.0.1",
      "requirements": []
    },
    "pvpc_hourly_pricing": {
      "version": "1.1.1",
      "requirements": [
        "aiopvpc>=4.2.1"
      ]
    },
    "localtuya": {
      "version": "5.2.1",
      "requirements": []
    },
    "temperature_feels_like": {
      "version": "0.3.8",
      "requirements": [
        "colorlog==6.7.0",
        "ruff==0.1.1"
      ]
    },
    "worlds_air_quality_index": {
      "version": "1.1.0",
      "requirements": []
    },
    "nodered": {
      "version": "3.1.3",
      "requirements": []
    },
    "tplink_deco": {
      "version": "3.6.0",
      "requirements": [
        "pycryptodome>=3.12.0"
      ]
    },
    "huawei_solar": {
      "version": "1.3.1",
      "requirements": [
        "huawei-solar==2.2.9"
      ]
    },
    "smartir": {
      "version": "1.17.9",
      "requirements": [
        "aiofiles>=0.6.0"
      ]
    },
    "scheduler": {
      "version": "v0.0.0",
      "requirements": []
    },
    "tapo_control": {
      "version": "5.4.13",
      "requirements": [
        "pytapo==3.3.18"
      ]
    },
    "yi_hack": {
      "version": "0.3.6",
      "requirements": []
    },
    "awox": {
      "version": "0.1.5",
      "requirements": [
        "pexpect>=4.6.0",
        "pycryptodome>=3.6.6",
        "pygatt[GATTTOOL]>=4.0.5"
      ]
    },
    "bambu_lab": {
      "version": "2.0.15",
      "requirements": []
    },
    "webrtc": {
      "version": "v3.5.1",
      "requirements": []
    },
    "cryptostate": {
      "version": "2.0.0",
      "requirements": []
    },
    "watchman": {
      "version": "0.5.1",
      "requirements": [
        "prettytable==3.0.0"
      ]
    },
    "gas_station_spain": {
      "version": "0.8.0",
      "requirements": []
    },
    "balance_neto": {
      "version": "0.1.0",
      "requirements": []
    },
    "govee": {
      "version": "2023.11.1",
      "requirements": [
        "govee-api-laggat==0.2.2",
        "dacite==1.8.0"
      ]
    },
    "sonoff": {
      "version": "3.5.4",
      "requirements": [
        "pycryptodome>=3.6.6"
      ]
    },
    "hacs": {
      "version": "1.34.0",
      "requirements": [
        "aiogithubapi>=22.10.1"
      ]
    },
    "frigate": {
      "version": "5.0.1",
      "requirements": [
        "pytz==2022.7"
      ]
    },
    "garmin_connect": {
      "version": "0.2.19",
      "requirements": [
        "garminconnect==0.2.12",
        "tzlocal"
      ]
    },
    "spotcast": {
      "version": "v3.6.30",
      "requirements": []
    },
    "delete": {
      "version": "1.8",
      "requirements": []
    },
    "freeds": {
      "version": "0.12.0",
      "requirements": [
        "websockets"
      ]
    },
    "floureon": {
      "version": "1.0.1",
      "requirements": [
        "pythoncrc",
        "broadlink>=0.18.3"
      ]
    }
  },
  "integration_manifest": {
    "domain": "bambu_lab",
    "name": "Bambu Lab",
    "codeowners": [
      "@greghesp",
      "@AdrianGarside"
    ],
    "config_flow": true,
    "dependencies": [
      "device_automation",
      "ffmpeg",
      "mqtt"
    ],
    "documentation": "https://github.com/greghesp/ha-bambulab",
    "iot_class": "local_push",
    "issue_tracker": "https://github.com/greghesp/ha-bambulab/issues",
    "ssdp": [
      {
        "st": "urn:bambulab-com:device:3dprinter:1"
      }
    ],
    "version": "2.0.15",
    "is_built_in": false
  },
  "data": {
    "config_entry": {
      "entry_id": "67ce294e50441fdead4213a030a0046b",
      "version": 2,
      "minor_version": 1,
      "domain": "bambu_lab",
      "title": "**REDACTED**",
      "data": {
        "device_type": "A1",
        "serial": "**REDACTED**"
      },
      "options": {
        "region": "Europe",
        "email": "**REDACTED**",
        "username": "**REDACTED**",
        "name": "Bambu Lab A1 Combo",
        "host": "192.168.1.88",
        "local_mqtt": false,
        "auth_token": "**REDACTED**",
        "access_code": "**REDACTED**",
        "usage_hours": 290.0047222222222
      },
      "pref_disable_new_entities": false,
      "pref_disable_polling": false,
      "source": "user",
      "unique_id": null,
      "disabled_by": null
    },
    "push_all": {
      "ipcam": {
        "ipcam_dev": "1",
        "ipcam_record": "enable",
        "timelapse": "disable",
        "resolution": "1080p",
        "tutk_server": "disable",
        "mode_bits": 3
      },
      "upload": {
        "status": "idle",
        "progress": 0,
        "message": ""
      },
      "nozzle_temper": 23.5625,
      "nozzle_target_temper": 0,
      "bed_temper": 23.40625,
      "bed_target_temper": 0,
      "chamber_temper": 5,
      "mc_print_stage": "1",
      "heatbreak_fan_speed": "0",
      "cooling_fan_speed": "0",
      "big_fan1_speed": "0",
      "big_fan2_speed": "0",
      "mc_percent": 100,
      "mc_remaining_time": 0,
      "ams_status": 0,
      "ams_rfid_status": 0,
      "hw_switch_state": 0,
      "spd_mag": 100,
      "spd_lvl": 2,
      "print_error": 0,
      "lifecycle": "product",
      "wifi_signal": "-33dBm",
      "gcode_state": "FINISH",
      "gcode_file_prepare_percent": "100",
      "queue_number": 0,
      "queue_total": 0,
      "queue_est": 0,
      "queue_sts": 0,
      "project_id": "49298284",
      "profile_id": "48508693",
      "task_id": "96471770",
      "subtask_id": "96471771",
      "subtask_name": "A1 Guia cable Eje X",
      "gcode_file": "",
      "stg": [],
      "stg_cur": 255,
      "print_type": "idle",
      "home_flag": 322454936,
      "mc_print_line_number": "81742",
      "mc_print_sub_stage": 0,
      "sdcard": true,
      "force_upgrade": false,
      "mess_production_state": "active",
      "layer_num": 444,
      "total_layer_num": 444,
      "s_obj": [],
      "filam_bak": [],
      "fan_gear": 0,
      "nozzle_diameter": "0.4",
      "nozzle_type": "stainless_steel",
      "upgrade_state": {
        "sequence_id": 0,
        "progress": "",
        "status": "",
        "consistency_request": false,
        "dis_state": 0,
        "err_code": 0,
        "force_upgrade": false,
        "message": "0%, 0B/s",
        "module": "",
        "new_version_state": 2,
        "cur_state_code": 0,
        "new_ver_list": []
      },
      "hms": [],
      "online": {
        "ahb": false,
        "rfid": false,
        "version": 942176693
      },
      "ams": {
        "ams": [
          {
            "id": "0",
            "humidity": "5",
            "temp": "0.0",
            "tray": [
              {
                "id": "0",
                "remain": 100,
                "k": 0.019999999552965164,
                "n": 1,
                "tag_uid": "0000000000000000",
                "tray_id_name": "",
                "tray_info_idx": "GFL99",
                "tray_type": "PLA",
                "tray_sub_brands": "",
                "tray_color": "010101FF",
                "tray_weight": "0",
                "tray_diameter": "0.00",
                "tray_temp": "0",
                "tray_time": "0",
                "bed_temp_type": "0",
                "bed_temp": "0",
                "nozzle_temp_max": "240",
                "nozzle_temp_min": "190",
                "xcam_info": "000000000000000000000000",
                "tray_uuid": "00000000000000000000000000000000"
              },
              {
                "id": "1",
                "remain": 100,
                "k": 0.019999999552965164,
                "n": 1,
                "tag_uid": "3277907000000100",
                "tray_id_name": "A00-W1",
                "tray_info_idx": "GFA00",
                "tray_type": "PLA",
                "tray_sub_brands": "PLA Basic",
                "tray_color": "FFFFFFFF",
                "tray_weight": "1000",
                "tray_diameter": "1.75",
                "tray_temp": "55",
                "tray_time": "8",
                "bed_temp_type": "1",
                "bed_temp": "35",
                "nozzle_temp_max": "230",
                "nozzle_temp_min": "190",
                "xcam_info": "34218813F401E8030000003F",
                "tray_uuid": "675F3FB8059E4887B3D7EB8ECF854B5D"
              },
              {
                "id": "2",
                "remain": 100,
                "k": 0.019999999552965164,
                "n": 1,
                "tag_uid": "5C5E973900000100",
                "tray_id_name": "A00-B8",
                "tray_info_idx": "GFA00",
                "tray_type": "PLA",
                "tray_sub_brands": "PLA Basic",
                "tray_color": "0086D6FF",
                "tray_weight": "1000",
                "tray_diameter": "1.75",
                "tray_temp": "55",
                "tray_time": "8",
                "bed_temp_type": "1",
                "bed_temp": "35",
                "nozzle_temp_max": "230",
                "nozzle_temp_min": "190",
                "xcam_info": "8813AC0DE803E8039A99193F",
                "tray_uuid": "797878CAF7DB4EA9A1AFB1B83C4E064D"
              },
              {
                "id": "3",
                "remain": 100,
                "k": 0.019999999552965164,
                "n": 1,
                "tag_uid": "0000000000000000",
                "tray_id_name": "",
                "tray_info_idx": "GFL99",
                "tray_type": "PLA",
                "tray_sub_brands": "",
                "tray_color": "C12E1EFF",
                "tray_weight": "0",
                "tray_diameter": "0.00",
                "tray_temp": "0",
                "tray_time": "0",
                "bed_temp_type": "0",
                "bed_temp": "0",
                "nozzle_temp_max": "240",
                "nozzle_temp_min": "190",
                "xcam_info": "000000000000000000000000",
                "tray_uuid": "00000000000000000000000000000000"
              }
            ]
          }
        ],
        "ams_exist_bits": "1",
        "tray_exist_bits": "f",
        "tray_is_bbl_bits": "f",
        "tray_tar": "255",
        "tray_now": "255",
        "tray_pre": "255",
        "tray_read_done_bits": "f",
        "tray_reading_bits": "0",
        "version": 20,
        "insert_flag": true,
        "power_on_flag": true
      },
      "xcam": {
        "buildplate_marker_detector": true
      },
      "vt_tray": {
        "id": "254",
        "tag_uid": "0000000000000000",
        "tray_id_name": "",
        "tray_info_idx": "GFL99",
        "tray_type": "PLA",
        "tray_sub_brands": "",
        "tray_color": "FFF734FF",
        "tray_weight": "0",
        "tray_diameter": "0.00",
        "tray_temp": "0",
        "tray_time": "0",
        "bed_temp_type": "0",
        "bed_temp": "0",
        "nozzle_temp_max": "240",
        "nozzle_temp_min": "190",
        "xcam_info": "000000000000000000000000",
        "tray_uuid": "00000000000000000000000000000000",
        "remain": 0,
        "k": 0.019999999552965164,
        "n": 1
      },
      "lights_report": [
        {
          "node": "chamber_light",
          "mode": "off"
        }
      ],
      "command": "push_status",
      "msg": 0,
      "sequence_id": "22242"
    },
    "get_version": {
      "command": "get_version",
      "sequence_id": "0",
      "module": [
        {
          "name": "ota",
          "project_name": "N2S",
          "sw_ver": "01.02.00.01",
          "hw_ver": "OTA",
          "sn": "**REDACTED**",
          "flag": 3
        },
        {
          "name": "esp32",
          "project_name": "N2S",
          "sw_ver": "01.08.25.64",
          "hw_ver": "AP05",
          "sn": "**REDACTED**",
          "flag": 0
        },
        {
          "name": "mc",
          "project_name": "N2S",
          "sw_ver": "00.00.23.46",
          "loader_ver": "00.00.00.32",
          "hw_ver": "MC02",
          "sn": "**REDACTED**",
          "flag": 0
        },
        {
          "name": "th",
          "project_name": "N2S",
          "sw_ver": "00.00.07.69",
          "loader_ver": "00.00.00.26",
          "hw_ver": "TH01",
          "sn": "**REDACTED**",
          "flag": 0
        },
        {
          "name": "ams_f1/0",
          "project_name": "",
          "sw_ver": "00.00.07.92",
          "loader_ver": "00.00.00.00",
          "ota_ver": "00.00.00.00",
          "hw_ver": "AMS_F102",
          "sn": "**REDACTED**",
          "flag": 0
        }
      ],
      "result": "success",
      "reason": ""
    }
  }
}

Log Extracts

No response

Other Information

No response

AdrianGarside commented 4 months ago

Please can you check the logs for anything that would shed light on this. If nothing is present initially please enable debug logging for the integration and see if that shows anything interesting.

jjvelar commented 4 months ago

Please can you check the logs for anything that would shed light on this. If nothing is present initially please enable debug logging for the integration and see if that shows anything interesting.

Will do that.

WayneManion commented 4 months ago

I am also experiencing frequent problems with an unresponsive Home Assistant front end. I run Hass OS in a VM. I can ping the VM and I can get into the console through the VM's host machine. Today was the first time in a while that the Home Assistant web UI froze up when I had some time to look into it. I looked at my logs and at the end I see this:

2024-02-15 14:50:09.280 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: A Chamber Image thread outer exception occurred:
2024-02-15 14:50:09.281 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: Exception. Type: <class 'RuntimeError'> Args: Event loop is closed
2024-02-15 14:50:09.582 WARNING (Thread-26 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_27] is still running at shutdown: File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/homeassistant/homeassistant/components/wake_on_lan/switch.py", line 153, in update
    status = sp.call(ping_cmd, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
  File "/usr/local/lib/python3.12/subprocess.py", line 391, in call
    return p.wait(timeout=timeout)
  File "/usr/local/lib/python3.12/subprocess.py", line 1264, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.12/subprocess.py", line 2046, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/local/lib/python3.12/subprocess.py", line 2004, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
2024-02-15 14:50:09.749 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_27] is still running at shutdown: File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/homeassistant/homeassistant/components/wake_on_lan/switch.py", line 153, in update
    status = sp.call(ping_cmd, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
  File "/usr/local/lib/python3.12/subprocess.py", line 391, in call
    return p.wait(timeout=timeout)
  File "/usr/local/lib/python3.12/subprocess.py", line 1264, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.12/subprocess.py", line 2046, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/local/lib/python3.12/subprocess.py", line 2004, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
2024-02-15 14:50:10.662 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] An exception occurred processing a message:
2024-02-15 14:50:10.663 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Exception type: <class 'RuntimeError'>
2024-02-15 14:50:10.663 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Exception data: Event loop is closed
2024-02-15 14:50:13.213 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: A Chamber Image thread outer exception occurred:
2024-02-15 14:50:13.213 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: Exception. Type: <class 'RuntimeError'> Args: Event loop is closed
2024-02-15 14:50:16.127 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: A Chamber Image thread outer exception occurred:
2024-02-15 14:50:16.128 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: Exception. Type: <class 'RuntimeError'> Args: Event loop is closed
2024-02-15 14:50:16.763 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] An exception occurred processing a message:
2024-02-15 14:50:16.763 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Exception type: <class 'RuntimeError'>
2024-02-15 14:50:16.763 ERROR (Thread-14 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Exception data: Event loop is closed
2024-02-15 14:50:19.022 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: A Chamber Image thread outer exception occurred:
2024-02-15 14:50:19.022 ERROR (Thread-17) [custom_components.bambu_lab.pybambu] P1P: Exception. Type: <class 'RuntimeError'> Args: Event loop is closed

I have a full log file, but it is 87 MB (1 MB when 7-zipped). I have a Bambu P1P printer with the P1S conversion kit. I bought my P1P well after launch, so it came with the initially-optional camera pre-installed, for whatever that is worth.

AdrianGarside commented 4 months ago

Did you restart HA? Some of those errors I expect to see, as I don't yet handle shutting down cleanly when HA is restarting, but they're benign since HA is going away. Multiple errors from the P1P chamber image thread might be expected, as I recently allowed it to retry the connection up to 10 times with a short sleep between each.
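For context, here is a minimal sketch of how these benign shutdown errors arise (illustrative only, not the integration's actual code): a worker thread hands data back to an asyncio event loop that has already been closed, and the call raises the same "Event loop is closed" RuntimeError seen in the log above.

import asyncio
import threading

loop = asyncio.new_event_loop()
loop.close()  # simulate HA closing its event loop during shutdown

def listener_thread():
    # Stand-in for an MQTT/chamber-image worker posting results back:
    try:
        loop.call_soon_threadsafe(print, "new printer data")
    except RuntimeError as e:
        print(f"Exception type: {type(e)}")  # <class 'RuntimeError'>
        print(f"Exception data: {e}")        # Event loop is closed

t = threading.Thread(target=listener_thread)
t.start()
t.join()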

If you can connect to your HA host when HA itself is not being responsive, getting the top CPU using processes from the 'top' command might help shed light on things.

WayneManion commented 4 months ago

I did not restart Home Assistant. I ran the command to restart the Home Assistant process well after I grabbed this log file. Something about core (edit: ha core start; I'm not at the computer right now).

AdrianGarside commented 4 months ago

OK, that probably means those errors are very relevant then. When you get a chance, could you drop the logs on OneDrive or similar and send me a link?

WayneManion commented 4 months ago

I had two more incidents where my Home Assistant web UI became unavailable. By the time I noticed I think any CPU use spike was over. When I looked in top, there was essentially no CPU use. I run Home Assistant OS in a VM on an x86 host with lots of cores (20?) and lots of RAM (128 GB). I allocate 4 GB RAM and four cores to the Home Assistant OS VM. Here are links to the home-assistant.log files:

https://www.dropbox.com/scl/fi/gucw4x2etxm1ymydpox6w/home-assistant.log.01.7z?rlkey=3dzklj0kd0vx9xziv6n6n9uzc&dl=0 https://www.dropbox.com/scl/fi/zb9u9bxuwxud4amr92wjg/home-assistant.log.02.7z?rlkey=mfplcxhbrygpbvlkhvn4xmt87&dl=0 https://www.dropbox.com/scl/fi/zews4prypmmtym9ks6etc/home-assistant.log.03.7z?rlkey=0gtwj86jrknvbsx6wyixgsr7m&dl=0

All of these logs are from times when Home Assistant's web UI became inaccessible over the last couple of days. They don't all end exactly the same way, but they do all finish with error messages about Bambu, paho, and mqtt in the last 100 lines. In all cases, Home Assistant OS was still running, so I could copy out the log file before attempting to restart Home Assistant.

Edit to add: I have a Bambu P1P with P1S conversion kit and multiple Klipper printers, all with at least one camera. I have a Home Assistant Lovelace page set up to be able to monitor all of them. This is just to say that I give the Home Assistant Bambu integration a decent workout.

Second edit: My printer is connected to the integration via LAN mode, if that matters.

AdrianGarside commented 4 months ago

Ah, my old friend 'list changed size during iteration'. It rarely happens with this integration. I still have no idea what's causing it, since it happens when HA calls into the integration to retrieve sensor data, and I haven't been able to work out which sensor data is changing while HA is reading it. But I had to make a fix for a deadlock it caused.

RuntimeError: list changed size during iteration
During handling of the above exception, another exception occurred:

This is putting HA into a tight loop, I think, but it doesn't appear to be due to this integration. The other exception it's complaining about is from 'zigpy', which is ZigBee.

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/clusters/general.py", line 2187, in _handle_cluster_request
    await self._handle_query_next_image(
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/clusters/general.py", line 2251, in _handle_query_next_image
    await self.query_next_image_response(
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/__init__.py", line 415, in reply
    return await self._endpoint.reply(
  File "/usr/local/lib/python3.12/site-packages/zigpy/endpoint.py", line 278, in reply
    return await self.device.reply(
  File "/usr/local/lib/python3.12/site-packages/zigpy/device.py", line 488, in reply
    return await self.request(
  File "/usr/local/lib/python3.12/site-packages/zigpy/device.py", line 322, in request
    await send_request()
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 833, in request
    await self.send_packet(
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/zigbee/application.py", line 934, in send_packet
    async with self._limit_concurrency():
  File "/usr/local/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 761, in _limit_concurrency
    async with self._concurrent_requests_semaphore:
  File "/usr/local/lib/python3.12/site-packages/zigpy/datastructures.py", line 145, in __aenter__
    await self.acquire()
  File "/usr/local/lib/python3.12/site-packages/zigpy/datastructures.py", line 111, in acquire
    await fut
asyncio.exceptions.CancelledError: Final process shutdown

I can confirm that once that's happening, the main HA thread is in a tight loop, which would absolutely cause the symptoms you're seeing.
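Incidentally, this exact message is raised by CPython's C heapq implementation when the heap list is mutated in the middle of a heap operation, for example by another thread appending to it. A minimal, hypothetical demonstration of the mechanism, not HA's or the integration's code:

import heapq

class Evil:
    """Comparing this item mutates the heap, simulating a second
    thread appending while a heap operation is re-sifting the list."""
    def __init__(self, heap):
        self.heap = heap
    def __lt__(self, other):
        self.heap.append(0)  # size change mid-operation
        return True

heap = []
heap.extend(Evil(heap) for _ in range(3))
# Raises "RuntimeError: list changed size during iteration" with
# CPython's C heapq, because the list grew during the sift:
heapq.heapify(heap)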

AdrianGarside commented 4 months ago

So in your case I think the errors from this integration are just the benign ones that kick in at integration unload because HA has finally tried to shut down due to the zigpy errors.

jjvelar commented 4 months ago

Hi. I'm still having the original issue. I'm not sure what kind of logs or additional information I can provide. Thanks in advance for your help. Best regards,

José

AdrianGarside commented 4 months ago

I need a full debug log capturing the repro.

jjvelar commented 4 months ago

config_entry-bambu_lab-dd3cfd33cc55705ce9b78f9a17483650.json host_2024-02-29T16-15-20.898Z.log supervisor_2024-02-29T16-15-08.490Z.log home-assistant_2024-02-29T16-14-50.034Z.log

Here you've got all I can get.

AdrianGarside commented 4 months ago

2024-02-29 17:14:19.472 ERROR (SyncWorker_19) [homeassistant.components.influxdb] Cannot connect to InfluxDB due to 'HTTPConnectionPool(host='192.168.1.144', port=8086): Max retries exceeded with url: /write?db=home_assistant (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff802dfa0f0>: Failed to establish a new connection: [Errno 111] Connection refused'))'. Please check that the provided connection details (host, port, etc.) are correct and that your InfluxDB server is running and accessible. Retrying in 60 seconds.

Is InfluxDB the database you use for HA or is that specific to some other integration?

Do the instructions here, for configuring HA to stop spamming the database with image update events, make any difference? https://github.com/greghesp/ha-bambulab/issues/423#issuecomment-1913307657

jjvelar commented 4 months ago

Thanks for your reply, @AdrianGarside. My database is SQLite. The InfluxDB error is weird, as I uninstalled the InfluxDB addon a while ago; when I checked the configuration.yaml file, I found I had forgotten to remove the "influxdb:" section. I've commented it out and will see if this works. I have also excluded the image, wifi, etc. sensors from my "recorder:" section and will see how it goes. I will keep you posted.

AdrianGarside commented 4 months ago

Other than those suspicious db errors, which probably aren't related, I didn't see anything standing out. But you don't have DEBUG logging turned on for this integration, so there's virtually no logging to help understand whether it's doing something unexpected (specifically, getting into a tight loop).

To get full debug logging requires an edit to configuration.yaml:

logger:
  default: info
  logs:
    custom_components.bambu_lab: debug

Note that this will generate a lot more logging output. So if, for example, you're running on a Raspberry Pi off an SD card, leaving this enabled permanently will shorten the life of that card a lot, but a couple of days would be fine. Proper SSDs are not significantly affected, as they have vastly larger write endurance; even endurance-rated SD cards have a tiny write capacity in comparison.

jjvelar commented 4 months ago

Ok. Will do it if it hangs again. Thanks once more for your support. Much appreciated, @AdrianGarside

AdrianGarside commented 4 months ago

I'll do my best to help you work out how the integration is misbehaving, but since I cannot repro it, my only mechanism is going to be debug logs. So the hope is to find something in those that either immediately tells me where the problem lies or at least gives me an indication of the general area, so I can add more logging to narrow it down.

AdrianGarside commented 4 months ago

One thing that could be useful when you start seeing performance problems: if you can connect to the HA host/container, run the 'top' command to see if there's a process consuming 100% CPU. That would help determine whether this is a CPU tight-loop issue or some kind of hang.

jjvelar commented 4 months ago

Will do that. Thanks again

jjvelar commented 4 months ago

Hi @AdrianGarside,

I had the issue again last night and, as you anticipated, fixing the InfluxDB error and removing the wifi, image and other Bambu Lab sensors from the recorder didn't help.

I have activated the logs as you suggested, so we'll see what we get from there the next time the issue happens.

What I couldn't do was run the "top" command. I have Home Assistant OS installed, so I can run the command from the terminal addon, but this option was not available during the issue, when the connection was down.

Then I tried attaching a monitor and a keyboard directly to the NUC that runs Home Assistant OS, but although I could run many other commands, "top" was not recognized.

Is there a way to run "top" when connected directly to the NUC?

Thanks.

jjvelar commented 4 months ago

Hi @AdrianGarside

I got the issue again. Attached are the log files with the debug info activated, as per your request.

Thanks in advance for your help.

home-assistant.log.zip

I have also been able to run the "top" command via a PuTTY connection, so here you've got the screen capture:

[screenshot: top output]

AdrianGarside commented 4 months ago

Do you know how long it was in a bad state before you restarted Home Assistant?

The logs don't seem to show anything obvious, but the previous instance shut down at 8:04:16 and the next log starts at 8:56:40. So apparently your HA instance wasn't running for over 50 minutes? When you restarted it just before you got the logs, what did you observe and what actions did you take to get things working again? Do you know why the HA instance shut down, or is that an unexpected shutdown and the reason the HA instance is becoming unresponsive (since it's not running!)?

Another thing to keep an eye on is memory usage - does top show a steady increase in memory usage while HA is running normally?

My current, not well supported, theory is that something caused your HA instance to restart, and somehow the integration caused that shutdown to hang, blocking a timely restart. But the errors from the integration at shutdown are somewhat normal. I'm planning to see if I can make it handle shutdown more gracefully, without the errors about the mqtt connection event loop being closed, but AFAIK those were basically benign. Perhaps they aren't as benign as I thought in all scenarios. And right up until the shutdown starts, the integration is still receiving data from your printer.

Actually, there is an uncaught thread exception on _thread_main from the mqtt library once the shutdown starts, and that isn't something I've seen as part of the normal, benign connection-closed errors that I catch during HA shutdown. Given the thread involved, it's definitely plausible that error breaks the shutdown and leaves HA in a zombie state.

TL;DR: I'd like anything more you can tell me about that shutdown to see if I can reproduce the scenario. My shutdowns only ever come from two causes: a restart via the HA UI on my production instance (rarely, when I update the integration via HACS) or a restart of the docker instance (production and test instances). The latter would never hide any permanent shutdown hangs.

jjvelar commented 4 months ago

It was probably disconnected for about 40-50 minutes, so that aligns with what you see in the logs. The host reboot is done manually by me, as it is the only way to connect to Home Assistant again. By "connect again" I mean that when I'm in this situation and try to open the local Home Assistant IP, I get "err_connection" in the browser (or a red banner with a similar message in the Home Assistant Companion app). The only way to regain access to Home Assistant is to manually reboot the host (I have Home Assistant OS installed on a dedicated NUC). I do the manual reboot by connecting a keyboard to the NUC and typing "host reboot".

Regarding memory usage, I have a sensor in Home Assistant that records it. The graph is attached below; I have highlighted when it hung and when I manually rebooted.

[screenshot: memory usage graph]

AdrianGarside commented 4 months ago

So it looks like there's a slow memory leak in HA, but probably not enough to explain it seemingly trying to restart itself. And when it does try to restart, that seems to trigger an exception from the mqtt client we use, which seems to hang/block HA.

I've been planning to try to handle the restart scenario more gracefully, but I'm not convinced I can actually prevent the mqtt client from throwing an exception on the main thread. It's not supposed to be doing anything on that thread…
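For illustration, one way to make the restart path quieter is to gate every cross-thread callback on a shutdown flag that is set before HA closes its event loop. A hypothetical sketch, with illustrative names rather than the integration's actual code:

import asyncio

class PrinterCoordinator:
    """Hypothetical sketch: drop MQTT callbacks once shutdown begins."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._shutting_down = False

    def notify_shutdown(self):
        # Called from HA's stop event, before the loop closes.
        self._shutting_down = True

    def on_printer_data(self, data):
        # Called from the MQTT listener thread.
        if self._shutting_down or self._loop.is_closed():
            return  # drop the update rather than raise "Event loop is closed"
        self._loop.call_soon_threadsafe(self._publish_update, data)

    def _publish_update(self, data):
        print("sensor update:", data)  # stand-in for async_set_updated_data()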

jjvelar commented 4 months ago

In case it helps, I have attached the memory usage sensor chart for the last couple of days. I identified when the connection error happened yesterday, and the same time today. As you can see, the slow memory leak is still happening but, as I disabled your integration yesterday, there was no connection error today.

[screenshot: memory usage chart for the last couple of days]

AdrianGarside commented 4 months ago

Do you have anything scheduled to happen in HA at 8am?

jjvelar commented 4 months ago

No big deal: it just reboots all the cameras for a refresh and switches the heater off.

AdrianGarside commented 4 months ago

I'll work on trying to improve the integration handling of HA restart but I'd love to know why HA tried to restart by itself since that's very odd to me.

jjvelar commented 4 months ago

That's really much appreciated, @AdrianGarside. But there is something I don't understand: you say "handling of HA restart", but my HA is not restarting. It just refuses the connection to the local or external (I use Nabu Casa) IP address. In any case, I am more than happy to beta test.

AdrianGarside commented 4 months ago

Per your logs, it was shutting down at 8am when it became unresponsive. The integration appears to have caused it to hang while shutting down, due to the mqtt client throwing an exception on the main HA thread. That's why I'm asking what might have caused your HA instance to decide to restart. A manual restart triggers a few caught exceptions from the mqtt client in the integration, but I've never seen the uncaught one on the main thread that was in your logs.

jjvelar commented 4 months ago

Happy to be special :-) Let me know how I can help: debugging, testing...

AdrianGarside commented 4 months ago

Please give this version a try: https://github.com/greghesp/ha-bambulab/releases/tag/v2.0.16-dev1

jjvelar commented 4 months ago

Please give this version a try: https://github.com/greghesp/ha-bambulab/releases/tag/v2.0.16-dev1

Done. I have also enabled full debug logging for the integration. Will report how it goes. Thanks for your support. Very much appreciated.

jjvelar commented 3 months ago

Hi @AdrianGarside, I got a connection error today at 2:40pm CET. Attached are the log files: logs.zip. Let me know if there is anything else I can do to help you solve this issue. Thanks!

jjvelar commented 3 months ago

UPDATE: I have Uptime Kuma installed, so I could trace when I lost the connection to Home Assistant, both locally and through the Remote UI provided by Nabu Casa, and it looks like the last active connection happened just before a print ended. That print finished at 2:16pm CET. I hope this helps, @AdrianGarside

AdrianGarside commented 3 months ago

This is definitely related to the other issue with similar symptoms:

2024-03-11 14:20:29.516 ERROR (MainThread) [homeassistant] Error doing job: an error occurred during closing of asynchronous generator <async_generator object BoundAsyncStream.__aiter__ at 0x7fc529187850>
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/runner.py", line 188, in run
    return loop.run_until_complete(setup_and_run_hass(runtime_config))
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 672, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 639, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1936, in _run_once
    handle = heapq.heappop(self._scheduled)
RuntimeError: list changed size during iteration

During handling of the above exception, another exception occurred:

In this case it's less clear what's triggering it, but it appears that HA is calling into the integration off the main thread to get data and hitting the list-changed-size case, and that's taking out the main thread.

My only guess right now is that the list in question is the HMS error list. But maybe hitting this as a print was ending will suggest something else to me.

jjvelar commented 3 months ago

Thanks for replying. How can I help? What else should I test or try?

AdrianGarside commented 3 months ago

I'm going to add some more logging to see if I can narrow down what data is being requested that leads to that list-changed-size exception. The infrequency of the repro is another potential hint that it's the HMS error list.
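If the HMS error list does turn out to be the culprit, a common mitigation for this class of bug is to hand HA a snapshot of the shared list instead of the live object. A hypothetical sketch under that assumption; the names here are illustrative, not the integration's actual code:

import threading

class HMSErrorList:
    """Hypothetical holder for HMS errors shared between the MQTT
    listener thread and HA's sensor reads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._errors = []

    def update(self, errors):
        # Writer side (MQTT thread): swap in a fresh copy under the lock.
        with self._lock:
            self._errors = list(errors)

    def snapshot(self):
        # Reader side (HA sensor update): return a copy, so iteration
        # can never observe a list that changes size mid-read.
        with self._lock:
            return list(self._errors)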

jjvelar commented 3 months ago

Ok. Thanks again for your support. Just let me know and I will test it.

AdrianGarside commented 3 months ago

New release with a bit more logging around the HMS error list setting/retrieval to see if that proves it's involved in this: https://github.com/greghesp/ha-bambulab/releases/tag/v2.0.16-dev2

AdrianGarside commented 3 months ago

Also - can you tell me more about these sensors?

2024-03-10 21:20:56.511 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Unknown
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Info
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Common
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Serious
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Fatal
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS

jjvelar commented 3 months ago

New release with a bit more logging around the HMS error list setting/retrieval to see if that proves it's involved in this: https://github.com/greghesp/ha-bambulab/releases/tag/v2.0.16-dev2

Done. Will keep you updated.

jjvelar commented 3 months ago

Also - can you tell me more about these sensors?

2024-03-10 21:20:56.511 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Unknown
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Info
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Common
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Serious
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS_Fatal
2024-03-10 21:20:56.512 INFO (MainThread) [homeassistant.components.mqtt.discovery] Found new component: sensor A1_A1 HMS

This is automatically discovered by my MQTT server and shown as a collection of sensors under the main "homeassistant" topic.

[screenshot: autodiscovered A1_A1 topics in MQTT]

The first time I saw it, I thought it was a new feature of the Bambu Lab MQTT implementation, but when I had a look at the content, it is just the configuration of the sensor, without real data. Below is an example of one of the automatically created sensors, for the printer IP address.

{"name":"IP Address","entity_category":"diagnostic","device":{"identifiers":["{\"bambu_lab\":\"A1_A1\",\"type\":\"printer\",\"serial\":\"xxx\",\"model\":\"A1\"}","A1_A1","xxx"],"manufacturer":"Bambu Labs","model":"A1","name":"A1_A1"},"unique_id":"A1_A1_IP_Address","object_id":"A1_A1_IP_Address","has_entity_name":false,"state_topic":"homeassistant/sensor/A1_A1/IP_Address/state","json_attributes_topic":"homeassistant/sensor/A1_A1/IP_Address/attr","availability_topic":"homeassistant/sensor/A1_A1/status/state"}

AdrianGarside commented 3 months ago

I have no idea how that's getting autodiscovered; this integration doesn't push the payload I see in your screenshot. Do you also have the Node-RED Bambu integration enabled? The HMS severity levels are something its author added to both integrations.

jjvelar commented 3 months ago

I have no idea how that's getting autodiscovered; this integration doesn't push the payload I see in your screenshot. Do you also have the Node-RED Bambu integration enabled? The HMS severity levels are something its author added to both integrations.

I tried WolfwithSword's Node-RED integration, but it was too complex.

Then I tried your integration, and as it was hanging my Home Assistant, I set up an MQTT server node in Node-RED with my printer's IP to get some "easy" topics from there: print progress, nozzle and bed temperatures, etc.

No idea where this "A1_A1" is coming from, as I didn't create anything called "A1_A1". As you can see below, there are two devices, one for the printer and another one for the AMS, but all the sensors are either not available or "unknown".

[screenshots: the two MQTT-discovered devices and their unavailable/"unknown" sensors]

jjvelar commented 3 months ago

In any case, I had the same issue again last night: Home Assistant got disconnected at around 2am. Logs are attached below:

logs_2.zip

jjvelar commented 3 months ago

In case it helps, attached are the diagnostic downloads of these two "ghost" A1_A1 and AMS_0_A1 devices:

mqtt-58c861beb264424cec95a0ddd430941a-A1_A1-268220bf89501616db9d13a5c42d8c47.json mqtt-58c861beb264424cec95a0ddd430941a-AMS_0_A1-fb1825581399299642c7c66d49bbdfe7.json

AdrianGarside commented 3 months ago

In any case, I had last night the same issue again: Home Assistant got disconnected at around 2am. Logs are attached below:

logs_2.zip

This latest repro looks different. The last uncaught thread exception isn't on the MainThread, as in your prior repros / the other issue, and isn't due to the list-changed-size exception.

This is the thread that died:

2024-03-13 00:28:21.246 INFO (Thread-4 (_thread_main)) [homeassistant.components.mqtt.client] Connected to MQTT server core-mosquitto:1883 (0)
...
2024-03-13 01:49:21.654 ERROR (Thread-4 (_thread_main)) [paho.mqtt.client] Caught exception in on_message: Event loop is closed
2024-03-13 01:49:21.658 ERROR (Thread-4 (_thread_main)) [root] Uncaught thread exception

I do then see this integration fall over, but it's still successfully receiving data from the printer fine; it's just that when it calls into HA to report updated sensor data, it hits the same repeated exception that took out the thread above, by the looks of it.

2024-03-13 01:49:21.852 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Received data from: A1
2024-03-13 01:49:21.854 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Manually updated bambu_lab data
2024-03-13 01:49:21.854 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] An exception occurred calling async_set_updated_data():
2024-03-13 01:49:21.854 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Exception type: <class 'RuntimeError'>
2024-03-13 01:49:21.855 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Exception data: Event loop is closed
2024-03-13 01:49:22.998 DEBUG (Thread-7) [custom_components.bambu_lab] Manually updated bambu_lab data
2024-03-13 01:49:22.998 ERROR (Thread-7) [custom_components.bambu_lab] An exception occurred calling async_set_updated_data():
2024-03-13 01:49:22.999 ERROR (Thread-7) [custom_components.bambu_lab] Exception type: <class 'RuntimeError'>
2024-03-13 01:49:22.999 ERROR (Thread-7) [custom_components.bambu_lab] Exception data: Event loop is closed
2024-03-13 01:49:23.890 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Received data from: A1
2024-03-13 01:49:25.820 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Received data from: A1
2024-03-13 01:49:27.701 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Received data from: A1
2024-03-13 01:49:29.339 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab.pybambu] Received data from: A1
2024-03-13 01:49:29.341 DEBUG (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Manually updated bambu_lab data
2024-03-13 01:49:29.342 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] An exception occurred calling async_set_updated_data():
2024-03-13 01:49:29.342 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Exception type: <class 'RuntimeError'>
2024-03-13 01:49:29.343 ERROR (Thread-5 (mqtt_listen_thread)) [custom_components.bambu_lab] Exception data: Event loop is closed

Looking back at your previous logs with this fresh info, I see the same pattern in the prior repro: it wasn't the HA MainThread that died but the same Thread-4 (_thread_main), and from that point onwards this integration was still receiving printer data but hitting an error every time it tried to tell HA there was new data.

And the repro before that was the same.

So now I feel like I understand what's going on here even less. :( Trying to eliminate the remnants of the Node-RED setup might be worth a try, if it's really the connection to your mosquitto server that's dying and triggering the complete failure.

jjvelar commented 3 months ago

I have disabled my Node-RED MQTT broker node and re-enabled your integration. Let's see what happens. I will keep you updated.

jjvelar commented 3 months ago

Hi, New "connection refused" error. This time, only your integration was enabled as I have disabled my Node-RED MQTT broker node. Attached the log files. I have downloaded them via putty before rebooting the host.

home-assistant.log.zip