tomaae / homeassistant-mikrotik_router

Mikrotik router integration for Home Assistant
Apache License 2.0

[Bug] Integration loses connection to switch API after some time. #288

Closed Foxi352 closed 10 months ago

Foxi352 commented 1 year ago

Describe the issue

I use the Mikrotik custom integration to manage 2 different switches. On one of the switches I have automations enabling or disabling network ports.

Every one to two days the integration stops working: it seems to be disconnected from the switch API, and automations no longer work. The integration then also shows up proposing an update from the current version to the unknown version.

A simple restart of the integration fixes it for the next 1 to 2 days. This happens randomly and does not follow a predictable pattern such as "every x hours after integration restart".

How to reproduce the issue

Simply let it run for some days performing a scheduled automation from time to time.

Expected behavior

If the connection drops for whatever reason, it should be handled gracefully and the integration should reconnect.

Screenshots

Screenshot 2023-06-20 at 07 55 32

Software versions

Traceback/Error logs

Here is a log from one of the scheduled automations that disabled a switch port.

2023-06-20 07:00:00.848 ERROR (MainThread) [homeassistant.components.automation.heizung_buderus_network_port_neustarten] Heizung - Buderus Network port neustarten: Error executing script. Unexpected error for call_service at pos 1: Connection unexpectedly closed.
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 452, in _async_step
await getattr(self, handler)()
File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 685, in _async_call_service_step
await service_task
File "/usr/src/homeassistant/homeassistant/core.py", line 1910, in async_call
task.result()
File "/usr/src/homeassistant/homeassistant/core.py", line 1950, in _execute_service
await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 226, in handle_service
await service.entity_service_call(
File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 811, in entity_service_call
future.result() # pop exception if have
^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1034, in async_request_call
await coro
File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 851, in _handle_entity_call
await result
File "/config/custom_components/mikrotik_router/switch.py", line 174, in async_turn_off
self._ctrl.set_value(path, param, value, mod_param, True)
File "/config/custom_components/mikrotik_router/mikrotik_controller.py", line 385, in set_value
return self.api.set_value(path, param, value, mod_param, mod_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/config/custom_components/mikrotik_router/mikrotikapi.py", line 228, in set_value
for tmp in response:
File "/usr/local/lib/python3.11/site-packages/librouteros/api.py", line 107, in __iter__
yield from self('print')
File "/usr/local/lib/python3.11/site-packages/librouteros/api.py", line 110, in __call__
yield from self.api(
File "/usr/local/lib/python3.11/site-packages/librouteros/api.py", line 35, in __call__
yield from self.readResponse()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/api.py", line 67, in readResponse
reply_word, words = self.readSentence()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/api.py", line 53, in readSentence
reply_word, words = self.protocol.readSentence()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/protocol.py", line 187, in readSentence
sentence = tuple(word for word in iter(self.readWord, ''))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/protocol.py", line 187, in <genexpr>
sentence = tuple(word for word in iter(self.readWord, ''))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/protocol.py", line 196, in readWord
byte = self.transport.read(1)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/librouteros/connections.py", line 27, in read
raise ConnectionClosed('Connection unexpectedly closed.')
librouteros.exceptions.ConnectionClosed: Connection unexpectedly closed.

Additional context

tomaae commented 1 year ago

Seems to be a problem with the router itself. It's closing the connection.

Foxi352 commented 1 year ago

FYI, I don't know if that changes anything, but it's a 24-port PoE switch, not a router. I don't know whether the connection to the API stays open the whole time or is established only every 30 seconds on poll, or on each service call?

Simply restarting the integration solves the problem every time. I never touch the switch; it is in a rack in my server room.

Even while the integration is in an error state, I can launch a Python script from a Linux VM which uses the netmiko lib and can still successfully do the task. I used that script with a cron job before I had the integration in place. It worked for years.

Don't hesitate if you want me to do further tests or need some more info.
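A note on the netmiko comparison: netmiko normally connects over SSH, not over the RouterOS API service, so a working netmiko script does not by itself show that the API port is still reachable while the integration is in error. A minimal probe for that could look like the sketch below. It assumes the default API port 8728 (8729 for API-SSL); `api_port_open` is a hypothetical helper name, and a successful TCP connect only shows the port accepts connections, not that an API login would work.

```python
import socket


def api_port_open(host, port=8728, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    8728 is the default RouterOS API port; change it if the service
    has been moved or if API-SSL (8729) is used instead.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or no route
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `api_port_open("<switch address>")` from the Home Assistant host at the moment the entities are unavailable would help tell a blocked or closed port apart from a problem inside the integration.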

tomaae commented 1 year ago

The model does not matter, since they are all running RouterOS. It's the same thing; they can act as routers, just not optimally because of internal connections. Based on that, it seems like there is a problem with reconnecting to the device after the connection crashes. I will have to look at whether that is a global issue. It may have gone unnoticed as Mikrotik devices are usually rock solid.
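For illustration, the traceback in the report shows `ConnectionClosed` propagating out of `set_value` during a service call. One way to recover on that path is to reconnect and retry the call once. The sketch below is not the integration's actual code: `ConnectionClosed` here is a local stand-in for `librouteros.exceptions.ConnectionClosed`, `FlakyApi` and `call_with_reconnect` are hypothetical names, and it assumes the command is safe to send again after a reconnect.

```python
class ConnectionClosed(Exception):
    """Local stand-in for librouteros.exceptions.ConnectionClosed."""


def call_with_reconnect(operation, reconnect, retries=1):
    """Run operation(); on ConnectionClosed, reconnect and try again.

    operation and reconnect are zero-argument callables (hypothetical).
    The exception is re-raised once `retries` reconnect attempts are used up.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionClosed:
            if attempt >= retries:
                raise
            attempt += 1
            reconnect()


class FlakyApi:
    """Fake device API whose connection starts out dead, for demonstration."""

    def __init__(self):
        self.connected = False
        self.reconnects = 0

    def reconnect(self):
        self.connected = True
        self.reconnects += 1

    def set_value(self, path, param, value):
        if not self.connected:
            raise ConnectionClosed("Connection unexpectedly closed.")
        return (path, param, value)


api = FlakyApi()
result = call_with_reconnect(
    lambda: api.set_value("/interface", "disabled", True),
    api.reconnect,
)
```

With `retries=0` the helper behaves like the unwrapped call and lets the exception through, which keeps the failure visible when a reconnect is not wanted.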

tomaae commented 11 months ago

I have tested it and cannot reproduce this issue:

2023-09-18 09:19:13.304 ERROR (SyncWorker_2) [custom_components.mikrotik_router.mikrotikapi] Mikrotik 10.0.1.127 error while building list for path /system/resource : Connection unexpectedly closed.
2023-09-18 09:19:13.306 ERROR (MainThread) [custom_components.mikrotik_router.coordinator] Error fetching mk6 data: Mikrotik Disconnected
2023-09-18 09:22:48.312 WARNING (SyncWorker_5) [custom_components.mikrotik_router.mikrotikapi] Mikrotik Reconnected to 10.0.1.127
2023-09-18 09:22:48.373 INFO (MainThread) [custom_components.mikrotik_router.coordinator] Fetching mk6 data recovered

Can you give me more information?
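For reference, the gap between the first error and the reconnect message in the log above can be read off the timestamps. A small sketch, assuming every log line starts with a 23-character `YYYY-MM-DD HH:MM:SS.mmm` timestamp (`seconds_between` is a hypothetical helper):

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(first_line, second_line):
    """Seconds elapsed between the timestamps of two log lines."""
    start = datetime.strptime(first_line[:23], TIMESTAMP_FORMAT)
    end = datetime.strptime(second_line[:23], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


closed = (
    "2023-09-18 09:19:13.304 ERROR (SyncWorker_2) "
    "[custom_components.mikrotik_router.mikrotikapi] Mikrotik 10.0.1.127 "
    "error while building list for path /system/resource : "
    "Connection unexpectedly closed."
)
reconnected = (
    "2023-09-18 09:22:48.312 WARNING (SyncWorker_5) "
    "[custom_components.mikrotik_router.mikrotikapi] "
    "Mikrotik Reconnected to 10.0.1.127"
)

gap = seconds_between(closed, reconnected)
```

Here the gap comes out at roughly 215 seconds, so in this test the polling path did recover, a little over three and a half minutes after the connection closed.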

Foxi352 commented 11 months ago

I have two Mikrotik switches. One is just integrated, but I don't do anything (yet) with its device / entities. The second switch is the one this ticket refers to.

On that second switch, I have two automations running:

Sometimes this runs for days without problems, and sometimes the problem occurs once or twice a day. The problems are the ones described in this ticket:

When I reload the integration just for the switch in error, everything starts working again, until the next time it errors out.

I upgraded yesterday to HA 2023.9.2 and Mikrotik integration v2.1.4. Since then it has not errored out. I propose to wait for some days, and I will keep you informed in case it was resolved as a side effect of another fix.

Foxi352 commented 10 months ago

The problem appeared again this morning. The core-01 switch is still working and has never had that problem, but I don't do anything with it in HA. The second one, distri-01, has all entities unavailable:

Screenshot 2023-09-23 at 09 51 08

Screenshot 2023-09-23 at 09 52 04

Just reload / reinitialise distri-01 and it's good again for some hours / days.

What can I provide to help with this one?

tomaae commented 10 months ago

Check that you are not actually touching the API port or the connection itself. It could also be something like custom DDoS protection. Also check for firewall rules with a tarpit action.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 5 days with no activity.