halfgaar / FlashMQ

FlashMQ is a fast light-weight MQTT broker/server, designed to take good advantage of multi-CPU environments
https://www.flashmq.org/
Open Software License 3.0
173 stars 24 forks source link

keepalive interval vs paho-mqtt #82

Closed skinkie closed 4 months ago

skinkie commented 4 months ago

I have been faced with a very odd situation where clients seem to be keeping reconnecting. Could not put my finger behind it. I might have reduced it to paho doing a default keep alive interval of 60s, and flashmq setting 5 seconds(?)

When updated the paho connect keepalive interval to 15, the client stayed connected.

I noticed that the situation is observed by a subscribe only process, have low amounts of traffic. I don't know if this is the exclusive case.

halfgaar commented 4 months ago

Did you mean the log segment from #83 you created and closed to be for this ticket?

I can't really follow what you're saying. Your log shows your clients connects with a keep-alive timeout of 60s. The MQTT specs says the clients should be kicked after 1.5 times that, which FlashMQ does. I don't see any log lines to that effect. Instead, your clients connect and disconnect cleanly, about every 15 seconds. They connect and disconnect within 1 ms.

Can you show more client-side code? And, if the log I was looking at is not the correct one, can you also show logs?

skinkie commented 4 months ago

Did you mean the log segment from #83 you created and closed to be for this ticket?

Not really. I thought the bug of #83 was that the fd's were exactly the same, like it looked like it kept closing the same file descriptor. I thought that was potentially a bug. But I understood why it was creating new connections (being the one shot, mqttclient.publish method).

The issue I have here is the opposite, long running subscribers suddenly disconnecting, or being disconnected.

I can't really follow what you're saying. Your log shows your clients connects with a keep-alive timeout of 60s. The MQTT specs says the clients should be kicked after 1.5 times that, which FlashMQ does. I don't see any log lines to that effect. Instead, your clients connect and disconnect cleanly, about every 15 seconds. They connect and disconnect within 1 ms.

I'll try to come with with something reproducable... as in a few lines of code. If it is really as easy as 15 vs 60s that must be possible.

halfgaar commented 4 months ago

You could post some FlashMQ log lines already. Perhaps I can already see something abnormal in them, and that saves you time.

skinkie commented 4 months ago

You could post some FlashMQ log lines already. Perhaps I can already see something abnormal in them, and that saves you time.

[2024-02-16 17:47:17.440] [NOTICE] Removing client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]'. Reason(s): Connection reset by peer
[2024-02-16 17:47:18.442] [NOTICE] Client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]' logged in successfully
[2024-02-16 17:47:48.556] [NOTICE] Removing client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]'. Reason(s): epoll says socket is in ERR or HUP state.
[2024-02-16 17:47:49.558] [NOTICE] Client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]' logged in successfully
[2024-02-16 17:48:19.636] [NOTICE] Removing client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]'. Reason(s): epoll says socket is in ERR or HUP state.
[2024-02-16 17:48:20.638] [NOTICE] Client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]' logged in successfully
[2024-02-16 17:48:31.500] [NOTICE] Removing client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]'. Reason(s): epoll says socket is in ERR or HUP state.
[2024-02-16 17:48:49.170] [NOTICE] Client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=1]' logged in successfully
halfgaar commented 4 months ago

Interesting. A keep-alive of 15s is rather short BTW. I don't know if that's on purpose or not.

The interval between connecting and breaking the connection is 30s and 10s. Not constant.

Note: your clientid is too predictable. It opens up the door for somebody knowing it and kicking it out. Or, if you have two clients, they will be kicking each other out. You need to add a random string. Or, since you're using MQTT5, you can leave it empty, and the server will assign one.

skinkie commented 4 months ago

Interesting. A keep-alive of 15s is rather short BTW. I don't know if that's on purpose or not.

I hoped it would get better, not worse. Default in paho is 60.

The interval between connecting and breaking the connection is 30s and 10s. Not constant.

Note: your clientid is too predictable. It opens up the door for somebody knowing it and kicking it out. Or, if you have two clients, they will be kicking each other out. You need to add a random string. Or, since you're using MQTT5, you can leave it empty, and the server will assign one.

This is a single proces with a single client_id. So that was 'kind of' expected.

skinkie commented 4 months ago

@halfgaar can you check my following hypothesis?

  1. connect a client_id "1" (lets same this A)
  2. connect another client with the same client_id "1" (lets name this B)
  3. observe that both A and B get disconnected, while the user would expect that only A (the first) would get disconnected
halfgaar commented 4 months ago

That doesn't happen. You can check that in the logs. The log fragment you posted says (note the reason):

[2024-02-16 17:48:31.500] [NOTICE] Removing client '[ClientID='triggers_portal', username='portal_publisher', fd=29, keepalive=15s, transport='TCP/Non-SSL', address='127.0.0.1', prot=5.0, clean=0]'. Reason(s): epoll says socket is in ERR or HUP state.

If you connect with an existing id, it says:

[2024-02-17 15:14:40.483] [NOTICE] Removing client '[ClientID='myfixedid', username='', fd=26, keepalive=60s, transport='TCP/Non-SSL', address='127.0.0.1', prot=3.1.1, clean=1]'. Reason(s): Another client with this ID connected, Server initiating disconnect with reason code '142'

The fact that the socket is an error state, suggests something is wrong with your connection or client.

skinkie commented 4 months ago

Ok, I switched my code from paho-mqtt to gmqtt, ...and no issues anymore.