vernemq / docker-vernemq

VerneMQ Docker image - Starts the VerneMQ MQTT broker and listens on 1883 and 8080 (for websockets).
https://vernemq.com
Apache License 2.0

Questions when subscribing and expecting to receive many messages #26

Closed by EnCl 7 years ago

EnCl commented 7 years ago

Environment

VerneMQ Version: 1.1.1 (ImageID 4317379db50c)
OS: Linux (Docker container)

Docker Client: Version 17.06.1-ce, API version 1.30, Go version go1.8.3, Git commit 874a737, Built Thu Aug 17 22:48:20 2017, OS/Arch windows/amd64

Docker Server: Version 17.06.1-ce, API version 1.30 (minimum version 1.12), Go version go1.8.3, Git commit 874a737, Built Thu Aug 17 22:54:55 2017, OS/Arch linux/amd64, Experimental: true

Expected behaviour

I have my own HTTP server which is connected to the VerneMQ broker (client A) and uses webhooks to authorize registering, publishing and subscribing to messages. Client A publishes one message when a client connects and another one when disconnects from the broker. I publish messages using my own test program which involves 2000 clients connecting to the broker and publishing messages. I'm using a third party application (client B) to subscribe to some of the messages published.

I expect to see that client B receives the 4000 messages sent by client A in a decent amount of time (at most 5 minutes).

Actual behaviour

I see all 4000 messages received by client B in approximately 50 minutes, which is a very long time for my purposes.

Details

Client A has the keep-alive set to 150 and max inflight to 10000.

On the broker I edited etc/vernemq/vernemq.conf and set (besides everything needed to use the webhooks) the following:

max_inflight_messages = 10000
max_online_messages = 10000
max_offline_messages = 10000
listener.nr_of_acceptors = 10000
erlang.async_threads = 128
leveldb.maximum_memory.percent = 80

Sometimes I see the following in the logs from the HTTP server:

org.eclipse.paho.client.mqttv3.internal.ClientState checkForActivity SEVERE: 8c894b2efbe942a7b132116497cd68cc: Timed out as no activity, keepAlive=150,000 lastOutboundActivity=1,503,497,209,568 lastInboundActivity=1,503,497,062,154 time=1,503,497,359,569 lastPing=1,503,497,209,568

Also, from testing this multiple times, the broker seems to hang and only send the messages from time to time. I can see that the HTTP server receives the messages and authorizes the publications, with periods of complete inactivity in between. I also encountered the following warnings in the broker logs:

[warning] <0.20438.0>@vmq_mqtt_fsm:connected:420 client {"newmountpoint",<<"8c894b2efbe942a7b132116497cd68cc">>} with username <<"test_auth_00001">> stopped due to keepalive expired
[warning] <0.217.0>@plumtree_broadcast:schedule_lazy_tick:599 Broadcast overloaded, 1177ms mailbox traversal, schedule next lazy broadcast in 58585ms, the min interval is 10000ms

Using fewer clients in my test program (e.g. 200), client B receives all messages in 1 minute.

I noticed that both when I'm using a small number of clients and when I'm using a big number of clients I encounter the following error:

[error] <0.20443.0>@vmq_webhooks_plugin:call_endpoint:476 calling endpoint failed due to received_payload_not_json

I am sending everything as JSON, so I don't understand why I see this error.

Today I updated my Docker for Windows to the latest version and also VerneMQ to the latest version, and I don't see any difference in my problem.

At this moment I don't know what is causing this hanging. Is there something I can change in the broker settings or in my connection settings to make things faster?

I appreciate any help here. Thank You!

larshesel commented 7 years ago

I would expect this is due to the inherent latency in calling a web endpoint for every connect, every publish and every subscribe. I'd look into measuring the baseline performance of your endpoints: can they handle that number of requests and that concurrency? If not, then your endpoint is a bottleneck slowing down VerneMQ.

One way to speed things up is to use the webhook cache so authentication and authorization don't have to go to the endpoint every time.
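For anyone following along, the cache is driven by the endpoint itself: per the vmq_webhooks documentation, returning a Cache-Control max-age header on a positive auth response lets VerneMQ reuse that decision instead of calling out for every single publish. A minimal sketch (hypothetical Flask endpoint; the route, port and payload fields are illustrative, not from this issue):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/auth_on_publish", methods=["POST"])
def auth_on_publish():
    hook = request.headers.get("vernemq-hook")  # name of the hook being invoked
    data = request.get_json()                   # client_id, topic, payload, ...
    # ... real authorization logic would go here ...
    resp = jsonify({"result": "ok"})
    resp.headers["Cache-Control"] = "max-age=300"  # let VerneMQ cache this decision for 5 minutes
    return resp

if __name__ == "__main__":
    app.run(port=8000)
```

Whether decisions actually end up in the cache can then be checked with vmq-admin webhooks cache show.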

jeremylink commented 7 years ago

Lars, I'm working on this as well and seeing similar issues (the last issue about received_payload_not_json has been resolved).

Regarding the HTTP auth server - it is able to handle the load of requests very quickly. We can track when requests come in and are resolved (and responses are sent back to the broker) - this happens in near real time. The bottleneck occurs with how fast it can send messages to the broker (single client connection to the broker). With QoS 0, messages get sent to the broker within seconds (even under load). With QoS 1, there is a significant backlog as the publisher must wait for the ACK from the broker.

I've also run a different test client that opens N (<1000) connections to the broker and it is sending messages concurrently with the HTTP server. When sent with QoS 0, all messages are immediately sent and then there is noticeable lag (30-60s) for these messages to be received. When sending QoS 1, it takes >60s for all messages to be sent to the broker.

The bottleneck seems to be with the broker and how many messages it can rx/buffer/send.

The hardware does not seem to be the issue since it can handle all the HTTP plugin requests without much trouble (due to subscribe auths, these are effectively at more than a 1:1 ratio to packets transmitted).

Regarding caching the authorization - we are doing that, but this test is focusing on what happens in a transition phase, so it does not come into play as much as it will in a steady-state scenario. Even so, I do not see noticeable lag in the HTTP requests/responses.

larshesel commented 7 years ago

Hi - I'm not completely sure I understand your analysis, but as I understand your conclusion you believe there's some kind of bottleneck in VerneMQ.

Are you sure you're using the cache? If all auth_on_publish requests hit the endpoint, then it's not surprising it takes a long time to send a lot of messages as each publish suffers the extra overhead of talking to the endpoint. What does vmq-admin webhooks cache show say?

Anyway, to try to narrow down where the time is spent, you could redo your experiments with the webhook plugin disabled, allow_anonymous set to true, the vmq_acl plugin enabled (without any ACLs) and everything else disabled - then all clients can do anything.

If this is fast and responsive then the problem lies somewhere between the webhook plugin and your endpoints. If not then it's somewhere in the interplay between VerneMQ and the clients.
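As a reference for that experiment, the relevant vernemq.conf lines might look roughly like this (key names as in recent VerneMQ versions; adjust to whatever plugins you otherwise have enabled):

```
allow_anonymous = on
plugins.vmq_webhooks = off
plugins.vmq_passwd = off
plugins.vmq_acl = on
```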

EnCl commented 7 years ago

I used allow_anonymous = on, disabled webhooks and sent more than 4000 messages. I received the messages with a subscribing client in approximately 1 minute.

jeremylink commented 7 years ago

I would expect that without any authentication we can pub/sub messages at a very high rate (even 4k in 1 minute seems slower than it should be). Under steady state, this would be the normal situation when using the cache.

I'm trying to resolve the corner-case performance for what happens when all of the endpoints come online at the same time and everything needs to be authenticated for the first time. It is this situation where the slowdown occurs - even though the webhook auth response completes very quickly.

larshesel commented 7 years ago

Hi @EnCl and @jeremylink, are you working on this project together or are you just experiencing similar issues? As it is, I'm not sure whether @jeremylink tried disabling the webhooks plugin as I suggested and measured the results, or whether that was only you, @EnCl?

larshesel commented 7 years ago

@EnCl what QoS are you using when it takes about 1 minute for 4k messages to be sent? Depending on the QoS, the latency between the client and server can play a large role. For QoS 2, for instance, the client and server need several messages to flow back and forth to complete the delivery, and 60s / 4000 msgs is roughly 15 ms/msg, so what you describe may be to be expected depending on your network performance.

EnCl commented 7 years ago

Hi Lars, yes, Jeremy and I are working together and we are facing the same issues on different instances of Verne and the HTTP server. I used QoS 1 when the 4k messages were sent in 1 minute.

What we further noticed is that by disabling some of the webhooks we were able to receive the 4000 messages in a few minutes (less than 10). By re-enabling some of the webhooks we were still able to receive the 4000 messages in a few minutes.

The plugins.vmq_acl = on setting seems to help a lot. Still working on this.

larshesel commented 7 years ago

Enabling plugins.vmq_acl basically just means all other authorization hooks are skipped.

It would be good if you could describe a single experiment and we can focus on that and try to get to the bottom of it. Discussing different experiments at the same time makes it rather hard to follow what is going on and establish the facts.

Thank you for your understanding.

jeremylink commented 7 years ago

Lars, we've been testing things quite a bit and have a pretty definitive use case where things start to fail.

We have a webhooks server supporting these callbacks:
auth_on_register (talks to the DB for authentication)
on_register (queues a status message with an endpoint-specific topic)
auth_on_subscribe
auth_on_publish (returns immediately, caching enabled for 120 seconds)

We have a test app that will:

Webhooks server:

We are also monitoring the VMQ server metrics:

We’ve tested all of the authentication/authorization calls on the webhooks server and these all complete very quickly (even one with DB calls).

Test Case #1:

Test Case #2:

Test Case #3:

Test Case #4:

The obvious fix on our side would be to make all of our status message topics the same and add the unique fields to the message body. In some cases this is doable, but it will not work when sending directly to unique endpoints.

This problem is also not an issue during steady-state since message topics will get cached after the first time, BUT it is a problem for system/endpoint restart/upgrade scenarios.

dergraf commented 7 years ago

Can you show us a plot over time (1 s resolution) for the QoS 2 test containing the following data?

counter.mqtt_publish_sent = 0
counter.mqtt_pubrec_sent = 0
counter.mqtt_pubrel_sent = 0
counter.mqtt_pubcomp_sent = 0

counter.mqtt_publish_received = 0
counter.mqtt_pubrec_received = 0
counter.mqtt_pubrel_received = 0
counter.mqtt_pubcomp_received = 0

counter.mqtt_pubrec_invalid_error = 0
counter.mqtt_pubcomp_invalid_error = 0
counter.mqtt_publish_error = 0
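A rough way to capture those at 1 s resolution, for anyone who wants to script it rather than watch the CLI, is to poll vmq-admin metrics show in a loop (a sketch, assuming the command prints name = value lines in the format quoted above):

```python
import csv
import subprocess
import time

# the QoS 2 flow counters requested above
WANTED = sorted([
    "counter.mqtt_publish_sent", "counter.mqtt_publish_received",
    "counter.mqtt_pubrec_sent", "counter.mqtt_pubrec_received",
    "counter.mqtt_pubrel_sent", "counter.mqtt_pubrel_received",
    "counter.mqtt_pubcomp_sent", "counter.mqtt_pubcomp_received",
])

with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp"] + WANTED)
    while True:
        out = subprocess.run(["vmq-admin", "metrics", "show"],
                             capture_output=True, text=True).stdout
        values = {}
        for line in out.splitlines():
            name, sep, value = line.partition("=")
            if sep:
                values[name.strip()] = value.strip()
        writer.writerow([int(time.time())] + [values.get(k, "") for k in WANTED])
        f.flush()
        time.sleep(1)  # 1 s resolution
```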

jeremylink commented 7 years ago

Not sure which test case # you mean (we don't do QoS2).

Also, right now we are not exporting the graphics automatically - just watching them via CLI.

We can run the test and just run the command to refresh them every few seconds - is that good enough?

dergraf commented 7 years ago

Oops.. ignore my message ;) sorry.

jeremylink commented 7 years ago

Lars, wondering if you had any more thoughts on this?

larshesel commented 7 years ago

No, sorry - it's a complex setup and it's not anything we've seen before so I believe it's something to do with your particular configuration / plugins / clients / ...

Can you make a small testcase reproducing this odd behaviour when you're not using any webhooks? Have you tried it outside of docker?

jeremylink commented 7 years ago

I don't think it would be valid to test without the webhooks (specifically auth_on_pub) because the problem goes away if:

All of our tests are done using Docker. Do you see performance differences when running bare metal vs. Docker?

Have you ever done load testing with webhooks?

larshesel commented 7 years ago

Ok, it wasn't clear to me from the above what the results were from disabling the webhooks. I don't use docker much myself, it was just a thought. We have done some basic load-testing of the webhooks which is why we introduced the cache as it will always be expensive to go to an external endpoint.

The webhooks plugin is dominated by the time it takes to call your endpoint, and internally there is a worker pool for each endpoint with 100 workers, so in principle you should be able to handle 100 concurrent requests per registered endpoint.
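One way to sanity-check that baseline independently of VerneMQ is to load the endpoint directly with the same concurrency as the worker pool. A throwaway sketch (the URL and payload are placeholders, not from this issue):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/auth_on_publish"   # placeholder endpoint URL
PAYLOAD = {"client_id": "bench", "topic": "status/bench", "qos": 1}

def one_request(_):
    return requests.post(URL, json=PAYLOAD, timeout=5).status_code

start = time.time()
with ThreadPoolExecutor(max_workers=100) as pool:   # mirrors the 100-worker pool
    codes = list(pool.map(one_request, range(1000)))
elapsed = time.time() - start
print(f"1000 requests at 100-way concurrency: {elapsed:.1f}s, {codes.count(200)} returned 200")
```

If the endpoint alone can't sustain that rate, publishes will queue up behind it regardless of what the broker does.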

Let's try and look closer at one of the cases above:

> Test Case #3:
> Status topics all unique, status msg QoS 0
> All status messages transmit complete in “real time”

What does complete mean here?

Which hooks are involved in this testcase - in other words, which hooks are triggered and go to a webhooks endpoint?

What are the number of publishes and subscribers? Is it a fan-in? fan-out? what's the relationship between publishers and consumers?

> Received messages are bursty and subject to delay (minutes)
> o E.g. 50 msgs received immediately, 75 more a minute later, 20 more another minute later, etc.
> All messages are typically received eventually but can take anywhere from 10-50 minutes
> Completed authorization counts match counters in broker
> o E.g. mqtt_publish_received counter continues to increase long after all messages have been sent by the webhooks app

This is weird. The mqtt_publish_received counter is incremented after parsing but before the hooks run when a message is published from the client to the broker. This means messages are still arriving at the broker, which contradicts the claim that all messages have already been sent.

dergraf commented 7 years ago

Docker will most of the time be less performant than bare metal when running VerneMQ at scale.

Can you test on bare metal and see if the issue remains?

What is your network configuration for Docker? E.g. there is certainly a difference between host and bridge mode.

jeremylink commented 7 years ago

@larshesel Complete means that the MQTT client has transmitted the messages and has made the deliveryComplete() callback (for QoS 0, this means the message has been delivered to the network interface).

We have webhooks for auth_on_register, on_register, auth_on_subscribe and auth_on_publish. Only the auth_on_publish call seems to cause problems.

We have one test client with 2k connections that each publish a single message. These messages do not seem to have any slowdown, but I will have to verify. We have our Auth Server (which handles the webhooks) that has a single connection and also publishes 2k messages. These are the messages that have issues. It would be good to create N connections instead of one and use a thread pool to transmit the messages across multiple connections. @EnCl, is this something you can try today? Add, say, 30 connections in the Auth Server and see what happens? If the problem is related to the 100 threads per connection, this very well might make the problem go away.
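For what it's worth, the connection-pool idea could look roughly like this (a Python/paho-mqtt sketch just to illustrate the shape; broker address, credentials and topics are made up, and the real Auth Server is Java):

```python
import itertools

import paho.mqtt.client as mqtt

POOL_SIZE = 30            # number of publisher connections to spread the load over
BROKER = "localhost"      # placeholder broker address

pool = []
for i in range(POOL_SIZE):
    c = mqtt.Client(client_id=f"status-publisher-{i}")
    c.username_pw_set("auth-server", "secret")   # placeholder credentials
    c.connect(BROKER, 1883)
    c.loop_start()                               # run the network loop in a background thread
    pool.append(c)

clients = itertools.cycle(pool)

def publish_status(topic, payload):
    """Round-robin each status message over the connection pool."""
    next(clients).publish(topic, payload, qos=1)

# e.g. the 2000 unique status topics from the test
for n in range(2000):
    publish_status(f"status/endpoint-{n:05d}", b'{"online": true}')
```

Each connection then only serializes its own share of the auth_on_publish round trips.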

For subscribers, we typically have one or two subscribers on a shared topic, but for our testing, we've been restricting this to a single subscriber.

Regarding the mqtt_publish_received counter: it is surprising if that is how it works, but it somewhat matches what we see - when things get blocked, even the MQTT driver's ping messages get blocked and cause the connection to the broker to time out (and then everything auto-reconnects).

jeremylink commented 7 years ago

Also, I know that in the ACL file we can use wildcards for both pub and sub authorization; is it possible to use wildcards for webhook authorization? And is it possible to add entries to the ACL file through the management interface on port 8888?

If not, I can probably come up with a way to add DB authorization for the message pub.
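For reference, wildcard authorization in the vmq_acl file looks roughly like this (an illustrative vmq.acl fragment with made-up usernames and topics; check the file-based authorization docs for the exact syntax supported by your version):

```
# allow the auth server client to publish status messages anywhere under status/
user auth-server
topic write status/#

# allow the monitoring client to subscribe to all status topics
user monitor
topic read status/#
```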

jeremylink commented 7 years ago

@dergraf Regarding Docker vs. bare metal: many of the developers are on Windows, and so it is much easier to deploy the broker via Docker when testing. For production, we almost always deploy on Docker containers. I've monitored the performance and not seen any issues, but depending on how other testing goes, I can always change our EC2 hosts from small to XL.

For our testing, we're just using bridged mode. For our multi-node environment, we will use Host mode.

larshesel commented 7 years ago

Hi @jeremylink

> We have webhooks for auth_on_register, on_register, auth_on_subscribe and auth_on_publish. Only the auth_on_publish call seems to cause problems.

> We have one test client with 2k connections that each publish a single message. These messages do not seem to have any slowdown, but I will have to verify.

So you have a single application which connects as 2000 separate clients, each publishing a single message? Are these messages published concurrently?

> We have our Auth Server (which handles the webhooks) that has a single connection and also publishes 2k messages. These are the messages that have issues.

I'm not sure how to understand that. So the auth server serves the webhook endpoints but is also publishing messages as a single client? If that is the case we should focus on this and remove the other test app with the 2000 clients from the picture.

Are all of these 2k messages published via a single client to different topics (is the webhook cache used)? What QoS level are they?

> It would be good to create N connections instead of one and use a thread pool to transmit the messages across multiple connections. @EnCl, is this something you can try today? Add, say, 30 connections in the Auth Server and see what happens? If the problem is related to the 100 threads per connection, this very well might make the problem go away.

EnCl commented 7 years ago

Hi Lars,

We have a test application which launches 2000 clients. All these clients connect to VerneMQ and publish one test message.

The Auth Server which handles the webhooks is connected to the VerneMQ broker using one MQTT client. When auth_on_register is called and a client is authenticated, it publishes a status message.

So we will have 2000 test messages published by 2000 clients in the test application and 2000 status messages published by the single client from Auth Server. The status messages are published to unique topics.

We have another 3rd party application which we use to subscribe to the status messages. We receive all these status messages after a big delay, but if we publish the status messages to the same topic and enable the cache for publish, then the delay goes away. This behaviour is seen for QoS 1 and QoS 0. The same behaviour was observed even when we wrote our own application to subscribe to the status messages. We also subscribed to the other 2000 test messages and we receive those much faster.

I did what Jeremy suggested and used 10 MQTT clients in the HTTP server to publish the status messages. The delay is now reduced to 5 minutes.

larshesel commented 7 years ago

It's clear then what the problem is: each publish goes through a webhook which always has to call out to the endpoint and await a response, and all publishes are done one after another as they're coming from a single client.

jeremylink commented 7 years ago

@larshesel I agree that the problem can be mitigated by increasing the number of client connections, but I wouldn't say that solves the problem.

Let's say I transmit 100 packets and it takes <10s to receive all of them at the subscriber. Now, if I increase that to 1k packets, I would expect it to take <100s, but instead it might take 1500s. So what we see is a "tipping" point, and beyond that the delay is exponential instead of linear.

So I can add additional client connections and this will increase the ability to handle more packets, but at some load point, we're going to run into the same problem, and with the delay being exponential, it is very bad.

I posted something similar to this on Slack back on July 31st. This was a time when I understood things much less than I do now, but I saw similar behavior. Up to 100 packets, everything was great, but once I crossed that threshold, everything slowed way down. I wasn't even getting the first 100 packets coming through - it was much less than that.

Does it make sense for me to put together a simple test app in Python/Java that will just authorize published messages? Do you need something like this to allow you to see the problem on your side?

jeremylink commented 7 years ago

One more question: you said there are 100 threads available to process messages from each client? So if I have 10k endpoints connected to the broker, there would be 1 million threads to handle these? That seems like an insanely big number...

larshesel commented 7 years ago

If you're using 10k endpoints then you're using webhooks for something it wasn't designed for, or at least a use case which was not in our minds when designing it. Normally you would only register a handful of endpoints.

Trying to understand and support your use case is not within the scope of our GitHub issues, which are rather for reporting bugs and deficiencies. I would suggest you get in touch with us for commercial support so we can get to understand your project and help you in a more constructive way.