nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

MQTT support should be horizontally scalable #3116

Open ianzhang1988 opened 2 years ago

ianzhang1988 commented 2 years ago

Feature Request

nats-server: v2.8.0

When I benchmarked the MQTT protocol, I found that MQTT support is implemented with 3 streams, each with replicas set to 3:

│ $MQTT_rmsgs │             │ 2022-05-10 09:05:33 │ 0        │ 0 B     │ never        │
│ $MQTT_sess  │             │ 2022-05-10 09:05:28 │ 28,348   │ 6.1 MiB │ 6m50s        │
│ $MQTT_msgs  │             │ 2022-05-10 09:05:32 │ 561,704  │ 176 MiB │ 2h45m31s     │

nats stream info '$MQTT_msgs' output:

Configuration:

             Subjects: $MQTT.msgs.>
     Acknowledgements: true
            Retention: File - Interest
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats
               Leader: nats-5
              Replica: nats-3, current, seen 0.05s ago
              Replica: nats-4, current, seen 0.05s ago

State:

             Messages: 561,704
                Bytes: 176 MiB
             FirstSeq: 1 @ 2022-05-11T04:45:29 UTC
              LastSeq: 960,236 @ 2022-05-11T06:22:32 UTC
     Deleted Messages: 398532
     Active Consumers: 9419

As I understand it, this can only utilize 3 servers.

Use Case:

Where MQTT is used.

Proposed Change:

Enable horizontal scaling.

Who Benefits From The Change(s)?

Everyone using MQTT.

Alternative Approaches

kozlovic commented 2 years ago

@ianzhang1988 Correct, currently you cannot set the number of replicas; instead, the server determines the number based on the size of the cluster (number of routes) but caps it at 3.

We are planning to add a Replicas option to the MQTT configuration block.

How many replicas are you planning on running, though? Also, I noticed that you have quite a lot of consumers. NATS currently is a bit limited in horizontal scalability because of the number of RAFT groups (one per asset, an asset being a stream or a consumer). So the more consumers you have, the more the baseline RAFT traffic increases, potentially exceeding the actual data flow. I would recommend that you look at using wildcards wherever possible. For instance (and I understand that this may not be applicable), instead of consuming on foo/bar and foo/baz you could consume on foo/+ and then have the consumer process messages based on the subject.
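For illustration, a minimal sketch of that wildcard approach using the Eclipse Paho Go client (the broker URL, client ID, and concrete topics are placeholders, not anything from this issue):

package main

import (
	"fmt"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Placeholder broker address and client ID.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://127.0.0.1:1883").
		SetClientID("aggregator-1")
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}

	// One QoS 1 subscription on foo/+ replaces separate subscriptions on
	// foo/bar, foo/baz, etc.; dispatch on the concrete topic instead.
	if t := client.Subscribe("foo/+", 1, func(_ mqtt.Client, m mqtt.Message) {
		switch m.Topic() {
		case "foo/bar":
			fmt.Println("bar:", string(m.Payload()))
		case "foo/baz":
			fmt.Println("baz:", string(m.Payload()))
		}
	}); t.Wait() && t.Error() != nil {
		panic(t.Error())
	}

	select {} // keep the process alive
}

The point, as described above, is that a single wildcard subscription means a single consumer on the server rather than one per topic.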

ianzhang1988 commented 2 years ago

It would be best if we could use all servers in a cluster so that we can meet new demand. Is it possible for MQTT to use multiple streams across all servers, and to add new streams when scaling up?

About wildcards: in a broadcast scenario, say we have a lot of IoT devices that subscribe to some info from the cloud. Can they use the same consumer, or is it one consumer per device?

kozlovic commented 2 years ago

It would be best if we could use all servers in a cluster so that we can meet new demand. Is it possible for MQTT to use multiple streams across all servers, and to add new streams when scaling up?

Unfortunately no. MQTT uses a single stream, called $MQTT_msgs, to store all MQTT messages. It collects subjects on $MQTT.msgs.>. Incoming MQTT messages, when processed and transformed into NATS messages, have this prefix added so that they are stored in this stream. But they are also processed without the prefix, which allows them to be handled as NATS messages on the "original" MQTT topic, which means that technically they could also be stored in separate JetStream streams that collect the equivalent subject.

For instance, an MQTT publisher publishes on topic foo/bar; this is transformed to foo.bar and processed both as foo.bar and $MQTT.msgs.foo.bar, which means it will be stored in the MQTT stream $MQTT_msgs. But if you also have a JetStream stream called MY_STREAM that collects subjects on foo.> (for instance), then this message will also be stored in that stream. The problem is that when an MQTT subscriber consumes, it consumes only from the MQTT stream, not MY_STREAM. So you can "duplicate" MQTT messages into JetStream and have them consumed with NATS JetStream consumers, but not by MQTT consumers.
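For illustration, a minimal sketch of "duplicating" MQTT traffic into a regular JetStream stream with the NATS Go client (the connection URL is a placeholder; the stream name MY_STREAM and subject foo.> follow the example above):

package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder connection URL.
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// A regular JetStream stream whose subjects overlap the NATS subjects
	// produced by MQTT publishes on foo/... topics.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "MY_STREAM",
		Subjects: []string{"foo.>"},
	}); err != nil {
		panic(err)
	}

	// A JetStream consumer (not an MQTT subscriber) can then read the copies.
	sub, err := js.SubscribeSync("foo.>", nats.BindStream("MY_STREAM"))
	if err != nil {
		panic(err)
	}
	for {
		msg, err := sub.NextMsg(nats.DefaultTimeout)
		if err != nil {
			break
		}
		fmt.Println(msg.Subject, string(msg.Data))
	}
}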

About wildcards: in a broadcast scenario, say we have a lot of IoT devices that subscribe to some info from the cloud. Can they use the same consumer, or is it one consumer per device?

What I was referring to is whether there are options for you to reduce the number of consumer applications. That is, instead of having 10 applications each creating its own MQTT subscription with a token in the topic that indicates it is for them, have a single application that subscribes with a wildcard (on that token) and then does specific processing based on the value of that token. But I understand that this is not applicable in many cases (such as a device that needs to consume data directly).

ianzhang1988 commented 2 years ago

Thanks for the clarification.

Would it be possible to improve this by adding prefixes like $MQTT.msgs.1 ... $MQTT.msgs.n for MQTT messages, and creating corresponding streams that collect on $MQTT.msgs.1.> ... $MQTT.msgs.n.>?

Then MQTT would publish messages to those streams in a round-robin style, and the consumer for an MQTT subscriber would consume from all of those streams.

kozlovic commented 2 years ago

@ianzhang1988 Unfortunately, this is not possible: consumers cannot span multiple streams.

ianzhang1988 commented 2 years ago

I see.

What's the plan for MQTT support in the future? Is MQTT support a priority or a nice-to-have feature?

Thanks again.

kozlovic commented 2 years ago

Is MQTT support a priority or a nice-to-have feature?

That's a good question. We are actually seeing more and more interest in MQTT at high scale, and we will need to see if we can make this better. We are introducing a way to configure (globally) the stream replicas count, and possibly to lower the consumer replicas count (or even make it a memory-only consumer). Still, I think the original intent was more to have a way to integrate MQTT with NATS, so I was thinking more of MQTT publishers with NATS consumers, or vice-versa, but that does not seem to be the case here.

ianzhang1988 commented 2 years ago

MQTT is popular in edge computing. We already have some projects using MQTT (EMQX) for communication among several hundred thousand devices. Alas, we know little about Erlang, so we have trouble understanding some of EMQX's behavior. That's why I am looking into NATS, trying to replace EMQX while keeping all clients compatible.

Even better, by using NATS we can create a stream to collect data from MQTT topics for later analysis, instead of subscribing to the data and pushing it to Kafka; that's two birds with one stone.

So I hope MQTT can be an important part of the future plan.

And, thanks for your patience @kozlovic

derekcollison commented 2 years ago

It is important, and Edge is a key area for us. Thanks for the feedback, very helpful.

kozlovic commented 2 years ago

@ianzhang1988 As a step to help scalability, we have merged PR #3130, which allows configuring consumers with a lower replica count than the parent stream. So you could have a cluster of 5 and configure the stream to be R=5 (with mqtt { stream_replicas: 5 }) but make the consumers potentially R=1 (or more, of course) with mqtt { consumer_replicas: 1 }.

ianzhang1988 commented 2 years ago
mqtt {
  stream_replicas: 5
  consumer_replicas: 1
}

With the above configuration, I see improved benchmark results, and the cluster is more stable. Thanks for the quick response.

I think consumer_replicas: 1 is the key factor; consumers would use less disk IO, I assume. And the cluster no longer gets stuck at RAFT sync ( when running the benchmark sub with more than 10k connections on v2.8.2, the servers printed lots of RAFT-related consumer errors and the cluster would no longer work ).

Does stream_replicas: 5 mean that consumers can read from the stream replicas evenly? I thought the leader did all the reads and writes.


FYI, I use faryne/emqtt-bench-tool from Docker Hub as my benchmark tool, with QoS set to 1.

kozlovic commented 2 years ago

Does stream_replicas: 5 mean that consumers can read from the stream replicas evenly? I thought the leader did all the reads and writes.

There is a leader for the stream, and there is a leader for each consumer, and they don't have to be the same. So with the stream at R=5 and consumers at R=1, you reduce the use of RAFT and also distribute the consumers' workload "evenly" across the 5 servers.

ianzhang1988 commented 2 years ago

I notice that after some runs of the benchmark tool, lots of consumers are left on the cluster, and there is a performance downgrade for consumers. When I pub 8k msgs/s, I could only sub at 1k/s with 1 consumer. After I used nats consumer rm to delete all consumers, pub and sub could both reach 8k/s.

Also, nats consumer ls shows repeated consumer names.

Any idea why?

Those consumers would continue to receive messages from the stream, right? As they accumulate, this could be bad. Is there a way to prevent this?

And how can I purge these consumers without deleting the active ones?

kozlovic commented 2 years ago

Also, nats consumer ls shows repeated consumer names. Any idea why?

No, could you post an excerpt of the output? Normally these names are randomly generated and correspond to a QoS 1 subscription.

Those consumers would continue to receive messages from the stream, right? As they accumulate, this could be bad. Is there a way to prevent this?

They will stay if they have been created by a session that has the "clean session" flag set to false, that is, a session that asks to be persisted. If you look at the $MQTT_sess stream, how many messages does it report? That would be equivalent to the number of sessions that are either active or have been persisted because their "clean session" flag was false.
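(For example, the same CLI command used earlier, nats stream info '$MQTT_sess', will report the message count for that stream.)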

And how can I purge these consumers without deleting the active ones?

If it is because of sessions being left behind, then you would need to restart the session with the same client ID but with the "clean session" flag set to true; then, when this session is closed, it will remove the session and all of its QoS 1 subscriptions.
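A minimal sketch of that cleanup with the Eclipse Paho Go client (the broker URL is a placeholder, and the client ID must match the leftover session):

package main

import (
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Reconnect with the SAME client ID as the leftover session, but with
	// the clean session flag set to true.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://127.0.0.1:1883"). // placeholder broker address
		SetClientID("device-42").          // placeholder; must match the old session
		SetCleanSession(true)

	c := mqtt.NewClient(opts)
	if t := c.Connect(); t.Wait() && t.Error() != nil {
		log.Fatal(t.Error())
	}
	time.Sleep(time.Second)

	// Closing this clean session removes the persisted session and, with it,
	// its QoS 1 subscriptions (and their server-side consumers).
	c.Disconnect(250)
}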

ianzhang1988 commented 2 years ago

They will stay if they have been created by a session that has the "clean session" flag set to false

My benchmark tool (faryne/emqtt-bench-tool) does not have an option for clean session, so I forced it to be true by changing the code in mqtt.go around line 2700:

    // Is the client requesting a clean session or not.
    // cleanSess := cp.flags&mqttConnFlagCleanSession != 0
    cleanSess := true

But still, some consumers remained after the benchmark tool exited.

My code is at bb9e94220893983758276d044d7d31e83562212e, compiled with Go 1.16.3. The cluster has 5 nats-servers on Kubernetes, with CephFS as storage.

I can reproduce it with these steps:

  1. pub 10msg/s qos 1 ( ./emqtt_bench pub -S -s 1 -t test -V 4 -q 1 -I 1000 -c 10 )
  2. sub with 5000 connections qos 1 ( ./emqtt_bench sub -S -t test -V 4 -i 1 -q 1 -c 5000 )
  3. Ctrl+C to close the sub benchmark. ( now there are consumers left in the cluster )
  4. Repeat steps 2 and 3. ( now I see repeated consumer IDs )

./nats -s foo --tlsca=ca.crt consumer ls \$MQTT_msgs | head -30

Consumers for Stream $MQTT_msgs:

        01eY5YtI_HKMJnXHHpjaa4d426l4Sra
        01eY5YtI_HKMJnXHHpjaa4d426l4Sra
        01v2kFuK_FvGniUvpHrqgfI6SHTowKb
        01v2kFuK_FvGniUvpHrqgfI6SHTowKb
        03fU3av9_HKMJnXHHpjaa4d426kyE9C
        03fU3av9_HKMJnXHHpjaa4d426kyE9C
        048LwLNh_FvGniUvpHrqgfI6SHTjwYx
        048LwLNh_FvGniUvpHrqgfI6SHTjwYx
        04EIwDaU_q3jSeOpuzcJv5J1JfDRGxO
        04EIwDaU_q3jSeOpuzcJv5J1JfDRGxO
...

./nats -s foo --tlsca=ca.crt stream report

Obtaining Stream stats

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                         Stream Report                                                         │
├─────────────┬─────────┬───────────┬───────────┬──────────┬─────────┬──────┬─────────┬─────────────────────────────────────────┤
│ Stream      │ Storage │ Placement │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                                │
├─────────────┼─────────┼───────────┼───────────┼──────────┼─────────┼──────┼─────────┼─────────────────────────────────────────┤
│ $MQTT_rmsgs │ File    │           │ 5         │ 0        │ 0 B     │ 0    │ 0       │ nats-0, nats-1, nats-2, nats-3*, nats-4 │
│ $MQTT_msgs  │ File    │           │ 1660      │ 13,921   │ 1.0 MiB │ 0    │ 2633718 │ nats-0, nats-1*, nats-2, nats-3, nats-4 │
│ $MQTT_sess  │ File    │           │ 0         │ 6,858    │ 3.3 MiB │ 0    │ 75255   │ nats-0, nats-1, nats-2, nats-3, nats-4* │
╰─────────────┴─────────┴───────────┴───────────┴──────────┴─────────┴──────┴─────────┴─────────────────────────────────────────╯
ianzhang1988 commented 2 years ago

log from server nats-4:


[6] 2022/05/21 08:58:18.234851 [ERR] 10.62.243.140:37670 - mid:31898 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:58:28.546323 [ERR] 10.62.243.140:7328 - cid:31899 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:58:28.942764 [ERR] 10.62.255.152:38104 - cid:31900 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:58:32.924297 [ERR] 10.62.243.140:52772 - mid:31901 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:58:36.984491 [ERR] 10.63.124.26:18235 - cid:31902 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 08:58:49.134761 [ERR] 10.63.124.26:18275 - cid:31903 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 08:58:49.599045 [ERR] 10.62.243.140:24942 - cid:31904 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:01.186057 [ERR] 10.63.124.26:18329 - cid:31905 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 08:59:20.090921 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > mQDW6FXV_qxQ3p0FT424ARtoQSEUM2T'
[6] 2022/05/21 08:59:20.099374 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > CftRhg4D_FvGniUvpHrqgfI6SHTjK4f'
[6] 2022/05/21 08:59:20.100749 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > HFDBsdMZ_HKMJnXHHpjaa4d426kxEOo'
[6] 2022/05/21 08:59:20.101743 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Q1fmpN6c_4LrfXZGuGGMxyrKvHY0pC2'
[6] 2022/05/21 08:59:20.102166 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > G6zM3OYn_HKMJnXHHpjaa4d426kxEFu'
[6] 2022/05/21 08:59:20.102460 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 0MUtZFD1_HKMJnXHHpjaa4d426kxEpW'
[6] 2022/05/21 08:59:20.102901 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > JFByYT35_4LrfXZGuGGMxyrKvHY0pYS'

...

[6] 2022/05/21 08:59:33.628236 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > yrhDdoxa_4LrfXZGuGGMxyrKvHY364c'
[6] 2022/05/21 08:59:33.628599 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > TUDpEo1E_qxQ3p0FT424ARtoQSFSPx3'
[6] 2022/05/21 08:59:33.629005 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Hvc7Xs3R_qxQ3p0FT424ARtoQSFSQkj'
[6] 2022/05/21 08:59:33.629528 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > vuQEvW83_q3jSeOpuzcJv5J1JfDVnkM'
[6] 2022/05/21 08:59:33.629961 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > FayBpbAE_FvGniUvpHrqgfI6SHTnalx'
[6] 2022/05/21 08:59:33.636441 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > dwHdTuYr_FvGniUvpHrqgfI6SHTnbUJ'
[6] 2022/05/21 08:59:33.642071 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > LjPLyvdK_qxQ3p0FT424ARtoQSFShZx'
[6] 2022/05/21 08:59:33.675022 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > rOF3B6C2_HKMJnXHHpjaa4d426l1eX4'
[6] 2022/05/21 08:59:33.677476 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Tiuoe6NH_HKMJnXHHpjaa4d426l1efy'
[6] 2022/05/21 08:59:33.684603 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > ugQ6anBW_4LrfXZGuGGMxyrKvHY37NK'
[6] 2022/05/21 08:59:33.695819 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > n5ebzlbS_HKMJnXHHpjaa4d426l1exm'
[6] 2022/05/21 08:59:33.711352 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > hOhDE05j_FvGniUvpHrqgfI6SHTne3z'
[6] 2022/05/21 08:59:33.717322 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Cczhhn73_4LrfXZGuGGMxyrKvHY37si'
[6] 2022/05/21 08:59:33.723851 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > UmVIh8vr_qxQ3p0FT424ARtoQSFTCtt'
[6] 2022/05/21 08:59:33.724542 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > f3RArRZa_FvGniUvpHrqgfI6SHTneCr'
[6] 2022/05/21 08:59:33.746156 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 7cJq09RJ_q3jSeOpuzcJv5J1JfDVrJS'
[6] 2022/05/21 08:59:33.755541 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 29XgArOa_HKMJnXHHpjaa4d426l1ggc'
[6] 2022/05/21 08:59:33.759659 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > bsp03Mq4_qxQ3p0FT424ARtoQSFTQ0n'
[6] 2022/05/21 08:59:33.763379 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > MmFHowV0_qxQ3p0FT424ARtoQSFTRc9'
[6] 2022/05/21 08:59:33.769120 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 9ePzAxsi_HKMJnXHHpjaa4d426l1gyQ'
[6] 2022/05/21 08:59:33.795661 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > BqzXSz2j_qxQ3p0FT424ARtoQSFTfEf'
[6] 2022/05/21 08:59:33.814676 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > jPO131wr_4LrfXZGuGGMxyrKvHY39KO'
[6] 2022/05/21 08:59:33.820759 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > GgkZxDxV_q3jSeOpuzcJv5J1JfDVtnG'
[6] 2022/05/21 08:59:33.828181 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > AhgPjUXX_HKMJnXHHpjaa4d426l1ihG'
[6] 2022/05/21 08:59:33.839794 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > yx9t5M5m_qxQ3p0FT424ARtoQSFTw8P'
[6] 2022/05/21 08:59:33.842980 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 2F2NafGx_q3jSeOpuzcJv5J1JfDVuPi'
[6] 2022/05/21 08:59:33.851382 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > phf23KXQ_FvGniUvpHrqgfI6SHTnhdl'
[6] 2022/05/21 08:59:33.857599 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > RhcXM8BO_4LrfXZGuGGMxyrKvHY39yk'
[6] 2022/05/21 08:59:33.871375 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > dJhX3Ov5_FvGniUvpHrqgfI6SHTni4N'
[6] 2022/05/21 08:59:33.884240 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > F3sQwgOe_qxQ3p0FT424ARtoQSFUE7t'
[6] 2022/05/21 08:59:33.912738 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > fx0Oenn4_q3jSeOpuzcJv5J1JfDVwju'
[6] 2022/05/21 08:59:33.941137 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > yaUUhE2z_q3jSeOpuzcJv5J1JfDVxMM'
[6] 2022/05/21 08:59:34.534377 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > WS94OYGE_qxQ3p0FT424ARtoQSFUgpF'
[6] 2022/05/21 08:59:34.558186 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > pGySRkoW_4LrfXZGuGGMxyrKvHY3Bds'
[6] 2022/05/21 08:59:34.558556 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > RPowTlcs_q3jSeOpuzcJv5J1JfDVy8Q'
[6] 2022/05/21 08:59:34.559280 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > G7BR4BA3_FvGniUvpHrqgfI6SHTnkVB'
[6] 2022/05/21 08:59:34.560068 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > RMngIO3G_q3jSeOpuzcJv5J1JfDVybG'
[6] 2022/05/21 08:59:34.560653 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > XdQgvY6G_HKMJnXHHpjaa4d426l1n9G'
[6] 2022/05/21 08:59:34.561096 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > sgdwL1NA_4LrfXZGuGGMxyrKvHY3CTR'
[6] 2022/05/21 08:59:34.561526 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > JXM7ta5a_HKMJnXHHpjaa4d426l1nrm'
[6] 2022/05/21 08:59:34.562099 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Kp3Pzs5y_4LrfXZGuGGMxyrKvHY3Cee'
[6] 2022/05/21 08:59:34.576652 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > LDPtA89i_q3jSeOpuzcJv5J1JfDW154'
[6] 2022/05/21 08:59:40.106968 [ERR] 10.62.243.140:41320 - cid:34938 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:41.266728 [ERR] 10.62.242.216:50344 - mid:34939 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:44.584750 [ERR] 10.62.243.140:20867 - mid:34940 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:49.620124 [ERR] 10.63.124.26:18495 - cid:34941 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 08:59:54.599917 [ERR] 10.62.242.216:57210 - mid:34942 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:54.810242 [ERR] 10.62.242.216:21218 - cid:34943 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 08:59:58.852629 [ERR] 10.62.243.140:58602 - cid:34945 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:00:04.864256 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.864267 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.864277 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.864445 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.864471 [WRN] JetStream API limit exceeded: 4097 calls outstanding

...

[6] 2022/05/21 09:00:04.907772 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.907855 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.907891 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.907921 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.908052 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:04.908075 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:00:07.002144 [INF] 10.49.88.250:17068 - rid:34946 - Route connection created
[6] 2022/05/21 09:00:07.002213 [INF] 10.49.88.250:17068 - rid:34946 - Router connection closed: Client Closed
[6] 2022/05/21 09:00:07.055144 [INF] 10.128.216.45:51088 - rid:34947 - Route connection created
[6] 2022/05/21 09:00:07.055172 [INF] 10.128.216.45:51088 - rid:34947 - Router connection closed: Client Closed
[6] 2022/05/21 09:00:07.837287 [ERR] 10.63.124.26:18567 - cid:34948 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:00:11.092017 [ERR] 39.156.42.38:33114 - mid:34949 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:00:20.615856 [ERR] 10.62.243.140:41155 - mid:34951 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:00:32.100548 [ERR] 10.63.124.26:18657 - cid:34952 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:00:38.586031 [ERR] 10.62.242.216:5595 - mid:34953 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:00:39.079025 [ERR] 10.62.243.140:10621 - cid:34954 - TLS handshake error: tls: first record does not look like a TLS handshake

...

[6] 2022/05/21 09:02:28.849660 [ERR] 10.62.255.152:40287 - websocket handshake error: invalid value for header 'Upgrade'
[6] 2022/05/21 09:02:33.044023 [ERR] 10.63.124.26:19093 - cid:34979 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:02:40.880265 [ERR] 10.62.243.140:8970 - cid:34980 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:02:42.733671 [ERR] 10.62.242.216:40959 - mid:34981 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:02:45.937987 [ERR] 10.62.243.140:52914 - mid:34982 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:02:55.330441 [ERR] 10.62.242.216:47771 - mid:34984 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:02:55.552183 [ERR] 10.62.242.216:11787 - cid:34985 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:02:59.963650 [ERR] 10.62.243.140:26594 - cid:34986 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:03:09.283317 [ERR] 39.156.42.38:31164 - cid:34988 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:03:09.321025 [ERR] 10.63.124.26:19229 - cid:34989 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:03:20.227240 [ERR] 10.62.243.140:8881 - mid:34990 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:03:21.408135 [ERR] 10.63.124.26:19271 - cid:34991 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:03:37.502687 [ERR] 10.62.242.216:60205 - mid:34992 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:03:39.927987 [ERR] 10.62.243.140:42598 - cid:34993 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:03:43.110422 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > ZPzkHNrZ_FvGniUvpHrqgfI6SHTofjZ'
[6] 2022/05/21 09:03:43.111290 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > FZjYO4ix_q3jSeOpuzcJv5J1JfDX1nU'
[6] 2022/05/21 09:03:43.113339 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > ptkxnVJ1_q3jSeOpuzcJv5J1JfDX1ds'
[6] 2022/05/21 09:03:43.114816 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > O2oCRGy8_FvGniUvpHrqgfI6SHTofsR'
[6] 2022/05/21 09:03:43.115414 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > HtaELhmr_qxQ3p0FT424ARtoQSFdoDR'
[6] 2022/05/21 09:03:43.116090 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 0340vG2W_qxQ3p0FT424ARtoQSFdoVV'

...

[6] 2022/05/21 09:03:57.692167 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > nVLnc96B_HKMJnXHHpjaa4d426l7JUY'
[6] 2022/05/21 09:03:57.697815 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > fy5ugJGY_qxQ3p0FT424ARtoQSGdQvx'
[6] 2022/05/21 09:03:57.725742 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > MOKtlPI4_q3jSeOpuzcJv5J1JfDc17E'
[6] 2022/05/21 09:03:57.734226 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 706aSoQB_HKMJnXHHpjaa4d426l7Kdm'
[6] 2022/05/21 09:03:57.744042 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 5j0qN1pv_4LrfXZGuGGMxyrKvHY5yWZ'
[6] 2022/05/21 09:03:57.758333 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > uesTSxp6_4LrfXZGuGGMxyrKvHY5yk1'
[6] 2022/05/21 09:03:57.762482 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > H9FVrRMF_qxQ3p0FT424ARtoQSGdqV7'
[6] 2022/05/21 09:03:57.770484 [ERR] 10.63.124.26:19381 - cid:37881 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:03:57.777856 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > GkMq2ien_q3jSeOpuzcJv5J1JfDc2ya'
[6] 2022/05/21 09:03:57.784699 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 6FkyMLrK_HKMJnXHHpjaa4d426l7MDi'
[6] 2022/05/21 09:03:57.800728 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > GWFSSSoh_HKMJnXHHpjaa4d426l7MnK'
[6] 2022/05/21 09:03:57.815814 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 95CtTC97_FvGniUvpHrqgfI6SHTtJbx'
[6] 2022/05/21 09:03:57.818894 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > Qx8syOIH_4LrfXZGuGGMxyrKvHY5zSr'
[6] 2022/05/21 09:03:57.838610 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 9Ftc0zzh_qxQ3p0FT424ARtoQSGeIKH'
[6] 2022/05/21 09:03:57.843741 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > ZbJusFDD_FvGniUvpHrqgfI6SHTtKKJ'
[6] 2022/05/21 09:03:57.849225 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > gIyJujvi_FvGniUvpHrqgfI6SHTtKTB'
[6] 2022/05/21 09:03:57.853839 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > J6m7ryRW_qxQ3p0FT424ARtoQSGeOaf'
[6] 2022/05/21 09:03:57.856910 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > t01i0sQR_q3jSeOpuzcJv5J1JfDc59A'
[6] 2022/05/21 09:03:57.867220 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > uXCtfyAK_FvGniUvpHrqgfI6SHTtLKP'
[6] 2022/05/21 09:03:57.882815 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > 3XCsg0uC_4LrfXZGuGGMxyrKvHY60Td'
[6] 2022/05/21 09:03:57.894750 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > nSjbefko_HKMJnXHHpjaa4d426l7PNa'
[6] 2022/05/21 09:03:57.898461 [INF] JetStream cluster new consumer leader for 'ipes > $MQTT_msgs > eePSwkWJ_4LrfXZGuGGMxyrKvHY60h5'
[6] 2022/05/21 09:04:03.916223 [ERR] 124.64.198.122:3652 - cid:37922 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:19.023801 [ERR] 10.62.243.140:33067 - mid:37923 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:28.150179 [ERR] 10.62.242.216:12772 - mid:37925 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:28.830600 [ERR] 10.62.255.152:59312 - cid:37926 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:28.835800 [INF] 10.62.255.152:40622 - rid:37927 - Route connection created
[6] 2022/05/21 09:04:28.835829 [ERR] 10.62.255.152:40622 - rid:37927 - Router parser ERROR, state=0, i=0: proto='"GET / HTTP/1.1\r\nUser-Agent: curl"...'
[6] 2022/05/21 09:04:28.835853 [INF] 10.62.255.152:40622 - rid:37927 - Router connection closed: Protocol Violation
[6] 2022/05/21 09:04:29.017381 [ERR] 10.62.243.140:2849 - cid:37928 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:33.039036 [ERR] 10.62.243.140:47847 - mid:37929 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:36.554589 [ERR] 10.62.242.216:19272 - mid:37930 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:36.751296 [ERR] 10.62.242.216:47249 - cid:37931 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:40.372249 [ERR] 10.63.124.26:19537 - cid:37932 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:04:45.544457 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.544457 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.544502 [WRN] JetStream API limit exceeded: 4097 calls outstanding

...

[6] 2022/05/21 09:04:45.594061 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.594104 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.597105 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.597143 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:45.597241 [WRN] JetStream API limit exceeded: 4097 calls outstanding
[6] 2022/05/21 09:04:46.155419 [ERR] 10.20.50.29:19849 - mid:37933 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:04:50.009527 [ERR] 10.62.243.140:20107 - cid:37934 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:05:04.128067 [ERR] 124.64.198.122:43001 - mid:37936 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:05:10.667681 [ERR] 39.156.42.38:53280 - mid:37938 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:05:22.205436 [ERR] 10.62.243.140:1792 - mid:37939 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:05:28.897155 [ERR] 10.63.124.26:19713 - cid:37940 - TLS handshake error: remote error: tls: bad certificate
[6] 2022/05/21 09:05:30.334128 [ERR] 10.62.255.152:51659 - websocket handshake error: invalid value for header 'Upgrade'
[6] 2022/05/21 09:05:40.311147 [ERR] 10.62.243.140:35347 - cid:37941 - TLS handshake error: tls: first record does not look like a TLS handshake
[6] 2022/05/21 09:05:42.053662 [ERR] 10.62.242.216:31714 - mid:37942 - TLS handshake error: tls: first record does not look like a TLS handshake
ianzhang1988 commented 2 years ago

@kozlovic I tested the code at 7bdd6799b71d296b123d37209a41d566a1b49b52; I see no consumers left in the cluster using the same steps.

Then I started a benchmark sub with 5000 connections from another server (5000 x 2 connections), at which point the cluster seems to be at its limit; the number of sub msgs is lower than expected. After I quit the benchmark, I see consumers left in the cluster again.

And thanks for the quick fix.


Another thing: in the past week I have encountered, maybe 3 times, some consumers losing quorum, and the servers would quit after some time. I don't know how I got into this state. Restarting the servers would not fix it; only after I wiped all of /data/jetstream and restarted the cluster did it go back to normal.

So it's relatively easy to "break" the cluster with lots of connections. Any idea why? And is there a way I can mitigate this?

kozlovic commented 2 years ago

@ianzhang1988 We still have work to do on JetStream for sure (MQTT depends on it). Could you share your configuration file? I assumed that you were using consumer_replicas: 1, but maybe that is not the case? I don't think we should get "NO QUORUM" for R1 consumers?

ianzhang1988 commented 2 years ago

@kozlovic I was using consumer_replicas: 1. I think those consumers were not from the MQTT benchmark; as I remember, their names were shorter, without '_'. I was busy testing MQTT, so I just did a wipe when it happened; I don't have much information at this point. If it happens again, I will gather some data and post it here.


Is there documentation about how leaf nodes work for MQTT? I tried it, with every leaf node having its own domain, and I see there are MQTT streams in each leaf node.

I assume this is how leaf nodes work; correct me if I'm wrong:

  1. An MQTT client pubs a msg to one leaf node; it is translated into a NATS subject, then propagated in NATS Core through the leafnode port. ( If the leafnode connection is temporarily unavailable, is the msg lost for the other nodes? )
  2. All MQTT streams on the leaf nodes get this msg and write it to disk.
  3. An MQTT client subs from one leaf node and has a consumer created on that node, which passes the msg to the MQTT client.

It seems like a way to improve scalability ( for sub ). My intention is to create one cluster in the cloud (k8s) for all MQTT connections. Is it a good idea to deploy leaf nodes alongside the NATS cluster in the cloud?

kozlovic commented 2 years ago

Regarding your questions, it is a "yes" for all of them, including the fact that if a leafnode connection is temporarily down, then messages will indeed not reach the cluster.
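For reference, a rough sketch of such a topology in NATS config syntax (host names, ports, and domain names are placeholders; TLS, accounts, and clustering are omitted):

# Hub cluster member (placeholder values)
server_name: hub-1
jetstream {
  store_dir: "/data/jetstream"
  domain: hub
}
leafnodes {
  port: 7422
}

# Leaf node at the edge (placeholder values)
server_name: leaf-1
jetstream {
  store_dir: "/data/jetstream"
  domain: leaf
}
mqtt {
  port: 1883
}
leafnodes {
  remotes = [
    { url: "nats-leaf://hub.example.com:7422" }
  ]
}

With separate JetStream domains, each leaf node keeps its own $MQTT streams, which matches what was observed above.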

ianzhang1988 commented 2 years ago

@kozlovic Hi, my cluster "broke" again. I have gathered some information; hope it will help.

Setup

nats-server:2.8.4, 5 servers in a cluster, with mqtt configured as

mqtt {
  stream_replicas: 5
  consumer_replicas: 1
}

There are about 1200 clients using github.com/eclipse/paho.mqtt.golang v1.1.2 with clean session set to true. These clients are not from the benchmark tool; they are real edge devices. Less than 10 msgs/s.

Here is what happened

I found that clients did not get msgs.

I checked the server logs and saw some errors:

[6] 2022/06/07 07:59:21.381534 [WRN] 120.34.197.49:56701 - mid:543295 - "20e380b9bc77b2c39dd38147a2f47f462fbdeb42c7ee8a84_1654584177689" - Readloop processing time: 4.006598066s
[6] 2022/06/07 07:59:21.381547 [ERR] 120.34.197.49:56701 - mid:543295 - "20e380b9bc77b2c39dd38147a2f47f462fbdeb42c7ee8a84_1654584177689" - unable to connect: unable to delete consumer "tb4tNSMB_axlHjZL133OARG38N8BCck" for session "20e380b9bc77b2c39dd38147a2f47f462fbdeb42c7ee8a84_1654584177689": timeout for request type "CD" on "$JS.API.CONSUMER.DELETE.$MQTT_msgs.tb4tNSMB_axlHjZL133OARG38N8BCck" (reply="$MQTT.JSA.S1Nunr6R.CD.ifDnX4Jsd8fF2XwXcNfoO6")
[6] 2022/06/07 07:59:21.638836 [ERR] 220.160.206.226:17595 - mid:543296 - "20e380b9bc77b2c39dd38147a2a27f4642f96d923f7587f0_1654581816753" - Unable to add JetStream consumer for subscription on "20e380b9bc77b2c39dd38147a2a27f4642f96d923f7587f0.recv": err=timeout for request type "CC" on "$JS.API.CONSUMER.DURABLE.CREATE.$MQTT_msgs.iJUry67B_5EYVLP7pyRWIdJvZz61oCi" (reply="$MQTT.JSA.S1Nunr6R.CC.ifDnX4Jsd8fF2XwXcNfole")
[6] 2022/06/07 07:59:21.638878 [WRN] 220.160.206.226:17595 - mid:543296 - "20e380b9bc77b2c39dd38147a2a27f4642f96d923f7587f0_1654581816753" - Readloop processing time: 4.001076783s

consumer ls found two:

./nats consumer ls \$MQTT_msgs | grep tb4tNSMB_axlHjZL133OARG38N8BCck
        tb4tNSMB_axlHjZL133OARG38N8BCck
        tb4tNSMB_axlHjZL133OARG38N8BCck

consumer info:

./nats consumer info \$MQTT_msgs tb4tNSMB_axlHjZL133OARG38N8BCck
nats: error: could not select Stream: context deadline exceeded

server ls; it seems the cluster is not fully connected:

./nats  server ls
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                      Server Overview                                                       │
├────────┬────────────┬───────────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬──────┬────────────┬─────┤
│ Name   │ Cluster    │ IP        │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU   │ Slow │ Uptime     │ RTT │
├────────┼────────────┼───────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼──────┼────────────┼─────┤
│ nats-3 │ nats       │ 0.0.0.0   │ 2.8.4   │ yes │ 180   │ 4,802 │ 4      │ 0   │ 399 MiB │ 261.0 │ 2    │ 7d5h57m7s  │ 2ms │
│ nats-4 │ nats       │ 0.0.0.0   │ 2.8.4   │ yes │ 199   │ 4,802 │ 4      │ 0   │ 330 MiB │ 154.0 │ 0    │ 7d5h58m37s │ 2ms │
├────────┼────────────┼───────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼──────┼────────────┼─────┤
│        │ 1 Clusters │ 2 Servers │         │ 2   │ 379   │ 9,604 │        │     │ 730 MiB │       │ 2    │            │     │
╰────────┴────────────┴───────────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴──────┴────────────┴─────╯

╭────────────────────────────────────────────────────────────────────────────╮
│                              Cluster Overview                              │
├─────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats    │ 2          │ 0                 │ 0                 │ 379         │
├─────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│         │ 2          │ 0                 │ 0                 │ 379         │
╰─────────┴────────────┴───────────────────┴───────────────────┴─────────────╯

server report jetstream is not working:

./nats server report jetstream
nats: error: no results received, ensure the account used has system privileges and appropriate permissions, try --help

I tried to restart the cluster with kubectl rollout restart, but server 4 kept quitting after some time.

[6] 2022/06/07 08:58:40.455106 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[6] 2022/06/07 08:58:50.455004 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"
[6] 2022/06/07 09:00:08.409255 [INF] Entering lame duck mode, stop accepting new clients

Here are the logs from servers 0 and 4: server 0: nats-0.log; server 4 before restart: nats-4.log; server 4 after restart: nats-4-restart.log

By the way, is there a way to restore the cluster other than purging all JetStream data?

derekcollison commented 2 years ago

What are the file descriptor limits being set for the nats-server?

There are ways to re-stabilize the cluster, but they depend on why the cluster is shutting itself down and disabling things. It is likely running out of some resource. The common ones are memory, CPU, and disk, but we have seen occasions where the FD limit (ulimit -a, under open files) is set too low.

ianzhang1988 commented 2 years ago

CPU and memory usage are low.

# ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
memory(kbytes)       unlimited
locked memory(kbytes) 64
process              unlimited
nofiles              1048576
vmemory(kbytes)      unlimited
locks                unlimited
rtprio               0
derekcollison commented 2 years ago

The system, I think, had an event, and certain servers shut down JetStream to the point that the cluster lost quorum.

Would need to jump on a call with you to debug further.

Are you a paying customer at this point?

ianzhang1988 commented 2 years ago

No, not a paying customer.


Our cluster is in k8s, with ceph-rbd as the PV. Each server has 10 CPUs and 20 GB of memory, which seems like enough resources.

I purged all JetStream data and restarted the cluster. Now the cluster is working. CPU is lower than in the previous run (1000m in k8s top), about 10m now for each server, with about 1200 MQTT clients.

I added another 20k MQTT clients; the cluster still works ( CPU 100-200m ), but there are some strange numbers in nats server report jetstream:

╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                         JetStream Summary                                         │
├─────────┬─────────┬─────────┬───────────┬──────────┬────────┬────────┬────────┬─────────┬─────────┤
│ Server  │ Cluster │ Streams │ Consumers │ Messages │ Bytes  │ Memory │ File   │ API Req │ API Err │
├─────────┼─────────┼─────────┼───────────┼──────────┼────────┼────────┼────────┼─────────┼─────────┤
│ nats-1  │ nats    │ 3       │ 4,153     │ 20,832   │ 13 MiB │ 0 B    │ 13 MiB │ 11,987  │ 0       │
│ nats-4* │ nats    │ 3       │ 4,136     │ 20,832   │ 13 MiB │ 0 B    │ 13 MiB │ 12,370  │ 141     │
│ nats-3  │ nats    │ 3       │ 4,170     │ 20,832   │ 13 MiB │ 0 B    │ 13 MiB │ 406,768 │ 32,087  │
│ nats-2  │ nats    │ 3       │ 4,184     │ 20,832   │ 13 MiB │ 0 B    │ 13 MiB │ 12,128  │ 0       │
│ nats-0  │ nats    │ 3       │ 4,204     │ 20,832   │ 13 MiB │ 0 B    │ 13 MiB │ 12,261  │ 0       │
├─────────┼─────────┼─────────┼───────────┼──────────┼────────┼────────┼────────┼─────────┼─────────┤
│         │         │ 15      │ 20,847    │ 104,160  │ 66 MiB │ 0 B    │ 66 MiB │ 455,514 │ 32,228  │
╰─────────┴─────────┴─────────┴───────────┴──────────┴────────┴────────┴────────┴─────────┴─────────╯

╭───────────────────────────────────────────────────╮
│            RAFT Meta Group Information            │
├────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name   │ Leader │ Current │ Online │ Active │ Lag │
├────────┼────────┼─────────┼────────┼────────┼─────┤
│ nats-0 │        │ true    │ true   │ 0.04s  │ 0   │
│ nats-1 │        │ true    │ true   │ 0.04s  │ 0   │
│ nats-2 │        │ true    │ true   │ 0.04s  │ 0   │
│ nats-3 │        │ true    │ true   │ 0.04s  │ 0   │
│ nats-4 │ yes    │ true    │ true   │ 0.00s  │ 0   │
╰────────┴────────┴─────────┴────────┴────────┴─────╯

nats-3 has a lot of API Err, and its API Req count is higher than the others.

nats consumer ls \$MQTT_msgs shows lots of repeated consumer IDs.

Consumers for Stream $MQTT_msgs:

        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw
        008J1Juo_cG7WSURR5O2VoxtpNfE0Yw

nats consumer ls \$MQTT_msgs | wc -l shows there are 437,917 consumers, while only about 20k are actually there.


I'm not sure if our k8s and Ceph are causing this, or the MQTT connections.

Is there a recommended deployment or configuration for an MQTT broker setup with a million connections?

Any help is appreciated.

derekcollison commented 2 years ago

Ceph may be causing issues, could you re-run the test with direct attached storage? I also noticed the high number of errors; it would be good to understand what is triggering those.

We are continuing to improve NATS and its application in environments like yours, with both high numbers of connections and high numbers of consumers.

If you find value in the system, I would encourage you to look into a NATS Enterprise subscription which provides architectural reviews, performance reviews, better access to the team and support SLAs. This is how we continue to add value to the ecosystem as well.

ianzhang1988 commented 2 years ago

Ceph may be causing issues, could you re-run the test with direct attached storage?

I will try this later with k8s EmptyDir.

I also noticed the high number of errors; it would be good to understand what is triggering those.

Are these errors output to stdout/stderr? I did not see related errors in the log. Part of the log: nats-3-0608.log

NATS Enterprise subscription

Well, my company is not ready to commit to this, sadly.

Anyway, thanks for the suggestion.

derekcollison commented 2 years ago

The API errors are returned to the application that requested them and are tracked, but I do not think the server logs them directly, since that could cause operational issues by filling up / spamming the logs.

derekcollison commented 2 years ago

Let us know about the run without ceph, be good feedback.

We have a bunch of folks running similar systems (sans k8s and Ceph) with MQTT, large numbers of connections, and now consumers for QoS 1.

We will of course prioritize paying customers who are helping fund improvements to the ecosystem, but will circle back here with updates, time permitting.

ianzhang1988 commented 2 years ago

The API errors are returned to the application that requested them and are tracked, but I do not think the server logs them directly, since that could cause operational issues by filling up / spamming the logs.

That might be hard to catch.

Let us know about the run without ceph, be good feedback.

Will do.

derekcollison commented 2 years ago

The NATS cli has a traffic function that might be able to help here. /cc @ripienaar

ianzhang1988 commented 2 years ago

Using EmptyDir in k8s seems to behave the same as Ceph.

nats-server:2.8.4

./nats server report jetstream

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                            JetStream Summary                                             │
├─────────┬─────────┬─────────┬───────────┬───────────┬─────────┬────────┬─────────┬───────────┬───────────┤
│ Server  │ Cluster │ Streams │ Consumers │ Messages  │ Bytes   │ Memory │ File    │ API Req   │ API Err   │
├─────────┼─────────┼─────────┼───────────┼───────────┼─────────┼────────┼─────────┼───────────┼───────────┤
│ nats-2* │ nats    │ 3       │ 50,370    │ 238,744   │ 65 MiB  │ 0 B    │ 65 MiB  │ 1,294,471 │ 283,934   │
│ nats-4  │ nats    │ 3       │ 49,154    │ 238,744   │ 65 MiB  │ 0 B    │ 4.4 GiB │ 90,668    │ 2,120     │
│ nats-3  │ nats    │ 3       │ 50,004    │ 238,744   │ 65 MiB  │ 0 B    │ 1.7 GiB │ 143,696   │ 86,814    │
│ nats-0  │ nats    │ 3       │ 49,310    │ 238,744   │ 65 MiB  │ 0 B    │ 65 MiB  │ 57,460    │ 0         │
│ nats-1  │ nats    │ 3       │ 50,576    │ 238,744   │ 65 MiB  │ 0 B    │ 65 MiB  │ 1,501,363 │ 1,174,289 │
├─────────┼─────────┼─────────┼───────────┼───────────┼─────────┼────────┼─────────┼───────────┼───────────┤
│         │         │ 15      │ 249,414   │ 1,193,720 │ 324 MiB │ 0 B    │ 6.3 GiB │ 3,087,658 │ 1,547,157 │
╰─────────┴─────────┴─────────┴───────────┴───────────┴─────────┴────────┴─────────┴───────────┴───────────╯

╭───────────────────────────────────────────────────╮
│            RAFT Meta Group Information            │
├────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name   │ Leader │ Current │ Online │ Active │ Lag │
├────────┼────────┼─────────┼────────┼────────┼─────┤
│ nats-0 │        │ true    │ true   │ 0.10s  │ 0   │
│ nats-1 │        │ true    │ true   │ 0.10s  │ 0   │
│ nats-2 │ yes    │ true    │ true   │ 0.00s  │ 0   │
│ nats-3 │        │ true    │ true   │ 0.10s  │ 0   │
│ nats-4 │        │ true    │ true   │ 0.10s  │ 0   │
╰────────┴────────┴─────────┴────────┴────────┴─────╯

At 250k MQTT connections, each server uses about 17 GB of memory and 1 CPU. The memory usage is rather high; is this normal?

And when I did a rolling update in k8s, before nats-4 terminated I saw lots of lines printed like the one below:

[ERR] 39.165.228.26:42038 - mid:201907 - "20e380baec77b6969d84d247f4f47f454702e898897d3a0f_1655601250210" - unable to delete consumer "iFbQu1ko_2VC5dm51kKQq5sKc1xeijB" for session "20e380baec77b6969d84d247f4f47f454702e898897d3a0f_1655601250210": timeout for request type "CD" on "$JS.API.CONSUMER.DELETE.$MQTT_msgs.iFbQu1ko_2VC5dm51kKQq5sKc1xeijB" (reply="$MQTT.JSA.HuYMtjaW.CD.jLo6AIdktka5eZiWmznpZm")

Then nats-4 printed some:

Healthcheck failed: "JetStream is not current with the meta leader"

and exited. The rolling update got stuck at nats-4. I had to delete the StatefulSet and then redeploy nats.

Are 250k MQTT connections too much for JetStream? Or could this be too many operations at the same time, causing API timeouts and loss of "quorum" or "state"? Our 250k MQTT clients connect to the cluster at roughly the same time. During a rolling update, a server would need to delete 25k consumers in a short time, then rejoin the cluster with empty storage.

Any suggestion would be appreciated.

derekcollison commented 2 years ago

The 250k connections should be ok. It's the number of consumers being created; if they are all R>1, that will tax the system.

What does a nats consumer info show for one of the consumers?

ianzhang1988 commented 2 years ago

The consumer replicas are set to 1.

Information for Consumer $MQTT_msgs > YxAeGyH7_Pbdjs4f3abZxOwx6aJK31u created 2022-06-21T17:20:05+08:00

Configuration:

        Durable Name: YxAeGyH7_Pbdjs4f3abZxOwx6aJK31u
    Delivery Subject: $MQTT.sub.Pbdjs4f3abZxOwx6aJK2ym
      Filter Subject: $MQTT.msgs.74b080ece677b7909d838747a3f57f46220bd59873ae65.recv
      Deliver Policy: New
          Ack Policy: Explicit
            Ack Wait: 1m0s
       Replay Policy: Instant
     Max Ack Pending: 100
        Flow Control: false

Cluster Information:

                Name: nats
              Leader: nats-3

State:

   Last Delivered Message: Consumer sequence: 0 Stream sequence: 0
     Acknowledgment floor: Consumer sequence: 0 Stream sequence: 0
         Outstanding Acks: 0 out of maximum 100
     Redelivered Messages: 0
     Unprocessed Messages: 0
          Active Interest: No interest

Any suggestions on tools that can debug nats-server? I think my best option now is to look into the MQTT code and find what's going wrong.

derekcollison commented 2 years ago

We have added the ability to auto-clean-up orphaned MQTT consumers. This is in the nightly build and will be included in the next release, in the next week or two.

Would that be something that might be helpful here?

ianzhang1988 commented 2 years ago

I'm not sure. The number of consumers in nats server report jetstream and nats stream report is actually very close to the number of MQTT client connections. Only nats consumer ls (nats 0.32) shows lots more, with repeated consumer IDs. Maybe the consumers in the server are working normally.

Although during one rolling update, one nats server got stuck getting ready with Healthcheck failed: "JetStream consumer 'ipes > $MQTT_msgs > d0AwD8KT' is not current" ( nats consumer info d0AwD8KT says it does not exist ), so maybe auto-cleanup would help here.

For now the main problems are the following:

  1. Rolling updates get stuck on failing health checks. ( I saw consumer not current, JetStream not current, and something lost quorum. ) We cannot put NATS into production if there is a risk of a rolling update getting stuck.
  2. High memory usage: with 250k MQTT clients connecting to 5 nats-servers, each server uses more than 24Gi, causing the pod to exit. ( I tried 9 servers with 16Gi each and saw no improvement. )
  3. High API Err numbers shown in nats server report jetstream. ( Not sure if they cause anything to go wrong, but still. )

When using EmptyDir, on an update a nats-server gets a new, empty data dir for JetStream and loses all consumers from before. What does the server do at startup? Would one server sync data from the others in its RAFT group? Could this cause consumers to be "not current"?

But when I used ceph-rbd, there were chances that the rolling update would get stuck as well. I had to clean all data from the PV (JetStream data) to restart the cluster.

I am really confused right now. Any ideas?

derekcollison commented 2 years ago

I would not recommend having 250k R3 consumers. You should configure the server to make consumers R1. I would also not recommend using Ceph; you should use direct attached or network attached block storage like EBS.

Do you have a support contract with Synadia?

ianzhang1988 commented 2 years ago

consumers R1

my config:

mqtt {
  stream_replicas: 5
  consumer_replicas: 1
}

By R1 you mean replicas set to 1, right?

direct attached or network attached block storage

In our k8s, Ceph is used as network attached block storage through a k8s StorageClass, mounted into the pod as a PersistentVolume.

EmptyDir is local storage.

Do you have a support contract with Synadia?

No, I'm not a paying customer.

ianzhang1988 commented 2 years ago

Are there any suggestions on how I can debug these? I added pprof and gops to try to figure out the memory usage.
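For reference, a minimal sketch of wiring net/http/pprof into a Go process (the listen address is arbitrary; nats-server also has a built-in profiling port option, prof_port, which may be simpler):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port; heap usage can then be
	// inspected with: go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application ...
	select {}
}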

derekcollison commented 2 years ago

ok very good on R1 consumers.

We would need to schedule a Zoom call with you and a member of the team to triage. Many moving parts with K8S and storage subsystems and resource limits and networks, hard to do via GH issue in our experience.

ianzhang1988 commented 2 years ago

That would be great. The time difference might be tricky; I'm at UTC+8. Anyway, I will do some further research, hoping to find some clues.

And thanks for your patience.

GuanYin108 commented 2 years ago

Out of curiosity, is there any progress on that topic?