nats-io / nats.c

A C client for NATS
Apache License 2.0
387 stars 134 forks source link

Client connection issue with large messages #287

Closed arun11299 closed 4 years ago

arun11299 commented 4 years ago

Hi, C client version: 1.8.0 I am using the async mode by hooking it up with my event loop. It has been working pretty well till now when the message sizes were less. We now have a need to support few infrequent queries for which we have to transfer large record size...say around 21 MB for the case which its failing.

Also, I have set max message size on NATS server as 64MB.

Using "natsConnection_PublishRequest" to send this data and do not see any error reported by it. I am assuming the data would be sent asynchronously, maybe thats why.

The issue seen is:

  1. Client times out waiting for the data. Client timeout is 60 seconds.

  2. Nats server trace log doesn't show much activity during the send time.

    [24753] 2019/11/23 16:42:44.801258 [TRC] 127.0.0.1:50970 - cid:29 - <<- [PUB cnctbWdtdGFndC1kdHM6bWdtdC1hZ2VudC1kdHM= _INBOX.Ia2XyfgeYvhvN2rq341m3H.Ia2XyfgeYvhvl5rq341m3H 427]
    [24753] 2019/11/23 16:42:44.801325 [TRC] 127.0.0.1:50970 - cid:29 - <<- MSG_PAYLOAD: ["\xf2á\xa0\a\v\x8a։\xce\n\x05admin\x9a\xf3\xfb\xa5\a I,/rw-mgmtagt-dts:mgmt-agent-dts\xfa\x93\xb7\x98\r\xed\x02\xb2\xe9\xfd\xbc\a\xe6\x02\xba\xc5\xd1\xc9\x01OD,/rw-project:project[rw-project:name='default']/nsr:ns-instance-opdata/nsr:nsr\xa8\x9e\xfb\xce\x05\xb3\xaf\xed\xdc\xe0\xb7\xc3\xccn\x82\xffÙ\r\xf6\x01<data><rw-project:project xmlns:rw-project=\"http://riftio.com/ns/riftware-1.0/rw-project\"><name>default</name><nsr:ns-instance-opdata xmlns:nsr=\"urn:ietf:params:xml:ns:yang:nfvo:nsr\"><nsr:nsr/></nsr:ns-instance-opdata></rw-project:project></data>\xc0\xd1\u0378\x0e\x03"]
    [24753] 2019/11/23 16:42:44.801343 [TRC] ::1:52342 - cid:14 - ->> [MSG cnctbWdtdGFndC1kdHM6bWdtdC1hZ2VudC1kdHM= 2 _INBOX.Ia2XyfgeYvhvN2rq341m3H.Ia2XyfgeYvhvl5rq341m3H 427]
    [24753] 2019/11/23 16:42:48.041414 [TRC] 127.0.0.1:50730 - cid:3 - <<- [PUB cnctb3BlbmlkYy1wcm92aWRlcjpvcGVuaWRjLXByb3ZpZGVyLXN0YXRl.cnctb3BlbmlkYy1wcm92aWRlcjpzdGF0aXN0aWNz.> RW.DummyReply 87]
    [24753] 2019/11/23 16:42:48.041446 [TRC] 127.0.0.1:50730 - cid:3 - <<- MSG_PAYLOAD: ["\xf0\xf1\xb7H\x00\xd2Ӌ\xd5\fLD,/rw-openidc-provider:openidc-provider-state/rw-openidc-provider:statistics"]
    [24753] 2019/11/23 16:42:51.585139 [DBG] 127.0.0.1:50940 - cid:27 - Client Ping Timer
    [24753] 2019/11/23 16:42:51.585184 [TRC] 127.0.0.1:50940 - cid:27 - ->> [PING]
    [24753] 2019/11/23 16:42:51.587952 [TRC] 127.0.0.1:50940 - cid:27 - <<- [PONG]
    [24753] 2019/11/23 16:42:51.594392 [TRC] 127.0.0.1:50940 - cid:27 - <<- [PING]
    [24753] 2019/11/23 16:42:51.594409 [TRC] 127.0.0.1:50940 - cid:27 - ->> [PONG]
    [24753] 2019/11/23 16:42:51.618641 [TRC] ::1:52342 - cid:14 - <<- [PUB _INBOX.Ia2XyfgeYvhvN2rq341m3H.Ia2XyfgeYvhvl5rq341m3H 21899031]
    [24753] 2019/11/23 16:42:55.518576 [DBG] 127.0.0.1:50954 - cid:28 - Client Ping Timer
    [24753] 2019/11/23 16:42:55.518631 [TRC] 127.0.0.1:50954 - cid:28 - ->> [PING]
    [24753] 2019/11/23 16:42:55.519897 [TRC] 127.0.0.1:50954 - cid:28 - <<- [PONG]
    [24753] 2019/11/23 16:42:55.549572 [TRC] 127.0.0.1:50954 - cid:28 - <<- [PING]

    cid 14 is our client connection. See that the payload is not sent, just the header.

  3. After about 5 minutes, I see that the NATS client connection is disconnected and reconnected.

    [24753] 2019/11/23 16:48:28.464233 [DBG] ::1:52342 - cid:14 - Client connection closed
    [24753] 2019/11/23 16:48:28.464920 [DBG] ::1:52758 - cid:30 - Client connection created
    [24753] 2019/11/23 16:48:28.465216 [TRC] ::1:52346 - cid:13 - <<- [PING]
    [24753] 2019/11/23 16:48:28.465235 [TRC] ::1:52346 - cid:13 - ->> [PONG]
    [24753] 2019/11/23 16:48:28.470177 [DBG] ::1:52346 - cid:13 - Client Ping Timer
    [24753] 2019/11/23 16:48:28.470210 [TRC] ::1:52346 - cid:13 - ->> [PING]
    [24753] 2019/11/23 16:48:28.471261 [TRC] ::1:52346 - cid:13 - <<- [PONG]
    [24753] 2019/11/23 16:48:28.476708 [TRC] ::1:52758 - cid:30 - <<- [CONNECT {"verbose":false,"pedantic":false,"tls_required":false,"name":"","lang":"C","version":"1.8.0","protocol":1,"echo":true}]
    [24753] 2019/11/23 16:48:28.476774 [TRC] ::1:52758 - cid:30 - <<- [PING]
    [24753] 2019/11/23 16:48:28.476784 [TRC] ::1:52758 - cid:30 - ->> [PONG]
    [24753] 2019/11/23 16:48:28.490805 [TRC] ::1:52758 - cid:30 - <<- [SUB cnctbWdtdGFndC1kdHM6bWdtdC1hZ2VudC1kdHM=  2]
    [24753] 2019/11/23 16:48:28.490867 [TRC] ::1:52758 - cid:30 - <<- [SUB cnctbmF0cy1tc2c6bm90aWZpY2F0aW9uLW1zZw==.>  3]
    [24753] 2019/11/23 16:48:28.490894 [TRC] ::1:52758 - cid:30 - ->> [-ERR Unknown Protocol Operation]
    [24753] 2019/11/23 16:48:28.490966 [ERR] ::1:52758 - cid:30 - Client parser ERROR, state=0, i=100: proto='" 'f99c77c2-0cfa-11ea-ac55-02420a"...'
    [24753] 2019/11/23 16:48:28.490977 [DBG] ::1:52758 - cid:30 - Client connection closed

See that the old cid 14 is closed and new one, cid 30 is opened, but fails with "parser ERROR". The proto field has the data which we were trying to send on the old connection.

Please help in resolving this issue.

Thanks in advance.

arun11299 commented 4 years ago

Seems bit similar to whats described in https://github.com/nats-io/nats.c/issues/275. But thats only the second part. Why would the data not be sent and result in client disconnection ?

arun11299 commented 4 years ago

Also seeing the same issue when the message size is around 3MB as well

kozlovic commented 4 years ago

@arun11299 I have tried with the v1.8.0 library and latest server and I can't reproduce the issue. Note that having -DV for the server is a bad idea when you will get it to work because the server printing a 20+ MB payload is bad news for performance ;-)

It could be the server version (if you let me know which one you used I can try), but I suspect that it is more on how you are running your application. As you know, publishing a message in NATS C client when using an event loop is merely copying data into a buffer and adding an event to the event loop to let it know that it should start polling for write events. Then, the event loop as to be "running" for data to be actually written to the socket.

I would recommend that you read this if not already done. Also, hopefully you have seen that libuv is not thread-safe so you will have to call this in order to let the NATS C Client library know if it should schedule some events to the event loop thread or invoke some libuv calls directly.

I took the examples/libuv-sub.c code and added this in the onMsg() callback (before this line):

    {
        const char *reply = NULL;

        if ((reply = natsMsg_GetReply(msg)) != NULL)
        {
            int         dataLen = 21*1024*1024;
            char        *data   = calloc(1, dataLen);
            natsStatus  s;

            s = natsConnection_Publish(nc, reply, data, dataLen);
            if (s != NATS_OK)
                printf("@@IK: Error on publish: %d - %s\n", s, natsStatus_GetText(s));

            free(data);
        }
    }

then, using a requestor, I sent a request and everything worked fine. Again, if running the server with -DV, the server starts to print the payload which is very big and causes the requestor to timeout (the examples use a low timeout). By the way, the more recent NATS Server has an option to limit the size of the payload printed when running with trace level (max_traced_msg_len).

arun11299 commented 4 years ago

Hi @kozlovic Thanks for the explanation. Really appreciate it. We are using CF runloop as the the backing event loop. But your explanation certainly gave more idea on debugging this issue. It appears that we have to enable the option to automatically reenable writes on socket which otherwise was not happening in case of partial writes of the data.

Doing that has fixed our issue. Thanks for all your efforts!!