add ref counting for natsMsg (follow up of https://github.com/nats-io/nats.c/issues/793)

thierryba commented 2 months ago

Proposed change

we could use atomic_int (?) to add a ref counter. to natsMsg and avoid issues like https://github.com/nats-io/nats.c/issues/793

Use case

Today, my callback for nats messages is taking a reference on the natsMsg and calling natsMsg_Destroy when it is done doing its stuff in a specific thread. So all asynchronous.

Contribution

Sure, I have done a POC in https://github.com/thierryba/nats.c/tree/jetstream_refcount_crash (in particular https://github.com/nats-io/nats.c/commit/4e5a1fa99847917d1f4241d8df8c5a8ea2089a32).

No idea if it is ok to use atomic_int (C11). @kozlovic please comment.

kozlovic commented 2 months ago

That header is not supported on Windows it seems. Anyway, let me discuss this with some other team members and we may have a different approach for this specific case. In the meantime, to workaround this issue, you could always create a new message out of the one in the callback.

kozlovic commented 2 months ago

I had done the work of copying the message's reply into either a static buffer or if too big allocate/free so we don't need the message for the msg->reply field, but then the test JetStreamSubscribe failed because it creates a subscription with "auto-ack" (which is the default, that is not setting the manual ack option), but in the callback calls natsMsg_Nak(). The test then ensures that the library is not sending an ACK following that, which without having the reference to the message we can't know if the message has been ack'ed or nak'ed.

I think we will have to leave it as-is and have you (@thierryba ) make a copy of the message if you want it to survive the callback. But we should probably document this behavior somewhere... @levb what do you think?

thierryba commented 2 months ago

@kozlovic this was mainly about to try and help. If windows is the only issue, with a couple of ifdefs and using the right API (I did COM in my past...), I can fix this. Now if this is not a path you guys want to take, I won't spend more time on it. As for the workaround, I implemented one in my code so I am ok. Thanks.

levb commented 2 months ago

2/5 I think that natsMsg should be ref-counted if not now then eventually. 1/5 I'd like to port memory pools from mininats to main. A natsMsg then would have a pool of its own, with itself and all of its lifetime dependencies allocated in there (typically 1-2 mallocs for all of the message metadata/handling, plus the data). The pool itself would be ref-counted, natsMsg can just wrap its retain/release.

As to whether to use atomics - I am 0/5, benchmarking would tell? As @thierryba said, we can always ifdef for the platforms that don't support them.

kozlovic commented 2 months ago

As @thierryba said, we can always ifdef for the platforms that don't support them.

And do what instead? I mean you can't have thread-safety in one platform and not the other. So that would mean using locks. Even just atomic may be something that would need to be bench'ed. The natsMsg object has never been thread-safe, and I think it should stay that way - unless performance is no longer a concern.

thierryba commented 2 months ago

I suppose it is not mandatory to support c11 yet.. Windows I know the API but if you want to keep compatibility with more exotic platforms, that's harder. I used to work on Qt and they have had this for more than a decade so it is doable but not sure of the cost of this.

kozlovic commented 2 months ago

@thierryba @levb I think we missed the forest for the threes with this issue. The library blocks the destroy of the message because it needs to auto-ack the message and at the same time needs to know if the message was Nak'ed, etc... then it will destroy the message.

We don't need reference counting. In your case @thierryba, I think you just need to set the subscription with manual ack (jsSubOptions.ManualAck: http://nats-io.github.io/nats.c/structjs_sub_options.html#a21a905c49dfab61729d8674702d196b2), then you can do whatever you want with the message, ack'ing, Nak'ing, delaying its destroy. You are in control.

Please consider that and I think you will agree that the reference counting issue becomes moot.

thierryba commented 2 months ago

@kozlovic thanks for the answer. IIUC that manual ack would be also destroying the message. So that could be an option... but then you never know exactly when that ack will happen. I think I prefer the idea of copying subject and data around. Seems safer. But thanks a lot for the suggestion.

kozlovic commented 2 months ago

manual ack would be also destroying the message

@thierryba No, this is not what I meant. The "auto-ack" one is the one causing the problem. The issue, as I tried to described earlier, is that it needs to ack the message on behalf of the user, but it MUST not do it if the user has Nak'ed (negative ack) the message in the callback for instance. So the library disables the action of the natsMsg_Destroy() call that the user may/should have made in the callback so that it still has access to the message to check if the message has been (n)ack'ed by the user (we set an internal flag in the message). Then it destroys it, because it had no way of knowing that the user did not, and since the library had disabled destroy, it now has the responsibility to do it. Writing this, I realize that I could have use another internal message flag so that if natsMsg_Destroy() is called while the actual destroy is disabled (actually with the use of an internal flag), then the library would know that when the user callback returns and after unsetting the "do not destroy" flag, and if the "destroyed was called when disabled", it could then call natsMsg_Destroy(), otherwise it would not - leaving this to the responsibility of the user, or a memory leak would ensue.

But the approach of the subscription manual ack option is simple: the library simply invokes the callback and does nothing in term of message acknowledgment nor destruction of the message (like in any other NATS subscription callback, it is the responsibility of the user to destroy the given message). This is why I said that you could use that option and then ack the message in your callback, and for reasons that you did not really discuss, if the message needs to survive the callback lifetime, then call natsMsg_Destroy() later on.

nats-io / nats.c