pardahlman / RawRabbit

A modern .NET framework for communication over RabbitMq

Publishing suddenly stops and channel workload increases indefinitely #303

Open LDonato opened 6 years ago

LDonato commented 6 years ago

Good morning.

I'm developing an application that reads data from external sources (e.g. Modbus, OPC) and processes it through a sequence of calculation blocks. I'm using RabbitMQ and RawRabbit to implement a data-flow pattern that routes the information among the blocks. I use one instance of the bus client shared across the application, and all publishers and subscribers currently run in the same process. At the moment the data comes from a simulated source that just spits out random data.

Everything seems to work fine, but after a while (usually around 48 hours) the publishing suddenly stops and the application starts allocating massive amounts of memory.

The process just keeps running, until it stops due to an OutOfMemoryException.

The only clue I found about what's going on is the following line in RawRabbit logs:

Channel pool currently has 1 channels open and a total workload of n

Until the publishing stops n is zero. Then suddenly this number starts to increase by 20 at each publish attempt.

The server, likewise, seems unaware that anything is wrong.

I've been working with RawRabbit and RabbitMQ only for a very short time, so please pardon my ignorance; I really can't figure out what's going on under the hood at the moment.

This is how I publish data:

await this.databus.PublishAsync(
    dataPackage,
    ctx => ctx.UsePublishConfiguration(
            cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("datahub")))
        .UsePublishAcknowledge(use: false)
);

I removed the publish acknowledge early on because I had some exceptions being raised, which I found weird considering that the server is on localhost. By doing so I probably just swept something under the carpet ... I'll run another test with publish acknowledge enabled and see what happens.
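For reference, this is roughly what that next test will look like (just a sketch of my plan; the logger is the same Serilog logger I use elsewhere, and I'm catching broadly because I'm not yet sure of the exact exception type the acknowledge timeout raises):

// Next test: leave publish acknowledgements enabled (the default) and surface
// whatever exception comes back instead of swallowing it.
try
{
    await this.databus.PublishAsync(
        dataPackage,
        ctx => ctx.UsePublishConfiguration(
            cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("datahub"))));
    // note: no .UsePublishAcknowledge(use: false) this time
}
catch (Exception ex)
{
    // Log the acknowledge timeout (or whatever else surfaces) instead of hiding it.
    logger.Error(ex, "Publishing {MessageType} failed", dataPackage.GetType().Name);
    throw;
}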

Thanks for any help you can provide. Best regards Luca

pardahlman commented 6 years ago

Hello @LDonato - thanks for reaching out. Can you provide logs from when this happens? It would also be interesting to see the broker logs, especially if the broker noticed that the connection is lost at about the same time that you experience this.

I'm guessing this is related to #299 and #301 - where a bug in the channel pool caused multiple channels to be created if the connection to the broker is lost. Since you have had problems with publish acks, it is a pretty good guess that it is related.

LDonato commented 6 years ago

Thanks for your quick reply.

The log from the broker is not very informative; I'll see if I can start another run with the debug level enabled. The broker does not see any lost connection until I stop the application.

The incident happened at 01:33:00. That's the timestamp of the last data processed by the subscribers.

I felt the same about #299 and #301, and in fact I'm currently running a test with rc-3 to see if the problem is still there.

log-rawrabbit-20180126.zip rabbit@debian.log

Thanks Luca

pardahlman commented 6 years ago

Perfecto 👌,

As stated in #301, there are still some issues with the latest release. However, I added some more logging for when the connection is lost, so perhaps it will shed some light on this issue.

LDonato commented 6 years ago

It happened again with rc3 after more or less 24 hours... I can't see anything different from before in the logs, but I'm attaching them anyway (18:47:15).

The broker logs did not record anything at all. I took a screenshot of the RabbitMQ diagnostics page while the program was still running, and the client appeared to be connected ... schermata 2018-01-27 alle 19 30 48 schermata 2018-01-27 alle 19 30 29 log-rawrabbit-20180127.zip

Any suggestions on some tests I might run to gather more information?

Thanks! Luca

pardahlman commented 6 years ago

RawRabbit has been running in production for months without problems, so I'm wondering if there is something in your setup. It looks like you are running everything on your local machine? Perhaps you have some aggressive hibernation settings that affect the client and/or broker? I have noticed that I sometimes need to restart the RMQ broker on my dev machine after hibernation. Just an idea...!

LDonato commented 6 years ago

Thanks! I will definitely look into that!

drak25 commented 6 years ago

Just to let you know: I also had to remove publish acknowledgement for exactly the same reason. I was getting random exceptions saying that the acknowledgement was not received before the ack timeout.

LDonato commented 6 years ago

Hello @drak25! Thanks for your tip.

I disabled sleep and hibernation (the test server is running Debian). The system ran for 24+ hours without exhibiting the behavior from the original post, but as you anticipated, this time I had a crash due to a random PublishAcknowledgementException (I thought I had them all intercepted, but one slipped through ... I guess it has to do with multithreading).

Anyway, I am about to start a test with publish acknowledgements disabled (and therefore no acknowledgement exceptions). I will know in two or three days if the issue is gone!

pardahlman commented 6 years ago

@drak25 @LDonato: are the issues you're having with publish acks deterministically reproducible? If so, I'd be very interested in step-by-step instructions in a separate issue.

LDonato commented 6 years ago

Thanks for your interest ... honestly, I wouldn't know how to reproduce the problem. I'm running a test on a demo configuration that performs the same operations over and over, and the exception (or the channel workload issue) seemingly happens after a random amount of time.

I may have a new piece of information, though ... it could be that the channel workload issue appears only when I disable publish acknowledgements, but I need to run more tests to be sure.

As you said, the test setup probably has something to do with this. The only thing I can think of is sending you a VM with my whole project (I meant to dockerize everything, but I wanted to get to the bottom of this problem first) ... let me know if you'd be interested.

In the meantime, would you mind sharing some thoughts about the environment? Which versions of OS (I must use Linux)/RabbitMQ/Erlang would you recommend?

Thanks Luca

pardahlman commented 6 years ago

Let's hold off on the VM for a while; @drak25 is making some interesting discoveries about the channel pool in #301 - perhaps that will mitigate your problem, too. As for the OS etc., I think it makes sense to look in the RabbitMQ documentation for this.

LDonato commented 6 years ago

Hi! A little update ... with a fresh, clean installation and RabbitMQ running in a Docker container, the problem is unfortunately still there. So I suppose I can rule out a configuration issue.

Since the system can detect the channel workload increasing, I would be curious to try scaling up. I'd like to see whether a newly spawned channel would resume working correctly, and what would happen to the previously created channel: would it be destroyed, or would the system keep spawning channels until it reaches the maximum allowed number?

Unfortunately I can't figure out how to enable this feature... how do I set DesiredAverageWorkload and the like? Is it enough to assign the value to AutoScalingOptions.Default before creating the client?
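To make the question concrete, this is the kind of thing I had in mind (just a sketch of my guess; I have no idea whether the static defaults are actually read when the channel pool is created, which is exactly what I'm asking):

// My guess: tweak the default auto-scaling options before creating the client,
// hoping the channel pool picks them up. Completely unverified.
AutoScalingOptions.Default.DesiredAverageWorkload = 10; // value chosen arbitrarily

IBusClient client = RawRabbit.Instantiation.RawRabbitFactory.CreateSingleton();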

Thanks again Luca

LDonato commented 6 years ago

I created a model of what goes on in my application in order to replicate the issue. It's not deterministic but at least the problem manifests in minutes, not hours.

RawRabbitTest.zip

LDonato commented 6 years ago

Playing around with the project from the post above, I'm starting to think the issue is related to message forwarding, i.e. publishing from within the scope of a subscriber.

I looked deeper into the docs and modified the code to this:

IBusClient client = RawRabbit.Instantiation.RawRabbitFactory.CreateSingleton(
    options: new RawRabbit.Instantiation.RawRabbitOptions()
    {
        Plugins = p => p.UseContextForwarding()
    }
);

client.SubscribeAsync<List<Message>>(
    async (message) =>
    {
        logger.Information("received data {0}.Forwarding ...", message[0].cnt);
        await client.PublishAsync(message,
            ctx => ctx.UsePublishConfiguration(
                    cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("queue2")))
                .UsePublishAcknowledge(false)
        );
    },
    ctx => ctx.UseSubscribeConfiguration(
            cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("queue1")))
        .UseMessageContext(c => c.GetDeliveryEventArgs())
);

Is this the proper way to forward messages?

pardahlman commented 6 years ago

Kind of...

The UseMessageContext you use tells the middleware that invokes the message handler to try to invoke it with BasicEventArgs; see this test for reference.

Context forwarding refers to message context forwarding.

I'm a bit uncertain about the naming of the exchanges (queue1 and queue2 sound like queue names to me).

Forwarding a message from within a message handler is supported without activating any plugins or tweaks, so you could try removing the plugin and UseMessageContext if you want to minimize complexity.
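Something like this is what I have in mind - the same forward as in your sample, but with the plugin and UseMessageContext removed (a rough sketch, not tested against your project):

// Sketch of the simplified forwarder: default instantiation, no UseContextForwarding
// plugin and no UseMessageContext - just publish from within the message handler.
IBusClient client = RawRabbit.Instantiation.RawRabbitFactory.CreateSingleton();

client.SubscribeAsync<List<Message>>(
    async (message) =>
    {
        logger.Information("received data {0}. Forwarding ...", message[0].cnt);
        await client.PublishAsync(message,
            ctx => ctx.UsePublishConfiguration(
                cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("queue2"))));
    },
    ctx => ctx.UseSubscribeConfiguration(
        cfg => cfg.OnDeclaredExchange(exchange => exchange.WithName("queue1"))));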

LDonato commented 6 years ago

Thanks for your reply ... queue1 and queue2 are just names I used in that toy project to have different exchanges and to avoid a loop with a subscriber forwarding data to itself.

If I removed the plugin and the UseMessageContext I'd be back to square one. :(

I thought it would matter because I noticed that in 1.x you'd always do something like this when publishing from within a consumer's scope: firstSubscriber.PublishAsync(new SimpleMessage(), i.GlobalRequestId);

LDonato commented 6 years ago

I've been running some tests on the sample project I shared before, and when forwarding messages the plugin looks like a game changer. The VM with the plugin has been running non-stop for 5 hours now, while every test without context forwarding breaks in less than ten minutes.

pardahlman commented 6 years ago

You are talking about UseContextForwarding, right? That is very interesting!

LDonato commented 6 years ago

Yes, I'm talking about UseContextForwarding. Update: after 7 hours it stopped working as well.

drak25 commented 6 years ago

I started to investigate the ack timeout issue.

This one is a bit harder to catch.

As of now, I know that sometimes something strange is happening in the middlewares. The severity of the issue depends on how many threads are publishing messages using a single client - of course, more threads means more timeouts.

When the ack timeout occurred, I checked what the threads in the app were doing. There are some threads that seem to be bouncing between StageMarkerMiddleware and CancellationMiddleware. Stack of an example thread:

[Resuming Async Method]
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.ExchangeDeclareMiddleware.InvokeAsync Line 48
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.BodySerializationMiddleware.InvokeAsync Line 32
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.BasicPropertiesMiddleware.InvokeAsync Line 45
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.PooledChannelMiddleware.InvokeAsync Line 32
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.Operations.Publish.dll!RawRabbit.Operations.Publish.Middleware.ReturnCallbackMiddleware.InvokeAsync Line 56
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.Operations.Publish.dll!RawRabbit.Operations.Publish.Middleware.PublishAcknowledgeMiddleware.InvokeAsync Line 66
RawRabbit.Operations.Publish.dll!RawRabbit.Operations.Publish.Middleware.PublishAcknowledgeMiddleware.SetupTimeout.AnonymousMethod0(object state) Line 160
[Async Call]
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.Pipe.Middleware.CancellationMiddleware.InvokeAsync Line 12
RawRabbit.dll!RawRabbit.Pipe.Middleware.StageMarkerMiddleware.InvokeAsync Line 33
RawRabbit.dll!RawRabbit.BusClient.InvokeAsync(System.Action pipeCfg, System.Action contextCfg, System.Threading.CancellationToken token) Line 26
Listonic.Messaging.V2.Common.dll!Listonic.Messaging.V2.Common.Sender.Process.AnonymousMethod2_0(Listonic.Messaging.V2.Interfaces.MessageAsync message) Line 69

drak25 commented 6 years ago

@pardahlman You mentioned that you are using RawRabbit in production. Are you using version 2.0 or the previous, more stable version? I am asking because I am observing the same behaviour as LDonato. As a test I ran a producer that was feeding the queue constantly, and consumers that processed messages from the queue in real time. After some time (I don't know exactly how long) the producers stopped pushing messages to the queue. There were tens of threads waiting for something.

I need to push the code to production soon, and I'm starting to think that it would be better to use version 1.0 of the library.

LDonato commented 6 years ago

@drak25 any updates from your side? I have tried many tweaks but I can't get publishing to stay stable.

drak25 commented 6 years ago

No updates; I am not currently investigating this.

LDonato commented 6 years ago

Thanks for your reply. May I ask whether you switched to 1.x as a workaround or set the issue aside altogether?

drak25 commented 6 years ago

No, I do not use version 1.x. For now I have set it aside.

LDonato commented 6 years ago

Hi @pardahlman, sorry to bother you! In order to better plan how to proceed with my project, I would like to ask whether you plan to address this issue in the future. After putting my best effort into this for the last 20 days, it is clear that I lack the skills to understand what's going on, at least without a nudge or two in the right direction from someone more experienced :) Thanks. Luca

pardahlman commented 6 years ago

Hi @LDonato - thanks for reaching out. This is definitely something that I (or someone else) will look at. My time is limited, and I'm currently swamped with other work, so I can't give any estimate, though I plan to look through the PRs this weekend and possibly release a new RC.

LDonato commented 6 years ago

Thank you very much for your reply!

pardahlman commented 6 years ago

@LDonato @drak25 quick update! #315 addresses the publish acknowledgement issue you described, which occurred when publishing multiple messages concurrently.

pardahlman commented 6 years ago

Hello again @LDonato and @drak25 - RC4 of RawRabbit has been released. It addresses some, but perhaps not all, of the problems discussed in this issue. In order to troubleshoot more efficiently, I'm closing this - I'd like to hear from you in a new issue if you still experience problems with the latest version. Thanks!

LDonato commented 6 years ago

@pardahlman thank you very much for your efforts! I'll set up a run with rc4 asap.

LDonato commented 6 years ago

@pardahlman unfortunately the same thing happened with rc4. I don't have any logs because by the time it crashed Serilog had stopped logging (the file was > 1.1 GB), but the behavior seemed consistent with what used to happen with previous releases. I don't know if it helps, but I've been trying out previous releases and in my case beta7 appeared to be stable: a run with beta7 lasted more than one week and then crashed due to some connection problems with the broker that I think you've already fixed.

Shall I open a new issue?

Thanks Luca

pardahlman commented 6 years ago

OK, that's too bad. I wonder if it is related to #313... are you doing RPC calls? I'll reopen this now.

LDonato commented 6 years ago

Hi! No, I'm only publishing data and subscribing to queues. I uploaded a model of what I do a few comments back, if it can help.

I don't think it's related to #313... though I read an issue where they claimed that after upgrading from beta7 to rc2 an application that had been running for a month started to exhibit a behavior similar to mine, but unfortunately I can't find it anymore.

UPDATE: I'm talking about #288

pardahlman commented 6 years ago

Top of the morning! Trying to summarize this issue. Some of the comments relate to publish acks, which have been addressed in a separate issue.

@drak25, you said:

As a test I ran a producer that was feeding the queue constantly, and consumers that processed messages from the queue in real time. After some time (I don't know exactly how long) the producers stopped pushing messages to the queue. There were tens of threads waiting for something.

A few follow-up questions:

@LDonato

Thanks for your help!

pardahlman commented 6 years ago

I've tried to reproduce this behavior by publishing ~200 messages/second and subscribing to them in a different application. After 4 hours, the applications are still running as expected. This makes me believe that there is something else happening in your system - any ideas?

image

LDonato commented 6 years ago

@pardahlman thank you for your time. In my case the broker is on the same machine as the publishers and subscribers, and the publishers and subscribers are in the same application, so I think it is not exactly the same configuration as your test. Also, I have a situation where I forward messages by publishing from within the scope of a subscriber's callback. I don't know if it makes any difference, but maybe it would be worth including this in your test. You can find a project that reproduces exactly what I am doing a few comments back. On my machine that project usually stops working after a few minutes, although it stayed alive for an hour or so on one occasion.

To answer your other questions: I believe that the subscribers are still working, but the publishers stop writing to the queue. Since everything is in the same application, I would not know how to restart the publisher alone. In the real situation I publish very few messages, about 2-4 per second, and it crashes on average after 48 hours. I do use await on the publish calls.

I hope I answered everything, let me know if I can give you any more information.

Thanks

https://github.com/pardahlman/RawRabbit/files/1680583/RawRabbitTest.zip

pardahlman commented 6 years ago

Thank you for the sample, @LDonato - I've been looking into it a bit closer as well as running it. I'm not 100% sure what the intended message flow is, as the topology is a bit unconventional (messages are published to different exchanges, but routed to the same queue that has two consumers... so messages "forwarded" from the first consumer may end up at that consumer again).

Anyway, when running the sample it became clear to me that the number of received-but-unacked messages was growing linearly:

image

This means that the client will hold more and more messages in memory, which will eventually break at some point 😉

I modified your sample to use a prefetch count of 100 on both subscribers:

client.SubscribeAsync<List<Message>>(
    async (message) => {
        logger.Information("received forwarded data {0}. ...", message[0].cnt);
    },
    ctx => ctx.UseSubscribeConfiguration(cfg => cfg
        .Consume(c => c.WithPrefetchCount(100)) // <-- updated configuration
        .OnDeclaredExchange(exchange => exchange.WithName("queue2"))
    )
);

This limits the number of messages delivered to the consumer and should give you a more stable behavior. I've been running the (slightly updated) sample for 30 minutes without any problems. It would be interesting to see if you get a similar result.

LDonato commented 6 years ago

Thank you for taking the time to look at the sample. In the real situation the publishing rate is far slower, but it's still great food for thought! I'll try this prefetch count in the real application and see what happens.

Thanks again!

LDonato commented 6 years ago

@pardahlman I tried adding the prefetch count to the actual project, but nothing changed; it stopped publishing after 50 hours or so, as usual. The channel workload started increasing, and so did the memory usage. No exception was raised.

untitled

pardahlman commented 6 years ago

Thanks for the update, @LDonato - the fact that it takes up to 50 hours or so before this occurs makes it difficult to reproduce. I'm wondering if it would be possible to:

In this project, do you publish and subscribe in the same application? If yes, would it be possible to split up the application...? I'm not sure if it follows the same message pattern as the sample app, but if it does, perhaps:

Looking at the screenshot, it would be interesting to see what the Connections and Channels tabs look like. I'm not sure if you're able to run a profiling tool like dotMemory to get some insight into the running process - that kind of information would be valuable, especially since you notice an increase in memory allocation.

LDonato commented 6 years ago

Thanks for following up. I understand it is not easy to investigate, so I appreciate it very much. Here are the connections and channels from a previous run: image image

I will try to do as you suggested ... splitting the application sounds good in many ways but doesn't feel right in others ... I will have to rethink the architecture a bit, but in the end it's definitely worth trying.

In the meantime I'll set up a machine with dotMemory, and maybe I'll try to produce a sample that replicates the problem faster without running into the growing unacked messages. Hopefully I'll get back with the results in a few days.

Can I ask why you think that splitting the application would help?

pardahlman commented 6 years ago

For investigation purposes it might be helpful to verify/isolate the different mechanisms in play. For example, it would be helpful to know whether it is enough to restart the "forwarding" service or the "initial publish" service in order to get everything back up and running.

As a first step, however, a dotMemory report could give some insight... especially as you are seeing high memory allocation.

Thanks!

LDonato commented 6 years ago

Hi! Here you can find an export from dotMemory. I took the first three snapshots manually, one per hour, to show that memory usage is stable to begin with. Then I let the program run overnight with a trigger to take a snapshot whenever memory increased by 100 MB, and this is the result.

Download dotMemory export

I hope this can be helpful to investigate the matter.

Next I'll split the application as you suggested.

pardahlman commented 6 years ago

Thanks for the dotMemory files! This helped me isolate the problem to something related to the channel pool: for some reason the ChannelRequestQueue grows (i.e. more and more "things" want to use a channel), but requests don't seem to be served. The way I read the code, it is either because the recovery event has not been triggered for the underlying channels, or because there is some execution path where the "work lock" is not released.

I've added more logging as well as a try/catch around the relevant code blocks. Hopefully this will solve the problem, or at least give more insight into it.

LDonato commented 6 years ago

Thank you very much for the follow up! I'll start testing immediately!

LDonato commented 6 years ago

@pardahlman I thought I'd share the first results ... The publishing hasn't stopped yet, but I can see from the logs that the work lock thing is quite a common occurrence. Hopefully that'll be useful.

Thank you Luca

log-rawrabbit-20180312.zip

pardahlman commented 6 years ago

Thank you for this. I will look closer at the logs and correlate them with the source code. Looking forward to hearing from you when/if the problem occurs again.

LDonato commented 6 years ago

Hi! Great job - the system is more stable now, but unfortunately it still stopped working in the end.

I'm sending you the logs; the last message successfully processed was timestamped 20:58:32.

logs.zip

Let me know if there's anything I can do to provide more information.

Regards Luca

LDonato commented 6 years ago

Dear @pardahlman, I collected dotMemory snapshots with rc5, hoping they'll contain some useful clues.

Download dotMemory export

Thanks Luca