Occasionally locked at 100% CPU usage

joelittlejohn commented 8 years ago

First, I'm going to apologise because this issue report is going to be kinda woolly :smile_cat:

We're using fink-nottle with SQS. We're finding that every now and then (say, once every couple of weeks), one of our instances seems to get stuck at 100% CPU usage and needs a restart. I'm using the hprof agent to try and take some CPU samples and see what's stuck looping here, and early indications are that this may be a problem in eulalie.

I captured some CPU samples over a short period when the problem was ongoing, here are some snapshots:

https://www.refheap.com/112686

https://www.refheap.com/112687

https://www.refheap.com/112685

I see a lot of:

eulalie issue-request
eulalie handle-result
XML parsing (related to handle-result?)
camel-snake-kebab (related to handle-result?)

I can't easily see what SQS communication is going on here because we're currently connecting via https. I'll dump some information when I can capture this - I'll need to switch to the http endpoint then wait for this to happen again :smile:

moea commented 8 years ago

Thanks for the information. Here's a couple of places where bugs could cause disproportionate numbers of requests:

Eulalie

Requests being repeatedly rejected due to signing problems caused by time disagreements.

If the client and AWS disagree excessively over the UTC timestamp, the signature is considered invalid, and the client asked to recalculate it in the light of an offset provided by the remote service. This is handled, however these states are assumed to be transient - a previous version of issue-request! interacted with an atom containing the time offset, but it didn't seem like the best move. So a consistently incorrect clock could cause a consistently elevated number of requests, up to the retry limit. The request maps can contain :max-retries and :time-offset keys to modulate this behaviour.

The number of times a request can be retried in the event of non-fatal failures is service-specific, and is set to 3 for SQS, as per the AWS SDK. If this were the bug, I'm not sure it would be manifesting as dramatically.

Some kind of pathological condition around the refreshing of instance credentials, if that's a feature you're using.

When a request is made using a credentials map which contain expired credentials , it's deferred pending a refresh. There is no retry for the credentials fetch, so that probably couldn't go full Turing by itself , and errors with the credential fetches should bubble up - but it's relatively complex behaviour, and there is scope for error.

Fink-Nottle

sqs.channeled.receive! - if the remote service is immediately & consistently returning errors, and receive! doesn't have to park on its puts, then the loop will thrash against SQS as quickly as you can take a single value from its channel. Would you notice errors coming back from receive!?

A couple of questions:

What version of Eulalie is being pulled into your project?
Are you only using Fink Nottle?
I'd be great to get any information of the API methods you're using, and whether you're using <? or something similar, to blow the stack on error - or if there's a possibility errors are being dropped on the floor.

The multimethod stuff & XML parsing is dealing with the responses from SQS, yeah. Do you have any intuition about whether these CPU spikes correspond to increased throughput, or your usage is fairly constant, and this seems like nonsensical activity?

Would it help if I pushed a snapshot with tools.logging integration? Something like that is long overdue.

joelittlejohn commented 8 years ago

Thanks a lot for going through your thoughts in so much detail! We really need a trace that shows the comms between our instance and SQS don't we :smile:

So a bit more info:

We're using fink-nottle 0.4.3 and eulalie 0.6.1
We are only using fink-nottle to communicate with SQS, yes (not sure sure I understood that question)
A quick summary of the API methods we're using:
- When our app starts, we're using fink-nottle.sqs.create-queue!! and fink-nottle.sqs.set-queue-attribute!
- To put and take messages we're using fink-nottle.sqs.channeled.batching-sends, fink-nottle.sqs.channeled.batching-deletes and fink-nottle.sqs.channeled.receive!

I don't think this will be related to system clock drift as we run ntpd, but I can't rule it out.

The CPU spikes don't correspond to increased throughput. In fact, and this may be a big clue, we're not processing messages at all when the CPU starts to climb (it sits there utilising one of the cores at 100% permanently from then on btw).

Here's a graph that shows a little of what I've described:

CPU usage

Some notes on this image:

In the lower graph (load average) you can see that all three of our instances are ticking over with low CPU usage, until one instance suddenly jumps to 50% (these instances have 2 cores).
In the upper graph, you can see when messages are actually being processed. There are none being consumed at the time that the load begins to rise.

moea commented 8 years ago

I kind of like receive! + a recurring remote error as the culprit. It's going to be the only Eulalie-related thing which is running at that point in the graph, it sounds like. How are you taking values from its channel (<!, <?, etc.)?

moea commented 8 years ago

OK, so after reading the fink-nottle code harder - it terminates on the first error. It's been a while, clearly. I'll add some logging and push a snapshot.

moea commented 8 years ago

I updated 0.6.5-SNAPSHOT to include some logging using Timbre. At :debug, all request and response maps will be printed, otherwise service+target of each req, as well as details of errors and retries. Without configuration, these will go to stdout. Ideally, excluding fink-nottle's version of eulalie and explicitly depending on 0.6.5-SNAPSHOT would be great, if you can tolerate that.

I'm pretty interested in what the problem ends up being - please let me know if there's anything I can do to make this easier to diagnose.

joelittlejohn commented 8 years ago

I'll update to the snapshot and let you know what happens :+1:

joelittlejohn commented 8 years ago

Oh, and we're taking from the channel created by receive! using <!.

joelittlejohn commented 8 years ago

I've deployed the new snapshot. I'll probably go quiet for a while as it could be a week or more before the problem manifests.

moea commented 8 years ago

Understood.

joelittlejohn commented 8 years ago

So it's been nearly a month since I deployed that snapshot. So far nothing :) I really would have expected to reproduce the problem already so lets consider what might explain this:

Some other change in our application has fixed this

I don't believe this is the case, there's been no significant change in the application between the time this was last observed and the time I deployed the new eulalie snapshot. The only change of any kind is that we introduced a go block to consume from the error channel to work around the problem described in nervous-systems/fink-nottle#11
Some other change in eulalie since 0.6.1 fixed the problem

It looks to me like the only candidates for this would be 3076a10 or 9bf1533.
This is a heisenbug - adding the logging changed the behaviour of the system enough to remove the problem

I suppose this would indicate a race condition somewhere, which seems unlikely given the fundamental nature of core.async.

I'm happy to close this and reopen if I can ever reproduce the problem.

moea commented 8 years ago

Thanks for the update - that's the worst kind of good news. Please don't hesitate to let me know if you see anything unexpected going forward. On 11 Jan 2016 09:59, "Joe Littlejohn" notifications@github.com wrote:

So it's been nearly a month since I deployed that snapshot. So far nothing :) I really would have expected to reproduce the problem already so lets consider what might explain this:

1.

Some other change in our application has fixed this

I don't believe this is the case, there's been no significant change in the application between the time this was last observed and the time I deployed the new eulalie snapshot. The only change of any kind is that we introduced a go block to consume from the error channel to work around the problem described in nervous-systems/fink-nottle#11 https://github.com/nervous-systems/fink-nottle/issues/11 2.

Some other change in eulalie since 0.6.1 fixed the problem

It looks to me like the only candidates for this would be 3076a10 https://github.com/nervous-systems/eulalie/commit/3076a10e9474870f7e824a49ea896846de4e643f or 9bf1533 https://github.com/nervous-systems/eulalie/commit/9bf153376cb79f621bc0223dc09e0da4e3da1758 . 3.

This is a heisenbug - adding the logging changed the behaviour of the system enough to remove the problem

I suppose this would indicate a race condition somewhere, which seems unlikely given the fundamental nature of core.async.

I'm happy to close this and reopen if I can ever reproduce the problem.

— Reply to this email directly or view it on GitHub https://github.com/nervous-systems/eulalie/issues/24#issuecomment-170493931 .

nervous-systems / eulalie

Occasionally locked at 100% CPU usage #24

Eulalie

Fink-Nottle