Closed: josephruscio closed this issue 11 years ago
Naive question, but have we ruled out threading problems in these scenarios?
Quite possibly; our setup delivers metrics to two different Librato accounts from different threads.
Is there anything I could add to our collector to gather more information?
This brings up a good point: it would make sense to make Librato::Metrics.submit()
threadsafe. Why not create a new client instance for each submit?
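A minimal sketch of that per-submit-client idea, assuming the gem's Client#authenticate and Client#new_queue calls; the credentials and metric name below are placeholders, not taken from this thread:

require 'librato/metrics'

# Build a fresh client and queue for this submit only, so no state is
# shared between threads.
client = Librato::Metrics::Client.new
client.authenticate 'user@example.com', 'api_key'  # placeholder credentials
queue = client.new_queue
queue.add 'requests' => 10                          # placeholder metric
queue.submit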
@roidrage Are you using queues to submit the metrics? A quick test would be to add a mutex lock around access to the queue and see if the error stops occurring. To avoid holding the lock during submit, you should be able to do something like the following. I'd have to get @nextmat's confirmation on that.
$queue_lock = Mutex.new  # guards all access to $active_queue

submit_queue = nil
$queue_lock.synchronize do
  submit_queue  = $active_queue
  $active_queue = build_new_queue  # swap in a fresh queue while holding the lock
end
submit_queue.submit  # the slow network call happens outside the lock
Is the purpose of that code to always use a fresh queue for every submit?
On a related note: each queue object is kept in a single thread and only accessed from inside that thread. Every thread holds its own queue, which uses its own client object.
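A minimal sketch of that per-thread setup, using thread-local storage; the client construction and authentication calls are assumptions based on the gem's documented API, not code from this thread:

require 'librato/metrics'

# Lazily build one client and queue per thread, cached in thread-local storage.
def per_thread_queue
  Thread.current[:librato_queue] ||= begin
    client = Librato::Metrics::Client.new
    client.authenticate ENV['LIBRATO_USER'], ENV['LIBRATO_TOKEN']  # placeholder env vars
    client.new_queue
  end
end

per_thread_queue.add 'jobs.processed' => 1  # placeholder metric
per_thread_queue.submit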
@roidrage Ahh, if that's the case then the threading is likely not the culprit here. The code above allows a single queue that was being accessed from multiple threads to be submitted without holding the lock around the queue. The #add operations on a queue should be fast, whereas #submit blocks on the network.
@roidrage @eric Can either of you capture the exception dump from one of the 400s? It should include the submitted body of the request which would be very helpful.
Did this start recently or has it been going on for a while? Queues won't actually submit an HTTP request on #submit if they are empty, so I think the request either has to be malformed (the body would help with that), or maybe there's some edge case on the API side?
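For capturing one of those dumps, something along these lines should work; the exact exception class raised by the gem isn't confirmed in this thread, so this sketch rescues broadly:

begin
  queue.submit
rescue StandardError => e
  # The exception message should include the submitted request body.
  warn "#{e.class}: #{e.message}"
  warn e.backtrace.join("\n")
  raise
end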
This has been going on for quite some time as far as I can remember. Here's a dump of the most recent error: https://gist.github.com/4482586
Thanks for the dump, I have a couple of examples to work from now. After discussion I'm thinking I may throw together a branch with some extra debugging to see if we can get more information. Will follow up with that when available.
One of our monitoring scripts hit this last night and it definitely does not use threads. It was using an older version, 0.7.4, but the dump output looked identical.
I believe I've sorted this issue out. It looks like in some cases the retry middleware was retrying requests without the proper request body (it had been wiped by the previous request's response). Interestingly, it looks like the main faraday retry middleware probably has this issue as well.
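The actual patch isn't shown in this thread, but the general shape of a body-preserving retry middleware looks roughly like this; a sketch against the older Faraday env-hash API, with the class name and retry policy as assumptions:

class BodyPreservingRetry < Faraday::Middleware
  def initialize(app, max_retries = 3)
    super(app)
    @max_retries = max_retries
  end

  def call(env)
    original_body = env[:body]   # capture before any attempt can clobber it
    attempts = 0
    begin
      env[:body] = original_body # restore the body on every retry
      @app.call(env)
    rescue StandardError
      attempts += 1
      retry if attempts <= @max_retries  # a real middleware would only retry transient network errors
      raise
    end
  end
end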
Should be fixed in 1.0.3; upgrade and re-open if this problem still exists?
@technoweenie this is the bug I was talking about. @nextmat thinks it might be an issue in the main faraday retry middleware as well?
This has now been reported by a handful of users (IIRC @roidrage, @eric, @doctorjnupe, among others). Sporadically the gem returns a 400 indicating that no measurement was included in the request, even though the request does contain a measurement.