librato / librato-metrics

Ruby wrapper to make it easy to interact with Librato's API.
https://librato.com
Other
108 stars 52 forks source link

Sporadic "Must contain a measurement" errors #58

Closed josephruscio closed 11 years ago

josephruscio commented 11 years ago

This has now been reported by a handful of users (IIRC e.g. @roidrage, @eric, @ doctorjnupe, etc). Sporadically the gem will return a 400 indicating no measurement was included in the request, even though the request does contain a measurement.

mheffner commented 11 years ago

Naive question, but have we ruled out threading problems in these scenarios?

roidrage commented 11 years ago

Quite possible, our setup delivers metrics to two different Librato accounts from different threads.

Anything I could add to our collector to collect more information?

eric commented 11 years ago

It brings up a good point that it would make sense to make Librato::Metrics.submit() threadsafe — why not create a new client instance for each submit?

mheffner commented 11 years ago

@roidrage Are you using queues to submit the metrics. A quick test would be to add a mutex lock around access to the queue and see if the error stops occurring. To avoid holding the lock during submit you should be able to do something like the following. Would have to get @nextmat 's confirmation on that.

lock()
submit_queue = $active_queue
$active_queue = build_new_queue()
unlock()
submit_queue.submit()
roidrage commented 11 years ago

Is the purpose of that code to always use a fresh queue for every submit?

On a related note: the queue object itself is only kept in a single thread, and only accessed and used from inside that thread. Every thread holds its own queue which uses its own client object.

mheffner commented 11 years ago

@roidrage Ahh, if that's the case then the threading is likely not the culprit here. The code above allows a single queue that was being accessed from multiple threads to be submitted without holding the lock around the queue. The add operations on a queue should be fast, whereas the submit requires blocking on the network.

nextmat commented 11 years ago

@roidrage @eric Can either of you capture the exception dump from one of the 400s? It should include the submitted body of the request which would be very helpful.

Did this start recently or has it been going on for a while? Queues won't actually submit an http request on #submit if they are empty, so I think the request either has to be malformed (the body would help with that), or maybe some edge case api-side?

roidrage commented 11 years ago

This has been going on for quite some time as far as I can remember. Here's a dump of the most recent error: https://gist.github.com/4482586

nextmat commented 11 years ago

Thanks for the dump, I have a couple examples to work from now - after discussion I'm thinking I may throw together a branch with some more debugging to see if we can get some more information. Will followup with that when available.

mheffner commented 11 years ago

One of our monitoring scripts hit this last night and it definitely does not use threads. It was using an older version, 0.7.4, but the dump output looked identical.

nextmat commented 11 years ago

I believe I've sorted this issue out, looks like in some cases the retry middleware was retrying requests without the proper request body (wiped by the previous request response). Interestingly it looks like the main faraday retry middleware probably has this issue as well.

Should be fixed on 1.0.3 - upgrade and re-open if this problem still exists?

josephruscio commented 11 years ago

@technoweenie this is the bug I was talking about, @nextmat thinks it might be an issue in the main faraday retry middleware as well?