wolfeidau opened this issue 11 years ago
Yikes. Would love to know what is causing the panic for sure!
I would love to accept a PR that improves l2met's own visibility. Maybe there is something in the log pkg that we can use.
Just an update: it runs for a few minutes on production data before crashing.
Using the log package I added timestamps to the log, which gave me a one-second window in which to check router messages; this narrowed it down to between 4 and 20 candidate messages.
I ran around 40 message strings through the receiver unit test and couldn't get it to crash.
So I commented out this line https://github.com/ryandotsmith/l2met/blob/master/bucket/parser.go#L72 and it has been running for about 6 hours.
I will dig into the log package and see if there is a nice way of capturing the tuple with its state and dumping it on a panic, then run some tests with a parallel file of log data.
I did some research; one obvious issue so far is that the string conversion of the Float64 only caters for 10 digits. When you're handling values measured in bytes or milliseconds, this can overflow.
Don't ask me how, but we have clients on very slow links uploading files for a very long time.
Either way, I have had a shot at rewriting that logic, replacing the parse/format/compare/slice sequence with a TrimFunc. Once I am happy with the code I will open a pull request for you to review. The code I am talking about is here:
https://github.com/ryandotsmith/l2met/blob/master/bucket/log_tuple.go#L56-L65
Note this is also marginally faster.
:clap:
Good find. Also, I like the idea of using TrimFunc.
Hello, I wanted to follow up on this. A few things have changed with respect to internally logged metrics. You can read about the metchan in #82, but I thought it would be nice if I pointed out that we can easily accommodate the issue you have raised here with metchan. Currently all logging output happens here. We might be able to use a better logging facility that would support logging timestamps.
Ah, great to hear. So far I haven't experienced this failure with my fork, which includes the changes I pushed to you.
I need to re-sync with your tree and catch up on the changes; it looks like lots of things are happening.
I will take a look at how this can be extended to cater for local logging.
There seems to be an issue with something in the router messages which is triggering a panic.
Because I have no timestamps in my log, it is difficult to correlate with the raw syslog data I am collecting.
I understand this is not necessary when deployed on Heroku, but it would be nice to add a flag to enable it for development and for deployments hosted elsewhere.
Happy to submit a pull request if you're amenable to the idea.
Also, here is the error message that was logged.