cthiel42 closed this issue 2 years ago
So it looks like my timeout in carbonapi was why the query was ending at 4 seconds. I bumped that up to 30 seconds hoping that would solve the problem but that doesn't look like it's the case. Now it just runs for 30 seconds before hitting context canceled.
Any ideas on how to troubleshoot this? Like I said, this query runs for several seconds on the python version, never anywhere near 30 seconds.
Hi @cthiel42
did you try "fail-on-max-globs = false"? Or maybe enable "trie-index"?
fail-on-max-globs = false results in the same context canceled error.
And for trie-index, I've never actually been able to get it to work. Every time I try to use it I get "no metrics found" error logs, or info logs that say 0 files found. If my memory serves correctly, I dug into that issue a while back and it had something to do with file compression being enabled.
I think I've narrowed the issue down to the expandGlobBraces function in carbonserver.go. Running that function locally on my machine takes several minutes with the input I'm throwing at it. I'm going to do some more debugging and figure out what can be done to speed things up.
And for trie-index, I've never actually been able to get it to work. Every time I try to use it I get no metrics found error logs or info logs that say 0 files found. If my memory serves correctly, I dug into that issue a while back and it had something to do with file compression being enabled.
Can you share more details on this issue? We have been using trie for quite some time and it's working well.
I think I've narrowed the issue down to the expandGlobBraces function in carbonserver.go
Can you also share the number of metrics and an example of what your queries look like? It would help us identify the issue.
Oh, and one more thing, what's your version of go-carbon?
There should be some more info in the slow_expand_globs logs. You can set the logger to debug to get more information.
[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "debug"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = "1s"
sample-initial = 5
sample-thereafter = 100
[[logging]]
logger = "access"
file = "/var/log/go-carbon/go-carbon-access.log"
level = "debug"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"
[[logging]]
logger = "stat"
file = "/var/log/go-carbon/go-carbon.log"
level = "error"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"
We're running 0.15.6.
Are you using trie with file compression enabled too? Like I said, it's been a bit since I looked at the trie index but I want to say the problem only occurred with file compression enabled. Any queries basically just return no metrics. It effectively behaves like we don't have a single metric under our metrics directory.
And as far as the query goes, I think that's part of the problem. It looks like expandGlobBraces builds out all the possible metric/file paths from a query that uses curly braces. Our query looks roughly like this:
metric.{10 values}.{46 values}.{46 values}.foo.{65 values}.{423 values}.bar
By my math that means expandGlobBraces needs to build 581 million metric paths. And then it also builds out all the metric paths with .wsp on the end, so double that to get 1.163 billion metric paths.
If I pull each set of those curly braces out and put a wildcard (*) in their place, the query runs in about a second. I think building out all the metric paths is what's causing the timeout here. For a little context, these queries are coming from Grafana, and Grafana handles multiselect variables in its queries using the curly brace notation.
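To illustrate the blow-up described above, here is a minimal sketch of recursive brace expansion (a simplified stand-in, not go-carbon's actual expandGlobBraces implementation): every brace group multiplies the number of candidate paths, so the total is the product of the group sizes, i.e. 10 × 46 × 46 × 65 × 423 ≈ 582 million for the query in question.

```go
package main

import (
	"fmt"
	"strings"
)

// expandBraces materializes every combination of the brace alternatives
// in a query. The output size is the product of the group sizes, which
// is why a handful of large groups explodes into hundreds of millions
// of candidate paths.
func expandBraces(query string) []string {
	start := strings.IndexByte(query, '{')
	if start < 0 {
		return []string{query}
	}
	end := start + strings.IndexByte(query[start:], '}')
	var out []string
	for _, alt := range strings.Split(query[start+1:end], ",") {
		out = append(out, expandBraces(query[:start]+alt+query[end+1:])...)
	}
	return out
}

func main() {
	paths := expandBraces("metric.{a,b}.{x,y,z}.bar")
	fmt.Println(len(paths)) // 2 * 3 = 6 combinations
}
```

A trie or wildcard-based match avoids ever materializing this list, which is why replacing the groups with * is so much faster.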
metric.{10 values}.{46 values}.{46 values}.foo.{65 values}.{423 values}.bar
By my math that means the expandGlobBraces needs to build 581 million metric paths. And then it also builds out all the metric paths with .wsp on the end so double that to get 1.163 billion metric paths.
Nice finding. I think trie-index might be able to avoid calling CarbonserverListener.expandGlobBraces for most of the queries.
If I pull each set of those curly braces out and put a wildcard (*) in their place, the query runs in about a second. I think building out all the metric paths is what's causing the time out here. For a little context, these queries are coming from grafana, and grafana handles multiselect variables in their queries using the curly brace notation.
I think there is an Include All option for Variables in Grafana that uses * by default.
Are you using trie with file compression enabled too? Like I said, it's been a bit since I looked at the trie index but I want to say the problem only occurred with file compression enabled. Any queries basically just return no metrics. It effectively behaves like we don't have a single metric under our metrics directory.
Yes, we do have a cluster using both compression and trie index and it's working. Would love to learn about your issue if you want to try it out again in the future.
Alright, I'll give trie index another go. When I run it, these are the kind of logs I get. It basically just doesn't find any metrics.
[2021-08-09T18:50:26.170Z] INFO [carbonserver] starting carbonserver {"listen": "x.x.x.x:8080", "whisperData": "/opt/graphite/storage/whisper", "maxGlobs": 100000000000, "scanFrequency": "2m30s"}
[2021-08-09T18:50:26.170Z] INFO [carbonserver] file list updated {"handler": "fileListUpdated", "trie_depth": 0, "longest_metric": "", "file_scan_runtime": 0.000013604, "indexing_runtime": 0.000000689, "rdtime_update_runtime": 0.000000254, "total_runtime": 0.00000137, "Files": 0, "index_size": 0, "pruned_trigrams": 0, "index_type": "trie", "read_from_cache": false}
[2021-08-09T18:52:56.170Z] INFO [carbonserver] file list updated {"handler": "fileListUpdated", "trie_depth": 0, "longest_metric": "", "file_scan_runtime": 0.000011512, "indexing_runtime": 0.000001666, "rdtime_update_runtime": 0.000000156, "total_runtime": 0.000002338, "Files": 0, "index_size": 0, "pruned_trigrams": 0, "index_type": "trie", "read_from_cache": false}
[2021-08-09T18:53:20.939Z] DEBUG [carbonserver] expandGlobs result {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "action": "expandGlobs", "metrics": ["*"], "result": [{"Name":"*","Files":null,"Leafs":null}]}
[2021-08-09T18:53:20.939Z] DEBUG [carbonserver] will send out response {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "response": {"metrics":[{"name":"*","matches":[]}]}}
[2021-08-09T18:53:20.939Z] INFO [access] find success {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "runtime_seconds": 0.000208737, "Files": 0, "find_cache_enabled": false, "from_cache": false, "http_code": 200}
Alright I did some more troubleshooting and found that the filepath walk it does starting here returns no files. I pulled the basics of that code out and ran it on my local machine to test that it walked the directory correctly and it did. I ran that same script on the graphite server's whisper directory and I get nothing. I ran it on a different directory on the graphite server and it listed everything fine.
I have my whisper directory symlinked, and it turns out Go's walk function doesn't follow symlinked paths, as specified here.
I switched my data-dir to the target of that symlink and now I'm getting metrics. Not all of them though, it looks like I'm getting some errors when the walk hits an empty folder. Any thoughts there? Maybe put some error handling in for empty folders?
I switched my data-dir to the target of that symlink and now I'm getting metrics. Not all of them though, it looks like I'm getting some errors when the walk hits an empty folder. Any thoughts there? Maybe put some error handling in for empty folders?
Hmm, I think this is shared logic between the trie and trigram indexes. So in theory, the trigram index should also have the same problem.
I don't understand why empty folders would cause issues though. Do you mean the go-carbon query API doesn't return the folder?
(And thanks for the detailed report and testing!)
Here is a log of the error I'm getting with the empty folders:
ERROR [carbonserver] error getting file list {"handler": "fileListUpdated", "error": "updateFileList.trie: metric fileename is nil: appname-prd-shared/datacenter/webservice/appname/currentblockedbandwidthbytes/"}
@cthiel42 this looks like there is a whisper file with no name, appname-prd-shared/datacenter/webservice/appname/currentblockedbandwidthbytes/.wsp, in this path?
But go-carbon should probably just log the error and continue indexing. Will make a patch for this issue.
hi @cthiel42, I have pushed an MR for the index issues you reported: https://github.com/go-graphite/go-carbon/pull/427
If you can give it a shot to confirm that it fixes your problem, it would be great!
So that gets me mostly there. It finishes building the entire index and logs everything that it wasn't able to index. But the problem I have now is that the empty folders don't actually show up as empty. For instance, the totalblockedbandwidthbytes folder is empty, but I've got metrics showing up in it. It looks like the metrics are all the folders or metrics at the same level as totalblockedbandwidthbytes that come after it alphabetically.
I also noticed that another empty folder called currentblockedbandwidthbytes, at the same level as totalblockedbandwidthbytes, had similar behavior.
And lastly, I did confirm that there is a hidden file under these empty directories called .wsp. I tried deleting that file, but shortly after it was recreated. I haven't been able to find any error logs for what is happening there. It does appear that all of these empty folders are coming from the same sort of metric collection agent. I'll talk with some of the teams and see if I can figure out where these metrics actually originate.
Other than that, everything appears to be functioning well.
Cool, thanks for checking it out!
For the empty folder, that's new to me. Not sure what the cause is. How about using the graphite API:
curl http://127.0.0.1:8080/metrics/find/?query=path.to.totalblockedbandwidthbytes.*
Does it return an expected outcome?
When I do that query (with trie index turned on) it returns all the metrics you see in my screenshot. This doesn't really affect me much but it'd be nice to not have to worry about it.
If I shut trie index off, then I get the following response
{"metrics":[{"name":"path/to/totalblockedbandwidthbytes/*","matches":[{"path":"path.to.totalblockedbandwidthbytes.","isLeaf":true}]}]}
@bom-d-van I wanted to follow up on this since it's causing me a few query issues in our production environment and I saw it got merged into master. I think the issue in the code is somewhere in carbonserver/trie.go, lines 633 - 651.
if !isFile {
    if cur.dir() {
        return nil
    }
    var newDir = true
    for _, child := range *cur.childrens {
        if child.dir() {
            cur = child
            newDir = false
            break
        }
    }
    if newDir {
        cur.addChild(ti.newDir())
    }
    return nil
}
If I change that back to what it was before the patch, all the weird behavior with empty folders is gone.
if !isFile {
    if cur.dir() {
        cur.addChild(&trieNode{c: []byte{'/'}, childrens: &[]*trieNode{}, gen: ti.root.gen})
    }
    return nil
}
If you wanted to replicate this issue you can just run go-carbon off master, enable trie index, and put an empty folder in the whisper data directory. Off the current master branch you'll get something like this:
Note that empty_folder_test is an empty folder on the filesystem, yet the trie index is saying it contains an exact copy of everything after it alphabetically at the same level. If you switch out that bit of code that I mentioned above, the empty directory doesn't get indexed and doesn't show up in the query, which is how I think it should be.
the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.
I would treat that as a bug. An empty directory shouldn't magically show metrics that aren't actually there.
the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.
Well, that's debatable. If a directory contains no whisper files, then it should not be indexed or visible, by the definition of a metric, IMO.
hi @cthiel42, I can't reproduce the issue. I wonder if it's because we are using different versions of the graphite dashboard?
I created two empty dirs, one called ddos_nil and one called nil. They both show as empty dirs in the result.
We do have cronjobs on our production systems periodically removing empty directories using find. So maybe that's why we don't notice empty directories.
One thing we can also do is make the trie also prune empty dir nodes.
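For reference, such a cronjob can be a single find invocation (the whisper path below is an assumption; point it at your actual data-dir):

```shell
# Hypothetical cleanup job: remove directories that contain no entries.
# -delete processes depth-first, so nested empty directories are removed
# from the bottom up in a single run.
WHISPER_DIR="${WHISPER_DIR:-/opt/graphite/storage/whisper}"
find "$WHISPER_DIR" -mindepth 1 -type d -empty -delete
```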
Is your graphite-web configured to query the file system directly or configured to query carbonserver? I had to set the data directory to a bad path so graphite-web wouldn't query the file system and would instead query the URLs in the cluster server configuration.
Possibly an easier alternative would be to just use carbonapi instead of graphite-web. The issue is with go-carbon so the behavior is the same there, and it's easier to manage than graphite-web.
And I have a cronjob running to clear out empty folders and stale metrics too. The problem is these will just get created again if you delete the folder since there's a metric with no name getting sent to graphite.
Is your graphite-web configured to query the file system directly or configured to query carbonserver? I had to set the data directory to a bad path so graphite-web wouldn't query the file system and would instead query the URLs in the cluster server configuration.
I'm using carbonserver/go-carbon.
I pushed a quick fix here: https://github.com/go-graphite/go-carbon/pull/434
Maybe you want to see if this could fix your issue.
The problem is these will just get created again if you delete the folder since there's a metric with no name getting sent to graphite
I think this might be the true cause.
Is your graphite-web configured to query the file system directly or configured to query carbonserver
actually, no. Our graphite-web talks to carbonzipper. I think this might explain why I don't see the empty dir.
@cthiel42 maybe try the patch that I posted above. or consider changing your graphite-web to talk to carbonzipper instead, to avoid the issues.
That patch seems to get rid of the issue in my test environment. I'm going to do some more testing and maybe try it in prd overnight if everything looks good. I'll let you know how it goes.
@bom-d-van I'm running that patch in our production environment now and everything seems to work as expected. I'll let you know if anything else comes up. Thanks for your help.
Have there been any updates on this lately? We have a cluster with 3 billion known metrics running on 2x50 r5.2xlarge, and we're seeing these systems OOM out.
@jdblack : fix was done in https://github.com/go-graphite/go-carbon/pull/445, it was merged and released.
@jdblack I never had any OOM errors. I only got context canceled due to timeouts, which I was able to solve by enabling trie index. Basically without trie index, it would try to build out an absurd amount of metric paths and never finish.
This has been resolved and the patch has been merged so I'm going to close this issue.
Have there been any updates on this lately? We have a cluster with 3 billion known metrics running on 2x50 r5.2xlarge, and we're seeing these systems OOM out.
@jdblack wow, 3 billion?! are they all in the same instance? or is it a distributed cluster? can you share your config and cluster layout?
I have a few large queries that are causing me issues. I originally had some larger queries with a lot of wildcards that would cause me errors like this:
I increased the max globs in the config and that seemed to fix the problem for most of the larger queries, but there's still one left that was causing me issues, so I increased the max globs to an absurd value. This time I get a different error though:
My question is: what do I need to change to get this query to run? Skimming through the carbonserver code, it looks like this error isn't necessarily related to the value of max globs, but is more like an issue with timeouts. On our python-based graphite this query takes about 8 or 9 seconds, and the timeout is set much higher than that in our config.
Our go-carbon setup is running on a 16 core box with EBS volumes for storage, and currently hovering around 8 million metrics. Here is our go-carbon config: