go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

[Q] Context Canceled Errors on Large Queries to Carbonserver #424

Closed: cthiel42 closed this issue 2 years ago

cthiel42 commented 3 years ago

I have a few large queries that are causing me issues. I originally had some larger queries with a lot of wildcards that would cause me errors like this:

"error": "find failed, can't expand globs"

I increased the max globs in the config and that seemed to fix the problem for most of the larger queries, but there's still one left that was causing me issues, so I increased the max globs to an absurd value. This time I get a different error though:

ERROR [access] find failed "format": "carbonapi_v3_pb", "runtime_seconds": 4.00071045, "reason": "Internal error while processing request", "error": "could not expand globs - context canceled", "http_code": 500

INFO [carbonserver] slow_expand_globs

My question is: what do I need to change to get this query to run? Skimming through the carbonserver code, it looks like this error isn't necessarily related to the value of max globs, but is more of an issue with timeouts. On our Python-based Graphite this query takes about 8 or 9 seconds, and the timeout is set much higher than that in our config.

Our go-carbon setup is running on a 16 core box with EBS volumes for storage, and currently hovering around 8 million metrics. Here is our go-carbon config:

[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 14

[whisper]
data-dir = "/opt/graphite/storage/whisper"
schemas-file = "/opt/go-carbon/storage-schemas.conf"
aggregation-file = "/opt/go-carbon/storage-aggregation.conf"
workers = 6
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = true
enabled = true
hash-filenames = true
compressed = true
remove-empty-file = false

[cache]
max-size = 0
write-strategy = "noop"

[udp]
listen = ":2003"
enabled = true
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false

[carbonserver]
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = false
query-cache-size-mb = 2000
find-cache-enabled = false
trigram-index = true
scan-frequency = "2m30s"
trie-index = false
cache-scan = false
max-globs = 100000000
fail-on-max-globs = true
max-metrics-globbed = 100000000
max-metrics-rendered = 1000000
graphite-web-10-strict-mode = true
stats-percentiles = [99, 98, 95, 75, 50]

[pprof]
listen = "localhost:7007"
enabled = false

cthiel42 commented 3 years ago

So it looks like my timeout in carbonapi was why the query was ending at 4 seconds. I bumped that up to 30 seconds hoping that would solve the problem but that doesn't look like it's the case. Now it just runs for 30 seconds before hitting context canceled.

Any ideas on how to troubleshoot this? Like I said, this query runs for several seconds on the python version, never anywhere near 30 seconds.

deniszh commented 3 years ago

Hi @cthiel42

Did you try "fail-on-max-globs = false"? Or maybe enable "trie-index"?

cthiel42 commented 3 years ago

fail-on-max-globs = false results in the same context canceled error.

And for trie-index, I've never actually been able to get it to work. Every time I try to use it, I get "no metrics found" error logs or info logs saying 0 files were found. If my memory serves correctly, I dug into that issue a while back and it had something to do with file compression being enabled.

cthiel42 commented 3 years ago

I think I've narrowed the issue down to the expandGlobBraces function in carbonserver.go

Running that function locally on my machine takes several minutes with the input I'm throwing at it. I'm going to do some more debugging and figure out what can be done to speed things up.

bom-d-van commented 3 years ago

And for trie-index, I've never actually been able to get it to work. Every time I try to use it, I get "no metrics found" error logs or info logs saying 0 files were found. If my memory serves correctly, I dug into that issue a while back and it had something to do with file compression being enabled.

Can you share more details on this issue? We have been using the trie index for quite some time and it's working well.

I think I've narrowed the issue down to the expandGlobBraces function in carbonserver.go

Can you also share the number of metrics and an example of what your queries look like? That would help us identify the issue.

bom-d-van commented 3 years ago

Oh, and one more thing, what's your version of go-carbon?

There should be some more info in the slow_expand_globs logs.

You can set the logger to debug to get more information:

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "debug"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = "1s"
sample-initial = 5
sample-thereafter = 100

[[logging]]
logger = "access"
file = "/var/log/go-carbon/go-carbon-access.log"
level = "debug"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"

[[logging]]
logger = "stat"
file = "/var/log/go-carbon/go-carbon.log"
level = "error"
encoding = "json"
encoding-time = "iso8601"
encoding-duration = "seconds"
cthiel42 commented 3 years ago

We're running 0.15.6.

Are you using trie with file compression enabled too? Like I said, it's been a bit since I looked at the trie index but I want to say the problem only occurred with file compression enabled. Any queries basically just return no metrics. It effectively behaves like we don't have a single metric under our metrics directory.

And as far as the query goes, I think that's part of the problem. It looks like expandGlobBraces builds out all the possible metric/file paths from a query that uses curly braces. Our query looks roughly like this:

metric.{10 values}.{46 values}.{46 values}.foo.{65 values}.{423 values}.bar

By my math (10 × 46 × 46 × 65 × 423), that means expandGlobBraces needs to build about 581 million metric paths. And then it also builds out all the metric paths with .wsp on the end, so double that to get roughly 1.163 billion metric paths.

If I pull each set of those curly braces out and put a wildcard (*) in their place, the query runs in about a second. I think building out all the metric paths is what's causing the time out here. For a little context, these queries are coming from grafana, and grafana handles multiselect variables in their queries using the curly brace notation.
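
For illustration, here is a minimal sketch of naive brace expansion (an assumption for illustration only, not go-carbon's actual expandGlobBraces code); every {...} group multiplies the number of candidate paths:

package main

import (
    "fmt"
    "strings"
)

// expand recursively expands the first {...} group in the pattern.
func expand(pattern string) []string {
    open := strings.Index(pattern, "{")
    if open < 0 {
        return []string{pattern}
    }
    closing := strings.Index(pattern[open:], "}")
    if closing < 0 {
        return []string{pattern}
    }
    closing += open
    var out []string
    for _, alt := range strings.Split(pattern[open+1:closing], ",") {
        out = append(out, expand(pattern[:open]+alt+pattern[closing+1:])...)
    }
    return out
}

func main() {
    // Two small groups already produce 2*3 = 6 candidate paths; groups of
    // 10, 46, 46, 65 and 423 values produce roughly 581 million.
    for _, p := range expand("metric.{a,b}.{x,y,z}.bar") {
        fmt.Println(p)
    }
}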

bom-d-van commented 3 years ago

metric.{10 values}.{46 values}.{46 values}.foo.{65 values}.{423 values}.bar

By my math that means the expandGlobBraces needs to build 581 million metric paths. And then it also builds out all the metric paths with .wsp on the end so double that to get 1.163 billion metric paths.

Nice find. I think trie-index might be able to avoid calling CarbonserverListener.expandGlobBraces for most queries.

If I pull each set of those curly braces out and put a wildcard (*) in their place, the query runs in about a second. I think building out all the metric paths is what's causing the time out here. For a little context, these queries are coming from grafana, and grafana handles multiselect variables in their queries using the curly brace notation.

I think there is an "Include All" option for variables in Grafana that uses * by default.

Are you using trie with file compression enabled too? Like I said, it's been a bit since I looked at the trie index but I want to say the problem only occurred with file compression enabled. Any queries basically just return no metrics. It effectively behaves like we don't have a single metric under our metrics directory.

Yes, we do have a cluster using both compression and trie index and it's working. Would love to learn about your issue if you want to try it out again in the future.
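
On the point about the trie index avoiding expandGlobBraces, here is a rough, simplified sketch (an assumption, not the actual trie code) of why per-segment matching sidesteps the Cartesian explosion: each {v1,v2,...} group is checked as a set-membership test against each existing path segment, so the cost scales with the files that exist rather than with the product of all the groups:

package main

import (
    "fmt"
    "strings"
)

// segmentMatches reports whether one existing path segment satisfies one query
// segment, where the query segment is a literal, "*", or a {v1,v2,...} group.
func segmentMatches(query, segment string) bool {
    if strings.HasPrefix(query, "{") && strings.HasSuffix(query, "}") {
        for _, alt := range strings.Split(query[1:len(query)-1], ",") {
            if alt == segment {
                return true
            }
        }
        return false
    }
    return query == "*" || query == segment
}

func main() {
    query := strings.Split("metric.{a,b}.foo", ".")
    existing := []string{"metric.a.foo", "metric.c.foo"} // hypothetical indexed metrics
    for _, metric := range existing {
        path := strings.Split(metric, ".")
        ok := len(path) == len(query)
        // Match segment by segment; no full expansion of the brace groups is needed.
        for i := 0; ok && i < len(query); i++ {
            ok = segmentMatches(query[i], path[i])
        }
        fmt.Println(metric, "matches:", ok)
    }
}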

cthiel42 commented 3 years ago

Alright, I'll give the trie index another go. When I run it, these are the kinds of logs I get. It basically just doesn't find any metrics.

[2021-08-09T18:50:26.170Z] INFO [carbonserver] starting carbonserver {"listen": "x.x.x.x:8080", "whisperData": "/opt/graphite/storage/whisper", "maxGlobs": 100000000000, "scanFrequency": "2m30s"}
[2021-08-09T18:50:26.170Z] INFO [carbonserver] file list updated {"handler": "fileListUpdated", "trie_depth": 0, "longest_metric": "", "file_scan_runtime": 0.000013604, "indexing_runtime": 0.000000689, "rdtime_update_runtime": 0.000000254, "total_runtime": 0.00000137, "Files": 0, "index_size": 0, "pruned_trigrams": 0, "index_type": "trie", "read_from_cache": false}
[2021-08-09T18:52:56.170Z] INFO [carbonserver] file list updated {"handler": "fileListUpdated", "trie_depth": 0, "longest_metric": "", "file_scan_runtime": 0.000011512, "indexing_runtime": 0.000001666, "rdtime_update_runtime": 0.000000156, "total_runtime": 0.000002338, "Files": 0, "index_size": 0, "pruned_trigrams": 0, "index_type": "trie", "read_from_cache": false}
[2021-08-09T18:53:20.939Z] DEBUG [carbonserver] expandGlobs result {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "action": "expandGlobs", "metrics": ["*"], "result": [{"Name":"*","Files":null,"Leafs":null}]}
[2021-08-09T18:53:20.939Z] DEBUG [carbonserver] will send out response {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "response": {"metrics":[{"name":"*","matches":[]}]}}
[2021-08-09T18:53:20.939Z] INFO [access] find success {"handler": "find", "url": "/metrics/find/?format=carbonapi_v3_pb", "peer": "x.x.x.x:42080", "query": ["*"], "format": "carbonapi_v3_pb", "runtime_seconds": 0.000208737, "Files": 0, "find_cache_enabled": false, "from_cache": false, "http_code": 200}

cthiel42 commented 3 years ago

Alright, I did some more troubleshooting and found that the filepath walk it does starting here returns no files. I pulled the basics of that code out and ran it on my local machine to test that it walked the directory correctly, and it did. I ran that same script on the graphite server's whisper directory and got nothing. I ran it on a different directory on the graphite server and it listed everything fine.

I have my whisper directory symlinked, and it turns out Golang's walk function doesn't follow symlinked paths, as specified here.

I switched my data-dir to the target of that symlink and now I'm getting metrics. Not all of them, though; it looks like I'm getting some errors when the walk hits an empty folder. Any thoughts there? Maybe put some error handling in for empty folders?
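
For reference, a minimal sketch of the symlink behavior (the paths are hypothetical and this is not go-carbon code): filepath.Walk lstats the root, sees a symlink rather than a directory, and never descends, so resolving the link first, or pointing data-dir at the real directory, lets the walk see the files:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// countWsp walks root and counts the .wsp files it sees.
func countWsp(root string) int {
    n := 0
    filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err == nil && !info.IsDir() && filepath.Ext(path) == ".wsp" {
            n++
        }
        return nil
    })
    return n
}

func main() {
    link := "/opt/graphite/storage/whisper" // hypothetical: a symlink to the real data dir

    // Walk lstats the root, sees a symlink instead of a directory, and never
    // descends, so no whisper files are found.
    fmt.Println("via symlink:", countWsp(link))

    // Resolving the symlink (or pointing data-dir at the target) fixes it.
    if target, err := filepath.EvalSymlinks(link); err == nil {
        fmt.Println("via target:", countWsp(target))
    }
}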

bom-d-van commented 3 years ago

I switched my data-dir to the target of that symlink and now I'm getting metrics. Not all of them though, it looks like I'm getting some errors when the walk hits an empty folder. Any thoughts there? Maybe put some error handling in for empty folders?

Hmm, I think this is shared logic between the trie and trigram indexes. So in theory, the trigram index should also have the same problem.

I don't understand why empty folders would cause an issue, though. Do you mean the go-carbon query API doesn't return the folder?

(And thanks for the detailed report and testing!)

cthiel42 commented 3 years ago

Here is a log of the error I'm getting with the empty folders:

ERROR [carbonserver] error getting file list {"handler": "fileListUpdated", "error": "updateFileList.trie: metric fileename is nil: appname-prd-shared/datacenter/webservice/appname/currentblockedbandwidthbytes/"}

bom-d-van commented 3 years ago

@cthiel42 this looks like there is a whisper file with no name (appname-prd-shared/datacenter/webservice/appname/currentblockedbandwidthbytes/.wsp) in this path?

But go-carbon should probably just log the error and continue indexing. I'll make a patch for this issue.
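
A rough sketch of that log-and-continue idea (illustrative only; the paths and the insert function are made up and this is not the actual patch):

package main

import (
    "errors"
    "log"
    "strings"
)

// insert is a hypothetical stand-in for the index insert; it rejects paths
// whose final component is just ".wsp" (an empty metric name), mirroring the
// error in the log above.
func insert(path string) error {
    if strings.HasSuffix(path, "/.wsp") {
        return errors.New("metric filename is nil")
    }
    return nil
}

func main() {
    paths := []string{
        "appname/webservice/good_metric.wsp",
        "appname/webservice/currentblockedbandwidthbytes/.wsp", // the bad entry
        "appname/webservice/another_good_metric.wsp",
    }
    indexed := 0
    for _, p := range paths {
        // Log-and-continue: one bad path no longer aborts the whole index build.
        if err := insert(p); err != nil {
            log.Printf("skipping unindexable path %q: %v", p, err)
            continue
        }
        indexed++
    }
    log.Printf("indexed %d of %d paths", indexed, len(paths))
}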

bom-d-van commented 3 years ago

Hi @cthiel42, I have pushed an MR for the index issues you reported: https://github.com/go-graphite/go-carbon/pull/427

If you can give it a shot to confirm that it fixes your problem, it would be great!

cthiel42 commented 3 years ago

So that gets me mostly there. It finishes building the entire index and logs everything that it wasn't able to index. But the problem I have now is that the empty folders don't actually show up as empty. For instance, the totalblockedbandwidthbytes folder is empty, but I've got metrics showing up in it. It looks like those metrics are all the folders or metrics at the same level as totalblockedbandwidthbytes that come after it alphabetically.

Screen Shot 2021-08-11 at 7 26 32 AM

I also noticed that another empty folder called currentblockedbandwidthbytes, at the same level as totalblockedbandwidthbytes, had similar behavior.

Screen Shot 2021-08-11 at 7 26 23 AM

And lastly, I did confirm that there is a hidden file called .wsp under these empty directories. I tried deleting that file, but shortly after it was recreated. I haven't been able to find any error logs for what is happening there. It does appear that all of these empty folders are coming from the same sort of metric collection agent. I'll talk with some of the teams and see if I can figure out where these metrics actually originate.

Screen Shot 2021-08-11 at 7 59 33 AM

Other than that, everything appears to be functioning well.

bom-d-van commented 3 years ago

Cool, thanks for checking it out!

For the empty folder, that's new to me. I'm not sure what the cause is. How about using the Graphite API:

curl http://127.0.0.1:8080/metrics/find/?query=path.to.totalblockedbandwidthbytes.*

Does it return an expected outcome?

cthiel42 commented 3 years ago

When I do that query (with trie index turned on) it returns all the metrics you see in my screenshot. This doesn't really affect me much but it'd be nice to not have to worry about it.

If I shut trie index off, then I get the following response

{"metrics":[{"name":"path/to/totalblockedbandwidthbytes/*","matches":[{"path":"path.to.totalblockedbandwidthbytes.","isLeaf":true}]}]}

cthiel42 commented 2 years ago

@bom-d-van I wanted to follow up on this since it's causing me a few query issues in our production environment and I saw it got merged into master. I think the issue in the code is somewhere in carbonserver/trie.go, lines 633 - 651.

    if !isFile {
        if cur.dir() {
            return nil
        }
        var newDir = true
        for _, child := range *cur.childrens {
            if child.dir() {
                cur = child
                newDir = false
                break
            }
        }
        if newDir {
            cur.addChild(ti.newDir())
        }
        return nil
    }

If I change that back to what it was before the patch, all the weird behavior with empty folders is gone.

    if !isFile {
        if cur.dir() {
            cur.addChild(&trieNode{c: []byte{'/'}, childrens: &[]*trieNode{}, gen: ti.root.gen})
        }
        return nil
    }

If you want to replicate this issue, you can just run go-carbon off master, enable the trie index, and put an empty folder in the whisper data directory. Off the current master branch you'll get something like this:

Screen Shot 2021-10-06 at 6 29 01 PM

Note that empty_folder_test is an empty folder on the filesystem, yet the trie index is saying it contains an exact copy of everything after it alphabetically at the same level. If you switch out that bit of code that I mentioned above, the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.

Screen Shot 2021-10-06 at 6 25 12 PM

deniszh commented 2 years ago

the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.

I would treat that as a bug. An empty directory should not magically show metrics that are actually not there.

the empty directory doesn't get indexed, and doesn't show up in the query, which is how I think it should be.

Well, that's debatable. If a directory contains no whisper files, then it should not be indexed or visible, by metric definition, IMO.

bom-d-van commented 2 years ago

Hi @cthiel42, I can't reproduce the issue. I wonder if it's because we are using different versions of the Graphite dashboard?

I created two empty dirs, one called ddos_nil and one called nil. They both show as empty dirs in the result.

[screenshot]

We do have cronjobs on our production systems that periodically remove empty directories using find, so maybe that's why we don't notice empty directories.

One thing we could also do is make the trie prune empty dir nodes as well.

cthiel42 commented 2 years ago

Is your graphite-web configured to query the file system directly, or configured to query carbonserver? I had to set the data directory to a bad path so graphite-web wouldn't query the file system and would instead query the URLs in the cluster server configuration.

Possibly an easier alternative would be to just use carbonapi instead of graphite-web. The issue is with go-carbon so the behavior is the same there, and it's easier to manage than graphite-web.

And I have a cronjob running to clear out empty folders and stale metrics too. The problem is these will just get created again if you delete the folder since there's a metric with no name getting sent to graphite.

bom-d-van commented 2 years ago

Is your graphite-web configured to query the file system directly, or configured to query carbonserver? I had to set the data directory to a bad path so graphite-web wouldn't query the file system and would instead query the URLs in the cluster server configuration.

I'm using carbonserver/go-carbon.

I pushed a quick fix here: https://github.com/go-graphite/go-carbon/pull/434

Maybe you want to see if this could fix your issue.

bom-d-van commented 2 years ago

The problem is these will just get created again if you delete the folder since there's a metric with no name getting sent to graphite

I think this might be the true cause.

Is your graphite-web configured to query the file system directly or configured to query carbonserver

Actually, no, our graphite-web talks to carbonzipper. I think this might explain why I don't see the empty dir.

bom-d-van commented 2 years ago

@cthiel42 maybe try the patch that I posted above, or consider changing your graphite-web to talk to carbonzipper instead, to avoid the issues.

cthiel42 commented 2 years ago

That patch seems to get rid of the issue in my test environment. I'm going to do some more testing and maybe try it on prd overnight if everything looks good. I'll let you know how I get along.

cthiel42 commented 2 years ago

@bom-d-van I'm running that patch in our production environment now and everything seems to work as expected. I'll let you know if anything else comes up. Thanks for your help.

jdblack commented 2 years ago

Have there been any updates on this lately? We have a cluster with 3 billion known metrics running on 2x50 r5.2xlarge, and we're seeing these systems OOM out.

deniszh commented 2 years ago

@jdblack : fix was done in https://github.com/go-graphite/go-carbon/pull/445, it was merged and released.

cthiel42 commented 2 years ago

@jdblack I never had any OOM errors. I only got context canceled errors due to timeouts, which I was able to solve by enabling the trie index. Basically, without the trie index it would try to build out an absurd number of metric paths and never finish.

This has been resolved and the patch has been merged so I'm going to close this issue.

bom-d-van commented 2 years ago

Have there been any updates on this lately? We have a cluster with 3 billion known metrics running on 2x50 r5.2xlarge, and we're seeing these systems OOM out.

@jdblack wow, 3 billion?! Are they all in the same instance, or is it a distributed cluster? Can you share your config and cluster layout?