Closed tehlers320 closed 7 years ago
Experiencing the same issue while testing the limits of metrictank. Not sure if it's due to corruption, but in my case each tank instance (6 total in our cluster) has ~3.5 million metrics in the index.
Only the master dies in my case.
Ended up adding a nil check right before the offending line and rolling it out to my cluster. That seems to catch the symptom, but not the problem.
In my case, the problem came from a metric with a leading `.` (e.g. `.some.metric.name`)
Issue: https://github.com/raintank/metrictank/blob/master/idx/memory/memory.go#L401
The code here assumes that node 0 doesn't have a `.` prefix, but in the above example `some` is apparently node 0, then line 404 looks up `some` rather than `.some` and just adds that in without a check.
Idk if there are other scenarios where this can happen, but changing that block to do the lookup and continue if it doesn't find the node (maybe with a warning) seems like a reasonable workaround.
@shanson7 and @tehlers320 can you guys try out https://github.com/raintank/metrictank/pull/694? It seems to fix it for me. Note: it only works with carbon input for now. my suggestion:
It seems like #694 won't prevent crashing from already indexed data. Let me spin it up and test locally.
that's correct. is wiping index and starting over an option for you? otherwise we need to think of ways to upgrade/clean up the live index.
@Dieterbe This is how i resolved this issue for myself. I just wiped the index in cassandra.
I can wipe the index. It should all just come back anyhow (minus the invalid metrics)
I tried to reproduce this with the docker-cluster and a simple bash script but i'm not seeing the issue re-appear.
#!/bin/bash
PORT=2300
echo "foo.metric.control 1 `date +%s`" | nc localhost $PORT
echo "foo.metric.badvalue1 123df `date +%s`" | nc localhost $PORT
echo "foo.metric.badtimestamp 1 123df" | nc localhost $PORT
echo "foo.metric.nil nil `date +%s`" | nc localhost $PORT
echo "foo..metric.badtree1 1 `date +%s`" | nc localhost $PORT
echo ".foo.metric.badtree2 1 `date +%s`" | nc localhost $PORT
curl -s -H "X-Org-Id: 1" "http://localhost:6060/metrics/index.json" | python -m json.tool
With 0.7.3-63-g159320c
./testme.sh
[
".foo.metric.badtree2",
"foo..metric.badtree1",
"foo.metric.control"
]
With the proposed patch on my own docker build: 0.7.3-65-g73cd8c5
[
".foo.metric.badtree2",
"foo..metric.badtree1",
"foo.metric.control"
]
I'm sorry, I must be missing something.
I figured this out while fiddling with the whisper-writer. I was checking to see if it was writing to the index in a unique... way... i changed the partitions to get a new table entry thinking that i was clever.
By having 2 entries in the index with different partitions but on the same key MT crashes as mentioned in this ticket.
Here is what my table looks like after adding everything to partition 0 on an import:
select * from metrictank.metric_idx where id = '1.fcc5e1772ffe25a18dc412e5a06afd43' ALLOW FILTERING;
partition | id | interval | lastupdate | metric | mtype | name | orgid | tags | unit
-----------+------------------------------------+----------+------------+------------------------------------+-------+------------------------------------+-------+------+---------
0 | 1.fcc5e1772ffe25a18dc412e5a06afd43 | 10 | 1501747200 | statsd.ip-10-1-1-1.b.numStats | gauge | statsd.ip-10-1-1-1.b.numStats | 1 | null | unknown
44 | 1.fcc5e1772ffe25a18dc412e5a06afd43 | 10 | 1501747200 | statsd.ip-10-1-1-1.b.numStats | gauge | statsd.ip-10-1-1-1.b.numStats | 1 | null | unknown
However, a crash did not occur. Here is my error output. Did you implement something to recover from a crash, or is this a separate bug?
14:43:40 [Macaron] 2017-08-03 14:43:40: Started POST /index/find for 192.168.240.31
14:43:40 [Macaron] PANIC: runtime error: invalid memory address or nil pointer dereference
/usr/local/go/src/runtime/panic.go:489 (0x42a57f)
/usr/local/go/src/runtime/panic.go:63 (0x42942e)
/usr/local/go/src/runtime/signal_unix.go:290 (0x43f8ff)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/idx/memory/memory.go:291 (0x9cdfb9)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/api/cluster.go:64 (0x9b7728)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/api/routes.go:33 (0x9ca3cb)
/usr/local/go/src/runtime/asm_amd64.s:515 (0x457108)
/usr/local/go/src/reflect/value.go:434 (0x4a33ff)
/usr/local/go/src/reflect/value.go:302 (0x4a29c4)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0x93304f)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0x932a1a)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:113 (0x9504a2)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:104 (0x9503c6)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/api/middleware/stats.go:72 (0x96e9b7)
/usr/local/go/src/runtime/asm_amd64.s:514 (0x457088)
/usr/local/go/src/reflect/value.go:434 (0x4a33ff)
/usr/local/go/src/reflect/value.go:302 (0x4a29c4)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0x93304f)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0x932a1a)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:113 (0x9504a2)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:104 (0x9503c6)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/raintank/gziper/gzip.go:73 (0x9b4fcd)
/usr/local/go/src/runtime/asm_amd64.s:514 (0x457088)
/usr/local/go/src/reflect/value.go:434 (0x4a33ff)
/usr/local/go/src/reflect/value.go:302 (0x4a29c4)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0x93304f)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0x932a1a)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:113 (0x9504a2)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:104 (0x9503c6)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/recovery.go:161 (0x961dcb)
/usr/local/go/src/runtime/asm_amd64.s:514 (0x457088)
/usr/local/go/src/reflect/value.go:434 (0x4a33ff)
/usr/local/go/src/reflect/value.go:302 (0x4a29c4)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0x93304f)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0x932a1a)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:113 (0x9504a2)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:104 (0x9503c6)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/logger.go:43 (0x961084)
/usr/local/go/src/runtime/asm_amd64.s:514 (0x457088)
/usr/local/go/src/reflect/value.go:434 (0x4a33ff)
/usr/local/go/src/reflect/value.go:302 (0x4a29c4)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0x93304f)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0x932a1a)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/context.go:113 (0x9504a2)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/router.go:184 (0x963099)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/router.go:286 (0x95c53d)
/home/ubuntu/.go_workspace/src/github.com/raintank/metrictank/vendor/gopkg.in/macaron.v1/macaron.go:177 (0x95490c)
/usr/local/go/src/net/http/server.go:2568 (0x6db362)
/usr/local/go/src/net/http/server.go:1825 (0x6d7732)
/usr/local/go/src/runtime/asm_amd64.s:2197 (0x4597d1)
Ok, I tried this patch out and it works as described.
echo ".bad.test.metric.whatevs 4 `date +%s`" | nc -c localhost 2003
Ends up as bad.test.metric.whatevs and nothing crashes.
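The behaviour observed here (the leading dot is dropped before the name reaches the index) could look roughly like the sketch below. `normalizeName` is a hypothetical helper, not the code from the PR, and it only handles leading dots; what the patch does with empty nodes in the middle of a name (e.g. `foo..metric`) is not stated in this thread, so that case is left alone.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// normalizeName strips leading dots from a metric name and rejects names
// that are empty afterwards. Hypothetical helper for illustration only.
func normalizeName(name string) (string, error) {
	clean := strings.TrimLeft(name, ".")
	if clean == "" {
		return "", errors.New("metric name is empty after stripping leading dots")
	}
	return clean, nil
}

func main() {
	clean, err := normalizeName(".bad.test.metric.whatevs")
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(clean)
}
```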
👍
(Sorry for the delay)
@tehlers320 don't modify the index like that. an index entry should only live in 1 partition at a time.
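To check an existing index for violations of that rule, something along these lines could be run over the rows of `metrictank.metric_idx`. This is a sketch under assumptions: the `Row` type and `duplicatePartitions` function are hypothetical, and the rows are hard-coded here rather than read from Cassandra.

```go
package main

import (
	"fmt"
	"sort"
)

// Row holds the two columns that matter for this check (assumed shape).
type Row struct {
	Partition int
	ID        string
}

// duplicatePartitions returns, for every id that appears in more than one
// partition, the sorted list of partitions it was found in.
func duplicatePartitions(rows []Row) map[string][]int {
	seen := make(map[string]map[int]struct{})
	for _, r := range rows {
		if seen[r.ID] == nil {
			seen[r.ID] = make(map[int]struct{})
		}
		seen[r.ID][r.Partition] = struct{}{}
	}
	dups := make(map[string][]int)
	for id, parts := range seen {
		if len(parts) < 2 {
			continue
		}
		ps := make([]int, 0, len(parts))
		for p := range parts {
			ps = append(ps, p)
		}
		sort.Ints(ps)
		dups[id] = ps
	}
	return dups
}

func main() {
	// The two rows from the table earlier in this thread.
	rows := []Row{
		{Partition: 0, ID: "1.fcc5e1772ffe25a18dc412e5a06afd43"},
		{Partition: 44, ID: "1.fcc5e1772ffe25a18dc412e5a06afd43"},
	}
	for id, parts := range duplicatePartitions(rows) {
		fmt.Printf("%s lives in partitions %v\n", id, parts)
	}
}
```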
thanks for testing guys. fix is now merged into master.
Version:0.7.2-12-gf9f4389 Query performed:
curl metrictank.test.monitoring.internal.com/metrics/find?query=*
Note... if this is not what builds the base tree, the python UI is also failing. Grafana can grab index entries excluding the base entry; you can however fill it out manually and the 2nd/3rd/4th/5th levels work. All 16 masters and all 16 slaves crash in this example, and it is repeatable, but only once the restarted masters are ready.