motyla opened this issue 6 years ago
@Dieterbe, this looks the same as the old issue where opening the root namespace in the Grafana query editor crashes the nodes.
for an instance that crashes like this, have you run any specific index-related commands such as metric deletes? if not, it looks like there's some corrupt state in your index, specifically some child node is nil. looking at the code, it could be either the startNode or one of its children or grandchildren. enabling debug logging would help clarify, or getting an index dump may also reveal the problem.
we could also write a custom utility that reads your index, and analyzes the index around where the problem lies, and that should reveal the problem, but that would be custom development work.
Thanks @Dieterbe. No, we did not run any index-related commands, so it's likely we have a corrupted index. I think it's related to bad data ingestion caused by a misconfiguration at the source.
We will discuss it this week to decide how to handle it.
this looks like the same problem as #770. i have pushed some extra logging via #812; please run that code and share the error messages you get.
@Dieterbe , here is the log output (now in the right place):
2018/01/08 14:27:52 [D] HTTP Render querying metrictank003/index/find for 1:["*"]
2018/01/08 14:27:52 [D] memory-idx: found first pattern sequence at node * pos 0
2018/01/08 14:27:52 [D] memory-idx: starting search at the root node
2018/01/08 14:27:52 [D] memory-idx: found first pattern sequence at node * pos 0
2018/01/08 14:27:52 [D] memory-idx: searching 17 children of that match *
2018/01/08 14:27:52 [D] memory-idx: Matching all children
2018/01/08 14:27:52 [D] memory-idx: starting search at the root node
2018/01/08 14:27:52 [D] memory-idx: searching 17 children of that match *
2018/01/08 14:27:52 [D] memory-idx: Matching all children
2018/01/08 14:27:52 [D] memory-idx: reached pattern length. 17 nodes matched
2018/01/08 14:27:52 [D] memory-idx: reached pattern length. 17 nodes matched
2018/01/08 14:27:52 [D] memory-idx: orgId -1 has no metrics indexed.
2018/01/08 14:27:52 [D] memory-idx: orgId -1 has no metrics indexed.
2018/01/08 14:27:52 [D] memory-idx: 17 nodes matching pattern * found
2018/01/08 14:27:52 [D] memory-idx: 17 nodes matching pattern * found
2018/01/08 14:27:52 [D] memory-idx Find: adding to path host archive id=1.3688022064380aeae3228ecdd381adc7 name=host int=60 schemaId=28 aggId=0 lastSave=1515421173
2018/01/08 14:27:52 [D] memory-idx Find: adding to path host archive id=1.3688022064380aeae3228ecdd381adc7 name=host int=60 schemaId=28 aggId=0 lastSave=1515421173
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xa68c35]
goroutine 233965 [running]:
github.com/grafana/metrictank/idx/memory.(*MemoryIdx).Find(0xc420bb74a0, 0x1, 0xc81496f845, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/circleci/.go_workspace/src/github.com/grafana/metrictank/idx/memory/memory.go:822 +0x475
github.com/grafana/metrictank/api.(*Server).findSeriesLocal(0xc4202f0460, 0x1116280, 0xcb7267d900, 0x1, 0xc51208c300, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
/home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:133 +0x432
github.com/grafana/metrictank/api.(*Server).findSeries.func1(0xc4202f0460, 0x1116280, 0xcb7267d900, 0x1, 0xc51208c300, 0x1, 0x1, 0x0, 0xc51208c7a0, 0xcbc8fd0360, ...)
/home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:76 +0x9b
created by github.com/grafana/metrictank/api.(*Server).findSeries
/home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:75 +0x44d
that log output is not as helpful as what you posted in #812, because this one doesn't show any errors. but in that ticket we have the clue i needed:
Jan 9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [memory.go:934 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="megaraidsas-metrics"
most likely the problem is that those metrics are malformed, as mentioned in the other ticket: '.megaraidsas-metrics' starts with a dot.
2 things need to happen: 1) stop sending the badly formed ones, then delete the bad entries from the index and restart MT. let me know if you need further assistance. 2) we should not accept these metrics in the first place. how did you ingest them? via kafka-mdm or via carbon?
seems like we had this same issue before (#668) and fixed it via #694, but that was specifically for carbon, so please let me know whether you used carbon or kafka. if kafka, how do you put the metrics into kafka? where are these metrics being generated and/or transformed?
We push metrics both directly to Kafka and through carbon-relay-ng. While waiting for the patch from you, we started adding safeties where needed: in our code that writes directly to Kafka, and with blacklists in carbon-relay-ng.
From what I saw, bad metrics came from both directions.
This is my carbon-relay-ng blacklist:
'regex .*\.$',
'regex .*\.\..*',
'regex ^\..*',
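The three blacklist patterns above reject trailing dots, consecutive dots, and leading dots. A minimal Go sketch of the same check (illustrative only; `isBlacklisted` is a hypothetical helper, not carbon-relay-ng's actual implementation, though the regexes are the ones from the config above):

```go
package main

import (
	"fmt"
	"regexp"
)

// The three blacklist patterns from the carbon-relay-ng config above:
// trailing dot, consecutive dots, and leading dot.
var blacklist = []*regexp.Regexp{
	regexp.MustCompile(`.*\.$`),
	regexp.MustCompile(`.*\.\..*`),
	regexp.MustCompile(`^\..*`),
}

// isBlacklisted reports whether a metric name matches any of the
// blacklist patterns.
func isBlacklisted(name string) bool {
	for _, re := range blacklist {
		if re.MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"host.cpu.idle",        // clean name, passes
		".megaraidsas-metrics", // leading dot, rejected
		"foo..bar",             // consecutive dots, rejected
		"foo.bar.",             // trailing dot, rejected
	} {
		fmt.Printf("%-25q blacklisted=%v\n", name, isBlacklisted(name))
	}
}
```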
good news. with https://github.com/raintank/fakemetrics/pull/9 i was able to trivially reproduce.
sending a metric with name ".foo.bar" and trying to browse in grafana results in the same error:
metrictank_1 | [Macaron] 2018-08-21 13:16:25: Started GET /metrics/find?from=1534835724&query=*&until=1534857444 for 172.18.0.1, 172.18.0.1, 172.18.0.10
metrictank_1 | [Macaron] 2018-08-21 13:16:25: Started POST /index/find for 127.0.0.1
metrictank_1 | [Macaron] 2018-08-21 13:16:25: Completed /index/find 500 Internal Server Error in 445.8µs
metrictank_1 | 2018/08/21 13:16:25 [memory.go:1011 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="foo"
metrictank_1 | [Macaron] 2018-08-21 13:16:25: Completed /metrics/find?from=1534835724&query=*&until=1534857444 200 OK in 1.258231ms
metrictank_1 | 2018/08/21 13:16:25 [asm_amd64.s:2361 goexit()] [E] HTTP Render error querying default/index/find: "500 Internal Server Error"
our options are: 1) graciously allow: adjust the incoming data in MT accordingly. 2) invalidate in MT, but that also means invalidating in tsdb-gw. in carbon-relay-ng we could strip it when converting from carbon to metrics2.0, which would be ideal, but it's unrealistic since we can't expect everyone to upgrade their carbon-relay-ng. so MT will just have to adjust the data.
in this process i also found out that graphite not only allows 1 leading dot, it seems to allow an unlimited number and will just chomp them all off. it also reduces sequences of dots in the middle of metric keys to a single dot, so the previous fix for carbon (#694) will have to be improved.
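The graphite behaviour described above (chomp any number of leading dots, collapse dot runs inside the key) can be sketched in a few lines of Go. This is an illustrative sketch, not the actual metrictank or graphite implementation; note that splitting on dots and dropping empty segments also strips trailing dots as a side effect:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeName mimics the graphite behaviour described above: strip any
// number of leading dots and collapse runs of dots inside the key to a
// single dot. Splitting and dropping empty segments also removes trailing
// dots as a side effect.
func normalizeName(name string) string {
	parts := strings.Split(name, ".")
	kept := parts[:0]
	for _, p := range parts {
		if p != "" {
			kept = append(kept, p)
		}
	}
	return strings.Join(kept, ".")
}

func main() {
	fmt.Println(normalizeName("..foo.bar")) // foo.bar
	fmt.Println(normalizeName("a..b...c"))  // a.b.c
}
```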
creating a new ticket, #1008, to track this.
version: 0.7.4_633_gd5ca2bcb-1