influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

panic - fatal error: concurrent map read and map write #8134

Closed · mattheworiordan closed this issue 7 years ago

mattheworiordan commented 7 years ago

Bug report

System info: InfluxDB 1.1.1 on Ubuntu 14.04, run in a Docker container (Docker v1.10.3)

Steps to reproduce:

Run a DELETE query on a measurement with very high cardinality. As there is still very little tooling for fixing high-cardinality problems, I am forced to try to delete the data that is causing the problem.

Note this appears to be a similar issue to https://github.com/influxdata/influxdb/issues/6720, however that issue was closed and marked as fixed.

Expected behavior:

No panic!

Actual behavior:

Panic.

Additional info:

[query] 2017/03/14 19:17:47 DELETE FROM client_visible_error_rate WHERE scope_type = 'conn'
fatal error: concurrent map read and map write

goroutine 736226 [running]:
runtime.throw(0xa8ff69, 0x21)
        /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc704e63958 sp=0xc704e63938
runtime.mapaccess1_fast64(0x9cc040, 0xc422d4d440, 0x17f0f6, 0xc5781848e8)
        /usr/local/go/src/runtime/hashmap_fast.go:111 +0x1a7 fp=0xc704e63980 sp=0xc704e63958
github.com/influxdata/influxdb/tsdb.(*Store).DeleteSeries(0xc4201cac00, 0xca40053cca, 0xa, 0xc731730310, 0x1, 0x1, 0xde1700, 0xcaf4aeccc0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/store.go:726 +0x297 fp=0xc704e63a88 sp=0xc704e63980
github.com/influxdata/influxdb/coordinator.LocalTSDBStore.DeleteSeries(0xc4201cac00, 0xca40053cca, 0xa, 0xc731730310, 0x1, 0x1, 0xde1700, 0xcaf4aeccc0, 0xc420269310, 0xc5161ed040)
        <autogenerated>:99 +0x97 fp=0xc704e63ae8 sp=0xc704e63a88
github.com/influxdata/influxdb/coordinator.(*StatementExecutor).executeDeleteSeriesStatement(0xc42026a660, 0xcaf4aec9c0, 0xca40053cca, 0xa, 0xd00, 0x5c)
        /root/go/src/github.com/influxdata/influxdb/coordinator/statement_executor.go:305 +0x17b fp=0xc704e63b68 sp=0xc704e63ae8
github.com/influxdata/influxdb/coordinator.(*StatementExecutor).ExecuteStatement(0xc42026a660, 0xde4a80, 0xcaf4aec9c0, 0x0, 0x460, 0xc87ecf4420, 0xc87ecf4300, 0xc420269310, 0xc87ecf4360, 0xca40053cca, ...)
        /root/go/src/github.com/influxdata/influxdb/coordinator/statement_executor.go:89 +0x196a fp=0xc704e63c50 sp=0xc704e63b68
github.com/influxdata/influxdb/influxql.(*QueryExecutor).executeQuery(0xc420256ea0, 0xcd4e9527c0, 0xca40053cca, 0xa, 0x2710, 0x1, 0x0, 0x0, 0xc87ecf4240, 0xc87ecf41e0, ...)
        /root/go/src/github.com/influxdata/influxdb/influxql/query_executor.go:314 +0x575 fp=0xc704e63f48 sp=0xc704e63c50
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc704e63f50 sp=0xc704e63f48
created by github.com/influxdata/influxdb/influxql.(*QueryExecutor).ExecuteQuery
        /root/go/src/github.com/influxdata/influxdb/influxql/query_executor.go:209 +0xb5

....

More detailed log at https://gist.github.com/mattheworiordan/434b7ec61264a075ff178ef50ab91d91
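
For readers unfamiliar with this error: "fatal error: concurrent map read and map write" is the Go runtime's built-in detector for unsynchronized map access. It is raised via runtime.throw, as in the trace above, so it aborts the whole process and cannot be caught with recover(). The following minimal standalone program, unrelated to InfluxDB, normally reproduces the same message:

package main

import "time"

// One goroutine writes a plain map while another reads it, with no locking.
// Run this a few times and the process will normally abort with
// "fatal error: concurrent map read and map write".
func main() {
    m := map[int]int{}

    go func() {
        for i := 0; ; i++ {
            m[i%1024] = i // unsynchronized write
        }
    }()

    go func() {
        for i := 0; ; i++ {
            _ = m[i%1024] // unsynchronized read
        }
    }()

    // Give the two goroutines time to collide; the runtime usually
    // detects the race and aborts well before this sleep finishes.
    time.Sleep(2 * time.Second)
}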

jwilder commented 7 years ago

Looks like this might still exist in 1.2. The equivalent line from 1.1 is https://github.com/influxdata/influxdb/blob/master/tsdb/store.go#L740: the m.seriesByID map is being accessed outside of the lock.
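
Below is a minimal sketch of the pattern being described and of the usual shape of the fix. The seriesByID field name is taken from this comment; the store type and method names are illustrative only, not the actual tsdb/store.go code:

package main

import (
    "fmt"
    "sync"
)

// series stands in for the per-series metadata the store tracks.
type series struct{ key string }

// store mirrors the shape being discussed: a map meant to be guarded by mu.
type store struct {
    mu         sync.RWMutex
    seriesByID map[uint64]*series
}

// seriesUnsafe reads the map without holding mu. If another goroutine is
// writing the map at the same time, the runtime kills the process with
// "fatal error: concurrent map read and map write".
func (s *store) seriesUnsafe(id uint64) *series {
    return s.seriesByID[id]
}

// seriesSafe holds the read lock for the duration of the map access,
// which is the general shape of the fix for this class of crash.
func (s *store) seriesSafe(id uint64) *series {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.seriesByID[id]
}

func main() {
    s := &store{seriesByID: map[uint64]*series{1: {key: "cpu,host=a"}}}
    fmt.Println(s.seriesSafe(1).key)
}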

jwilder commented 7 years ago

@mattheworiordan For your high cardinality issue, you can run influx_inspect report -detailed /data/path/to/shard to get some cardinality estimates on the series, tags and values in that shard.

That might help you track down what series or tags are problematic.

jeremytho commented 7 years ago

I encountered this issue in Influx 1.2 as well. For what it's worth as a testing point, it seems to occur only when I try to view metrics in Grafana while they are being written.

Compiled using Go 1.7.3. @jwilder I have a fairly easy-to-test, reproducible scenario; is there any information I can provide to help you track this down? (The caveat is that it is on a private server and I cannot provide direct access, but I will gather any logs or Influx info you may need.)

Edit: To update this item, I have ruled out Grafana. Instead, it looks to be a certain query from a piece of software we have written using the influxdb-java library. I will investigate and try to isolate the specific query. Grafana interacts with it quite well as long as our code is not running at the same time.

claytondukes commented 7 years ago

Hi, we are seeing this exact issue as well.

[httpd] 127.0.0.1 - lz5 [10/Jul/2017:19:19:03 +0000] "GET /query?db=lz5aggregates&q=DELETE+FROM+%22aggregates%22+WHERE+time+%3C+1498996800000000000 HTTP/1.1" 200 62 "-" "python-requests/2.12.1" a2d7be3f-65a4-11e7-8477-000000000000 8463166
[I] 2017-07-10T19:19:12Z DELETE FROM aggregates WHERE time < 1499000400000000000 service=query
fatal error: concurrent map read and map write

goroutine 3052047 [running]:
runtime.throw(0xace810, 0x21)
        /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc42004b970 sp=0xc42004b950
runtime.mapaccess1_fast64(0xa06a20, 0xc4211081e0, 0x67b4, 0xc435fd8b18)
        /usr/local/go/src/runtime/hashmap_fast.go:111 +0x1a7 fp=0xc42004b998 sp=0xc42004b970
github.com/influxdata/influxdb/tsdb.(*Store).DeleteSeries(0xc42000bc00, 0xc434b05010, 0xd, 0xc42936a860, 0x1, 0x1, 0xe37ba0, 0xc434c78c60, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/store.go:740 +0x297 fp=0xc42004baa0 sp=0xc42004b998
github.com/influxdata/influxdb/coordinator.LocalTSDBStore.DeleteSeries(0xc42000bc00, 0xc434b05010, 0xd, 0xc42936a860, 0x1, 0x1, 0xe37ba0, 0xc434c78c60, 0x0, 0x0)
        <autogenerated>:107 +0x97 fp=0xc42004bb00 sp=0xc42004baa0
github.com/influxdata/influxdb/coordinator.(*StatementExecutor).executeDeleteSeriesStatement(0xc4201d82a0, 0xc42a3c0bd0, 0xc434b05010, 0xd, 0xe59940, 0x0)
        /root/go/src/github.com/influxdata/influxdb/coordinator/statement_executor.go:317 +0x17b fp=0xc42004bb80 sp=0xc42004bb00
github.com/influxdata/influxdb/coordinator.(*StatementExecutor).ExecuteStatement(0xc4201d82a0, 0xe3b0a0, 0xc42a3c0bd0, 0x0, 0x2205c, 0xc425ac4cc0, 0xc436ff8ea0, 0xe3eb40, 0xc4201ec900, 0xc425ac4ba0, ...)
        /root/go/src/github.com/influxdata/influxdb/coordinator/statement_executor.go:95 +0x1960 fp=0xc42004bc70 sp=0xc42004bb80
github.com/influxdata/influxdb/influxql.(*QueryExecutor).executeQuery(0xc4201c1200, 0xc457acba80, 0xc434b05010, 0xd, 0x2710, 0x1, 0x0, 0x0, 0xc436ff8de0, 0xc436ff8d80, ...)
        /root/go/src/github.com/influxdata/influxdb/influxql/query_executor.go:312 +0x55d fp=0xc42004bf58 sp=0xc42004bc70
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42004bf60 sp=0xc42004bf58
created by github.com/influxdata/influxdb/influxql.(*QueryExecutor).ExecuteQuery
        /root/go/src/github.com/influxdata/influxdb/influxql/query_executor.go:207 +0xb5

goroutine 1 [chan receive, 4158 minutes]:
main.(*Main).Run(0xc420052bc0, 0xc42000c0b0, 0x4, 0x4, 0x0, 0xed0f1fb6f)
        /root/go/src/github.com/influxdata/influxdb/cmd/influxd/main.go:98 +0x399
main.main()
        /root/go/src/github.com/influxdata/influxdb/cmd/influxd/main.go:46 +0xb2

goroutine 17 [syscall, 4158 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 5 [syscall, 4158 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/go/src/runtime/sigqueue.go:116 +0x157
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
        /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 7 [IO wait, 4158 minutes]:
net.runtime_pollWait(0x7f08bf4209e8, 0x72, 0x0)
        /usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc4201d8370, 0x72, 0xc420024de8, 0xc420010078)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc4201d8370, 0xe33720, 0xc420010078)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
jwilder commented 7 years ago

This should be fixed in the current nightlies/1.3 via 4aa78383d11dc327dc48f70a29a81ae352d5b546