Open WeiliangLuo opened 3 years ago
This is an excellent bug report, thank you!
For anyone else seeing this: A side-effect of the task queries hanging is that .tsm.tmp
files will gradually accumulate in your data directory, because they can't be cleaned up until the queries reading from them complete. The files will be cleaned after rebooting the server.
I believe the deadlock is actually coming from the Flux engine, even without the retention service (but the retention service also freezes up once a task hangs, because the TSM file is never un-ref'd).
Full reproducer info for the Flux team:
Metrics1h
w/ 1h RPMetrics1d
w/ 1d RPCreate a task with this query, scheduled to run every 1m
:
option task = {name: "MetricsDownsample1hTo1d", every: 1m}
option v = {timeRangeStart: -2m, timeRangeStop: -1m}
from(bucket: "Metrics1h")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) =>
(r["_measurement"] == "service_stats"))
|> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
|> yield(name: "sum")
|> to(bucket: "Metrics1d", org: "<your-org>")
Run this script to generate data in Metrics1h
:
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import time
bucket = "Metrics1h"
org = "<your-org>"
client = InfluxDBClient(url="http://localhost:8086", token="<your-token>", org=org)
write_api = client.write_api(write_options=SYNCHRONOUS)
while True:
now = int(time.time())
print(f'Writing points at {now}')
for i in range(100):
service = 'service_{}'.format(i)
p = Point('service_stats')
p.tag('service', service)
for idx in range (1000):
field = 'field_{}'.format(idx)
p = p.field(field, idx)
p.time(now, write_precision=WritePrecision.S)
try:
res = write_api.write(bucket=bucket, org=org, record=p)
except:
continue
time.sleep(1)
find ~/.influxdbv2/engine -name *.tmp
http://localhost:8086/debug/pprof/goroutine?debug=1
, and see goroutines stuck on:
5 @ 0x403f565 0x404f625 0x4c6ba85 0x53b5f4f 0x53bfa90 0x59cc8c5 0x59cc02d 0x53b5c34 0x53b7ab0 0x53b589c 0x4c860da 0x40755c1
# 0x4c6ba84 github.com/influxdata/flux/execute.(*result).Process+0xe4 /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/result.go:46
# 0x53b5f4e github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTable+0x1ae /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:113
# 0x53bfa8f github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables.func1+0x4f /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:77
# 0x59cc8c4 github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).handleRead+0x704 /Users/dan/influxdb/storage/flux/reader.go:812
# 0x59cc02c github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).Do+0x34c /Users/dan/influxdb/storage/flux/reader.go:665
# 0x53b5c33 github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables+0x93 /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:76
# 0x53b7aaf github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*readWindowAggregateSource).run+0x14f /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:310
# 0x53b589b github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).Run+0x33b /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:56
# 0x4c860d9 github.com/influxdata/flux/execute.(*executionState).do.func1+0x1d9 /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/executor.go:284
The goroutines are stuck on this select
block. The tables
channel the engine is trying to write results into is buffered, defined here.
@WeiliangLuo the Flux team has advised a few work-arounds:
map(fn: (r) => r)
before the call to to
yield
entirely, or put the yield
after the to
Steps to reproduce:
from(bucket: "Metrics1h") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => (r["_measurement"] == "service_stats")) |> aggregateWindow(every: 5m, fn: sum, createEmpty: false) |> yield(name: "sum") |> to(bucket: "Metrics1d", org: "Test")
ts=2021-02-07T14:01:43.058796Z lvl=info msg="Retention policy deletion check (start)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=start
goroutine 200 [semacquire, 2058 minutes]: sync.runtime_Semacquire(0xc00dc84710) /usr/local/go/src/runtime/sema.go:56 +0x45 sync.(WaitGroup).Wait(0xc00dc84708) /usr/local/go/src/sync/waitgroup.go:130 +0x65 github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(TSMReader).Close(0xc00dc84700, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:365 +0x4a github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(FileStore).Close(0xc00b9c2dd0, 0x407600, 0xc0061adeb8) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:615 +0xd0 github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(Engine).Close(0xc001f3a8c0, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:780 +0x9e github.com/influxdata/influxdb/v2/tsdb.(Shard).close(0xc003033680, 0xc001e326c0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/shard.go:378 +0x47 github.com/influxdata/influxdb/v2/tsdb.(Shard).Close(0xc003033680, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/shard.go:368 +0x7b github.com/influxdata/influxdb/v2/tsdb.(Store).DeleteShard(0xc00035b000, 0xc1, 0x0, 0x0) /go/src/github.com/influxdata/influxdb/tsdb/store.go:804 +0x325 github.com/influxdata/influxdb/v2/v1/services/retention.(Service).run(0xc001552af0, 0x3a46ae0, 0xc0052fbec0) /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:141 +0x209 github.com/influxdata/influxdb/v2/v1/services/retention.(Service).Open.func1(0xc001552af0, 0x3a46ae0, 0xc0052fbec0) /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:55 +0x6c created by github.com/influxdata/influxdb/v2/v1/services/retention.(Service).Open /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:53 +0x1ee
goroutine 4613564 [select, 1714 minutes]: github.com/influxdata/flux/execute.(result).Do(0xc0057492c0, 0x3724850, 0xc00daaf840, 0xc0057492c0) /go/pkg/mod/github.com/influxdata/flux@v0.99.0/execute/result.go:61 +0xd9 github.com/influxdata/influxdb/v2/query/control.(errorCollectingTableIterator).Do(0xc00daaf840, 0x3724850, 0xc00daaf840, 0xc00daaf800) /go/src/github.com/influxdata/influxdb/query/control/controller.go:947 +0x42 github.com/influxdata/influxdb/v2/task/backend/executor.exhaustResultIterators(0x3a212a0, 0xc00daaf800, 0xc00daaf800, 0xc0042ea9b0) /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:616 +0x51 github.com/influxdata/influxdb/v2/task/backend/executor.(worker).executeQuery(0xc019b3ae80, 0xc004dc4d80) /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:527 +0x57e github.com/influxdata/influxdb/v2/task/backend/executor.(worker).work(0xc019b3ae80) /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:421 +0x545 github.com/influxdata/influxdb/v2/task/backend/executor.(Executor).startWorker.func1(0xc0001581c0, 0xc019b3ae80) /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:291 +0x76 created by github.com/influxdata/influxdb/v2/task/backend/executor.(Executor).startWorker /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:288 +0xb6
ts=2021-02-09T00:26:17.108700Z lvl=info msg=Stopping log_id=0SAze8nW000 service=nats ts=2021-02-09T00:26:17.109767Z lvl=info msg=Stopping log_id=0SAze8nW000 service=bolt ts=2021-02-09T00:26:17.109808Z lvl=info msg=Stopping log_id=0SAze8nW000 service=query ts=2021-02-09T00:26:17.110168Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.121881Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.121922Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.122255Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.122349Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.122498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.128444Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.128477Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.128498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.129889Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc23cd0234000 error="database not open" ts=2021-02-09T00:26:17.130362Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8 ts=2021-02-09T00:26:17.148484Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc91aa0234000 error="database not open" ts=2021-02-09T00:26:17.148550Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcb6490234000 error="database not open" ts=2021-02-09T00:26:17.148657Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcdae80234000 error="database not open" ts=2021-02-09T00:26:17.148735Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcff870234000 error="database not open" ts=2021-02-09T00:26:17.148806Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd24260234000 error="database not open" ts=2021-02-09T00:26:17.148876Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd48c50234000 error="database not open" ts=2021-02-09T00:26:17.148958Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd6d640234000 error="database not open" ts=2021-02-09T00:26:17.149026Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd92030234000 error="database not open" ts=2021-02-09T00:26:17.149089Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bdb6a20234000 error="database not open" ts=2021-02-09T00:26:17.149156Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bddb410234000 error="database not open" ts=2021-02-09T00:26:17.149235Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b4fcbf0234000 error="database not open" ts=2021-02-09T00:26:17.149316Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bab5971234000 error="database not open" ts=2021-02-09T00:26:17.149399Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b28e300e34000 error="database not open" ts=2021-02-09T00:26:17.153145Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b822690234000 error="database not open" ts=2021-02-09T00:26:17.153281Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b5d8790234000 error="database not open" ts=2021-02-09T00:26:17.153373Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc6d0b0234000 error="database not open" ts=2021-02-09T00:26:17.153454Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bbda8f0234000 error="database not open" ts=2021-02-09T00:26:17.197384Z lvl=info msg="Deleted shard" log_id=0SAze8nW000 service=retention op_name=retention_delete_check db_instance=958a746c4d6ae0d4 db_shard_id=193 db_rp=autogen ts=2021-02-09T00:26:17.197621Z lvl=info msg="Retention policy deletion check (end)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=end op_elapsed=123874138.839ms