influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Influxdb 2.0 Retention service hang trying to delete shard #20729

Open WeiliangLuo opened 3 years ago

WeiliangLuo commented 3 years ago

Steps to reproduce:

  1. Create two buckets:
    • Metrics1h: 1 hour retention policy
    • Metrics1d: 1 day retention policy
  2. Run a Python script to generate metrics: each point contains 1000 fields, and 100 points (one per service tag value) are written every 10 seconds.
        for i in range(100):
            service = 'service_{}'.format(i)
            p = influxdb_client.Point('service_stats')
            p.tag('service', service)
            for idx in range(1000):
                field = 'field_{}'.format(idx)
                p = p.field(field, idx)
            p.time(now, write_precision=WritePrecision.S)
            res = write_api.write(bucket=bucket, org=org, record=p)
  3. Add a task to do downsampling from Metrics1h to Metrics1d.
    • Not sure if this matters, but it is included here to present the full picture.
      
      option v = {timeRangeStart: -20m, timeRangeStop: -10m}
      option task = {name: "MetricsDownsample1hTo1d", every: 10m}

      from(bucket: "Metrics1h")
          |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
          |> filter(fn: (r) => (r["_measurement"] == "service_stats"))
          |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
          |> yield(name: "sum")
          |> to(bucket: "Metrics1d", org: "Test")


__Expected behavior:__
The retention service runs periodically, and each check starts and finishes quickly.

__Actual behavior:__

ts=2021-02-07T14:01:43.058796Z lvl=info msg="Retention policy deletion check (start)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=start

I see the start log, but the check never finishes; the corresponding end log never appears.

From */debug/pprof/goroutine?debug=2*:

goroutine 200 [semacquire, 2058 minutes]:
sync.runtime_Semacquire(0xc00dc84710)
        /usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00dc84708)
        /usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*TSMReader).Close(0xc00dc84700, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:365 +0x4a
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*FileStore).Close(0xc00b9c2dd0, 0x407600, 0xc0061adeb8)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:615 +0xd0
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*Engine).Close(0xc001f3a8c0, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:780 +0x9e
github.com/influxdata/influxdb/v2/tsdb.(*Shard).close(0xc003033680, 0xc001e326c0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/shard.go:378 +0x47
github.com/influxdata/influxdb/v2/tsdb.(*Shard).Close(0xc003033680, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/shard.go:368 +0x7b
github.com/influxdata/influxdb/v2/tsdb.(*Store).DeleteShard(0xc00035b000, 0xc1, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/store.go:804 +0x325
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).run(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:141 +0x209
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open.func1(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:55 +0x6c
created by github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:53 +0x1ee

The shard delete is stuck waiting for an outstanding reference on the TSMReader object to be released.
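
To illustrate the mechanism with a simplified, hypothetical sketch (not the actual influxdb code): in this ref/unref pattern, Close waits on a sync.WaitGroup until every reader has released its reference, so a query that never finishes keeps the shard delete blocked indefinitely:

    // A simplified sketch (not the actual influxdb code) of a ref-counted
    // close: readers take a reference before use and release it when done,
    // and Close waits until every reference is gone.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type refCloser struct {
        refs sync.WaitGroup
    }

    func (r *refCloser) Ref()   { r.refs.Add(1) }
    func (r *refCloser) Unref() { r.refs.Done() }

    // Close blocks until all outstanding references are released, which is
    // the shape of the hang in the retention service stack trace above.
    func (r *refCloser) Close() {
        r.refs.Wait()
    }

    func main() {
        var rc refCloser

        rc.Ref() // a long-running query takes a reference and never releases it

        done := make(chan struct{})
        go func() {
            rc.Close() // the retention service tries to delete the shard
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("closed")
        case <-time.After(2 * time.Second):
            fmt.Println("Close is still blocked on the outstanding reference")
        }
    }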

I am **guessing** that the downsampling task may be contributing to it. 

goroutine 4613564 [select, 1714 minutes]:
github.com/influxdata/flux/execute.(*result).Do(0xc0057492c0, 0x3724850, 0xc00daaf840, 0xc0057492c0)
        /go/pkg/mod/github.com/influxdata/flux@v0.99.0/execute/result.go:61 +0xd9
github.com/influxdata/influxdb/v2/query/control.(*errorCollectingTableIterator).Do(0xc00daaf840, 0x3724850, 0xc00daaf840, 0xc00daaf800)
        /go/src/github.com/influxdata/influxdb/query/control/controller.go:947 +0x42
github.com/influxdata/influxdb/v2/task/backend/executor.exhaustResultIterators(0x3a212a0, 0xc00daaf800, 0xc00daaf800, 0xc0042ea9b0)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:616 +0x51
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).executeQuery(0xc019b3ae80, 0xc004dc4d80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:527 +0x57e
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).work(0xc019b3ae80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:421 +0x545
github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker.func1(0xc0001581c0, 0xc019b3ae80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:291 +0x76
created by github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:288 +0xb6
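
For reference, both dumps above come from the /debug/pprof/goroutine?debug=2 endpoint; a minimal sketch for saving such a dump from a running instance (assuming the default http://localhost:8086 address used elsewhere in this thread):

    // A minimal sketch for saving a goroutine dump from a running server.
    // Assumes http://localhost:8086; if your setup requires authentication,
    // add the appropriate Authorization header before sending the request.
    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // debug=2 returns full goroutine stacks; debug=1 returns the
        // aggregated form shown later in this thread.
        resp, err := http.Get("http://localhost:8086/debug/pprof/goroutine?debug=2")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        out, err := os.Create("goroutines.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        if _, err := io.Copy(out, resp.Body); err != nil {
            log.Fatal(err)
        }
    }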

On Ctrl-C, I see a bunch of piled-up task runs fail, and shortly after that the retention service run was unblocked.

ts=2021-02-09T00:26:17.108700Z lvl=info msg=Stopping log_id=0SAze8nW000 service=nats
ts=2021-02-09T00:26:17.109767Z lvl=info msg=Stopping log_id=0SAze8nW000 service=bolt
ts=2021-02-09T00:26:17.109808Z lvl=info msg=Stopping log_id=0SAze8nW000 service=query
ts=2021-02-09T00:26:17.110168Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121881Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121922Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122255Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122349Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128444Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128477Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.129889Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc23cd0234000 error="database not open"
ts=2021-02-09T00:26:17.130362Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.148484Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc91aa0234000 error="database not open"
ts=2021-02-09T00:26:17.148550Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcb6490234000 error="database not open"
ts=2021-02-09T00:26:17.148657Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcdae80234000 error="database not open"
ts=2021-02-09T00:26:17.148735Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcff870234000 error="database not open"
ts=2021-02-09T00:26:17.148806Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd24260234000 error="database not open"
ts=2021-02-09T00:26:17.148876Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd48c50234000 error="database not open"
ts=2021-02-09T00:26:17.148958Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd6d640234000 error="database not open"
ts=2021-02-09T00:26:17.149026Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd92030234000 error="database not open"
ts=2021-02-09T00:26:17.149089Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bdb6a20234000 error="database not open"
ts=2021-02-09T00:26:17.149156Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bddb410234000 error="database not open"
ts=2021-02-09T00:26:17.149235Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b4fcbf0234000 error="database not open"
ts=2021-02-09T00:26:17.149316Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bab5971234000 error="database not open"
ts=2021-02-09T00:26:17.149399Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b28e300e34000 error="database not open"
ts=2021-02-09T00:26:17.153145Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b822690234000 error="database not open"
ts=2021-02-09T00:26:17.153281Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b5d8790234000 error="database not open"
ts=2021-02-09T00:26:17.153373Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc6d0b0234000 error="database not open"
ts=2021-02-09T00:26:17.153454Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bbda8f0234000 error="database not open"
ts=2021-02-09T00:26:17.197384Z lvl=info msg="Deleted shard" log_id=0SAze8nW000 service=retention op_name=retention_delete_check db_instance=958a746c4d6ae0d4 db_shard_id=193 db_rp=autogen
ts=2021-02-09T00:26:17.197621Z lvl=info msg="Retention policy deletion check (end)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=end op_elapsed=123874138.839ms



__Environment info:__
Linux 4.15.0-55-CUSTOMIZED x86_64
InfluxDB 2.0.3 (git: fe04d346df) build_date: 2020-12-15T01:00:16Z
Running inside podman container

__Config:__
N/A
danxmoran commented 3 years ago

This is an excellent bug report, thank you!

danxmoran commented 3 years ago

For anyone else seeing this: a side effect of the hanging task queries is that .tsm.tmp files will gradually accumulate in your data directory, because they can't be cleaned up until the queries reading from them complete. The files are cleaned up after the server is restarted.
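
A quick way to watch for that build-up is to walk the engine directory and list lingering .tmp files; a small sketch, assuming the default ~/.influxdbv2/engine path used in the find command later in this thread:

    // A sketch that walks the engine directory and lists lingering .tmp files.
    // Assumes the default ~/.influxdbv2/engine data directory; adjust root if
    // your setup differs.
    package main

    import (
        "fmt"
        "io/fs"
        "log"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        home, err := os.UserHomeDir()
        if err != nil {
            log.Fatal(err)
        }
        root := filepath.Join(home, ".influxdbv2", "engine")

        var tmpFiles []string
        err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
            if err != nil {
                return err
            }
            if !d.IsDir() && strings.HasSuffix(path, ".tmp") {
                tmpFiles = append(tmpFiles, path)
            }
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("%d lingering .tmp files\n", len(tmpFiles))
        for _, f := range tmpFiles {
            fmt.Println(f)
        }
    }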

danxmoran commented 3 years ago

I believe the deadlock is actually coming from the Flux engine, even without the retention service (but the retention service also freezes up once a task hangs, because the TSM file is never un-ref'd).

Full reproducer info for the Flux team:

  1. Create the buckets listed above
    • Metrics1h w/ 1h RP
    • Metrics1d w/ 1d RP
  2. Create a task with this query, scheduled to run every 1m:

    option task = {name: "MetricsDownsample1hTo1d", every: 1m}
    option v = {timeRangeStart: -2m, timeRangeStop: -1m}
    
    from(bucket: "Metrics1h")
        |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
        |> filter(fn: (r) =>
            (r["_measurement"] == "service_stats"))
        |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
        |> yield(name: "sum")
        |> to(bucket: "Metrics1d", org: "<your-org>")
  3. Run this script to generate data in Metrics1h:

    from influxdb_client import InfluxDBClient, Point, WritePrecision
    from influxdb_client.client.write_api import SYNCHRONOUS
    
    import time
    
    bucket = "Metrics1h"
    org = "<your-org>"
    
    client = InfluxDBClient(url="http://localhost:8086", token="<your-token>", org=org)
    
    write_api = client.write_api(write_options=SYNCHRONOUS)
    
    while True:
        now = int(time.time())
        print(f'Writing points at {now}')
        for i in range(100):
            service = 'service_{}'.format(i)
            p = Point('service_stats')
            p.tag('service', service)
        for idx in range(1000):
                field = 'field_{}'.format(idx)
                p = p.field(field, idx)
            p.time(now, write_precision=WritePrecision.S)
            try:
                res = write_api.write(bucket=bucket, org=org, record=p)
        except Exception:
            # Keep writing even if an individual write fails or times out.
            continue
        time.sleep(1)
  4. Wait until you see this command print some files (the signal that a query has locked up):
    find ~/.influxdbv2/engine -name '*.tmp'
  5. Check http://localhost:8086/debug/pprof/goroutine?debug=1, and see goroutines stuck on:
    5 @ 0x403f565 0x404f625 0x4c6ba85 0x53b5f4f 0x53bfa90 0x59cc8c5 0x59cc02d 0x53b5c34 0x53b7ab0 0x53b589c 0x4c860da 0x40755c1
    #   0x4c6ba84   github.com/influxdata/flux/execute.(*result).Process+0xe4                           /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/result.go:46
    #   0x53b5f4e   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTable+0x1ae         /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:113
    #   0x53bfa8f   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables.func1+0x4f       /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:77
    #   0x59cc8c4   github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).handleRead+0x704          /Users/dan/influxdb/storage/flux/reader.go:812
    #   0x59cc02c   github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).Do+0x34c              /Users/dan/influxdb/storage/flux/reader.go:665
    #   0x53b5c33   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables+0x93         /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:76
    #   0x53b7aaf   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*readWindowAggregateSource).run+0x14f   /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:310
    #   0x53b589b   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).Run+0x33b              /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:56
    #   0x4c860d9   github.com/influxdata/flux/execute.(*executionState).do.func1+0x1d9                     /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/executor.go:284

The goroutines are stuck on this select block. The tables channel the engine is trying to write results into is buffered, defined here.
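
To illustrate the shape of that hang with a simplified, hypothetical sketch (not the actual reader code): a producer sending into a buffered channel from inside a select with no default case parks once the buffer fills and the consumer stops draining it, and only cancellation releases it, which matches the Ctrl-C behavior observed above:

    // A simplified sketch (not the actual reader code) of a producer stuck in
    // a select while sending into a full buffered channel whose consumer has
    // stopped draining it.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // Stand-in for the buffered tables channel the storage reader writes
        // result tables into.
        tables := make(chan int, 2)

        released := make(chan struct{})
        go func() {
            for i := 0; ; i++ {
                select {
                case tables <- i:
                    // Delivered a table downstream.
                case <-ctx.Done():
                    close(released)
                    return
                }
            }
        }()

        // The downstream consumer never reads from tables, so once the buffer
        // fills the producer parks in the select above, like the goroutines in
        // the dump.
        time.Sleep(500 * time.Millisecond)
        fmt.Println("producer is blocked; buffered tables:", len(tables))

        cancel() // cancelling the query (e.g. on shutdown) unblocks the producer
        <-released
        fmt.Println("producer released after cancellation")
    }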

danxmoran commented 3 years ago

@WeiliangLuo the Flux team has advised a few work-arounds: