influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Influxdb 2.0 Retention service hang trying to delete shard #20729

Open WeiliangLuo opened 3 years ago

WeiliangLuo commented 3 years ago

Steps to reproduce:

  1. Create two buckets:
    • Metrics1h: 1 hour retention policy
    • Metrics1d: 1 day retention policy
  2. Run a Python script to generate metrics: each point contains 1000 fields, and 100 points (one per service tag value) are written every 10 seconds.
        for i in range(100):
            service = 'service_{}'.format(i)
            p = influxdb_client.Point('service_stats')
            p.tag('service', service)
            for idx in range(1000):
                field = 'field_{}'.format(idx)
                p = p.field(field, idx)
            p.time(now, write_precision=WritePrecision.S)
            res = write_api.write(bucket=bucket, org=org, record=p)
  3. Add a task to do downsampling from Metrics1h to Metrics1d.
    • Not sure if this matters, but it is included here to present the full picture.
      
      option v = {timeRangeStart: -20m, timeRangeStop: -10m}
      option task = {name: "MetricsDownsample1hTo1d", every: 10m}

      from(bucket: "Metrics1h")
          |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
          |> filter(fn: (r) => (r["_measurement"] == "service_stats"))
          |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
          |> yield(name: "sum")
          |> to(bucket: "Metrics1d", org: "Test")


__Expected behavior:__
The retention service runs periodically, and each check starts and finishes quickly.

__Actual behavior:__

ts=2021-02-07T14:01:43.058796Z lvl=info msg="Retention policy deletion check (start)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=start

I see the start log, but the check never finishes; the corresponding end log never appears.

From */debug/pprof/goroutine?debug=2*:

goroutine 200 [semacquire, 2058 minutes]:
sync.runtime_Semacquire(0xc00dc84710)
        /usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00dc84708)
        /usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*TSMReader).Close(0xc00dc84700, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:365 +0x4a
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*FileStore).Close(0xc00b9c2dd0, 0x407600, 0xc0061adeb8)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:615 +0xd0
github.com/influxdata/influxdb/v2/tsdb/engine/tsm1.(*Engine).Close(0xc001f3a8c0, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:780 +0x9e
github.com/influxdata/influxdb/v2/tsdb.(*Shard).close(0xc003033680, 0xc001e326c0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/shard.go:378 +0x47
github.com/influxdata/influxdb/v2/tsdb.(*Shard).Close(0xc003033680, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/shard.go:368 +0x7b
github.com/influxdata/influxdb/v2/tsdb.(*Store).DeleteShard(0xc00035b000, 0xc1, 0x0, 0x0)
        /go/src/github.com/influxdata/influxdb/tsdb/store.go:804 +0x325
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).run(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:141 +0x209
github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open.func1(0xc001552af0, 0x3a46ae0, 0xc0052fbec0)
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:55 +0x6c
created by github.com/influxdata/influxdb/v2/v1/services/retention.(*Service).Open
        /go/src/github.com/influxdata/influxdb/v1/services/retention/service.go:53 +0x1ee

The shard delete is stuck waiting for an outstanding reference on the TSMReader object to be released.
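
To illustrate the mechanism with a simplified, hypothetical sketch (not the actual influxdb code): in this ref/unref pattern, Close waits on a sync.WaitGroup until every reader has released its reference, so a query that never finishes keeps the shard delete blocked indefinitely:

    // A simplified sketch (not the actual influxdb code) of a ref-counted
    // close: readers take a reference before use and release it when done,
    // and Close waits until every reference is gone.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type refCloser struct {
        refs sync.WaitGroup
    }

    func (r *refCloser) Ref()   { r.refs.Add(1) }
    func (r *refCloser) Unref() { r.refs.Done() }

    // Close blocks until all outstanding references are released, which is
    // the shape of the hang in the retention service stack trace above.
    func (r *refCloser) Close() {
        r.refs.Wait()
    }

    func main() {
        var rc refCloser

        rc.Ref() // a long-running query takes a reference and never releases it

        done := make(chan struct{})
        go func() {
            rc.Close() // the retention service tries to delete the shard
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("closed")
        case <-time.After(2 * time.Second):
            fmt.Println("Close is still blocked on the outstanding reference")
        }
    }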

I am **guessing** that the downsampling task may be contributing to it. 

goroutine 4613564 [select, 1714 minutes]:
github.com/influxdata/flux/execute.(*result).Do(0xc0057492c0, 0x3724850, 0xc00daaf840, 0xc0057492c0)
        /go/pkg/mod/github.com/influxdata/flux@v0.99.0/execute/result.go:61 +0xd9
github.com/influxdata/influxdb/v2/query/control.(*errorCollectingTableIterator).Do(0xc00daaf840, 0x3724850, 0xc00daaf840, 0xc00daaf800)
        /go/src/github.com/influxdata/influxdb/query/control/controller.go:947 +0x42
github.com/influxdata/influxdb/v2/task/backend/executor.exhaustResultIterators(0x3a212a0, 0xc00daaf800, 0xc00daaf800, 0xc0042ea9b0)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:616 +0x51
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).executeQuery(0xc019b3ae80, 0xc004dc4d80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:527 +0x57e
github.com/influxdata/influxdb/v2/task/backend/executor.(*worker).work(0xc019b3ae80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:421 +0x545
github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker.func1(0xc0001581c0, 0xc019b3ae80)
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:291 +0x76
created by github.com/influxdata/influxdb/v2/task/backend/executor.(*Executor).startWorker
        /go/src/github.com/influxdata/influxdb/task/backend/executor/executor.go:288 +0xb6
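
For reference, both dumps above come from the /debug/pprof/goroutine?debug=2 endpoint; a minimal sketch for saving such a dump from a running instance (assuming the default http://localhost:8086 address used elsewhere in this thread):

    // A minimal sketch for saving a goroutine dump from a running server.
    // Assumes http://localhost:8086; if your setup requires authentication,
    // add the appropriate Authorization header before sending the request.
    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // debug=2 returns full goroutine stacks; debug=1 returns the
        // aggregated form shown later in this thread.
        resp, err := http.Get("http://localhost:8086/debug/pprof/goroutine?debug=2")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        out, err := os.Create("goroutines.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        if _, err := io.Copy(out, resp.Body); err != nil {
            log.Fatal(err)
        }
    }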

On Ctrl-C, I see a bunch of piled-up task runs fail, and shortly after that the retention service run was unblocked.

ts=2021-02-09T00:26:17.108700Z lvl=info msg=Stopping log_id=0SAze8nW000 service=nats
ts=2021-02-09T00:26:17.109767Z lvl=info msg=Stopping log_id=0SAze8nW000 service=bolt
ts=2021-02-09T00:26:17.109808Z lvl=info msg=Stopping log_id=0SAze8nW000 service=query
ts=2021-02-09T00:26:17.110168Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121881Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.121922Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122255Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122349Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.122498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128444Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128477Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.128498Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.129889Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc23cd0234000 error="database not open"
ts=2021-02-09T00:26:17.130362Z lvl=info msg="Error exhausting result iterator" log_id=0SAze8nW000 service=task-executor error="context canceled" name=to8
ts=2021-02-09T00:26:17.148484Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc91aa0234000 error="database not open"
ts=2021-02-09T00:26:17.148550Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcb6490234000 error="database not open"
ts=2021-02-09T00:26:17.148657Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcdae80234000 error="database not open"
ts=2021-02-09T00:26:17.148735Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bcff870234000 error="database not open"
ts=2021-02-09T00:26:17.148806Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd24260234000 error="database not open"
ts=2021-02-09T00:26:17.148876Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd48c50234000 error="database not open"
ts=2021-02-09T00:26:17.148958Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd6d640234000 error="database not open"
ts=2021-02-09T00:26:17.149026Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bd92030234000 error="database not open"
ts=2021-02-09T00:26:17.149089Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bdb6a20234000 error="database not open"
ts=2021-02-09T00:26:17.149156Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bddb410234000 error="database not open"
ts=2021-02-09T00:26:17.149235Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b4fcbf0234000 error="database not open"
ts=2021-02-09T00:26:17.149316Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bab5971234000 error="database not open"
ts=2021-02-09T00:26:17.149399Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b28e300e34000 error="database not open"
ts=2021-02-09T00:26:17.153145Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b822690234000 error="database not open"
ts=2021-02-09T00:26:17.153281Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070b5d8790234000 error="database not open"
ts=2021-02-09T00:26:17.153373Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bc6d0b0234000 error="database not open"
ts=2021-02-09T00:26:17.153454Z lvl=error msg="Failed to finish run" log_id=0SAze8nW000 service=task-executor taskID=0706424470198000 runID=070bbda8f0234000 error="database not open"
ts=2021-02-09T00:26:17.197384Z lvl=info msg="Deleted shard" log_id=0SAze8nW000 service=retention op_name=retention_delete_check db_instance=958a746c4d6ae0d4 db_shard_id=193 db_rp=autogen
ts=2021-02-09T00:26:17.197621Z lvl=info msg="Retention policy deletion check (end)" log_id=0SAze8nW000 service=retention op_name=retention_delete_check op_event=end op_elapsed=123874138.839ms



__Environment info:__
Linux 4.15.0-55-CUSTOMIZED x86_64
InfluxDB 2.0.3 (git: fe04d346df) build_date: 2020-12-15T01:00:16Z
Running inside podman container

__Config:__
N/A
danxmoran commented 3 years ago

This is an excellent bug report, thank you!

danxmoran commented 3 years ago

For anyone else seeing this: a side effect of the hanging task queries is that .tsm.tmp files will gradually accumulate in your data directory, because they can't be cleaned up until the queries reading from them complete. The files are cleaned up after the server is restarted.
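
A quick way to watch for that build-up is to walk the engine directory and list lingering .tmp files; a small sketch, assuming the default ~/.influxdbv2/engine path used in the find command later in this thread:

    // A sketch that walks the engine directory and lists lingering .tmp files.
    // Assumes the default ~/.influxdbv2/engine data directory; adjust root if
    // your setup differs.
    package main

    import (
        "fmt"
        "io/fs"
        "log"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        home, err := os.UserHomeDir()
        if err != nil {
            log.Fatal(err)
        }
        root := filepath.Join(home, ".influxdbv2", "engine")

        var tmpFiles []string
        err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
            if err != nil {
                return err
            }
            if !d.IsDir() && strings.HasSuffix(path, ".tmp") {
                tmpFiles = append(tmpFiles, path)
            }
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("%d lingering .tmp files\n", len(tmpFiles))
        for _, f := range tmpFiles {
            fmt.Println(f)
        }
    }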

danxmoran commented 3 years ago

I believe the deadlock is actually coming from the Flux engine, even without the retention service (but the retention service also freezes up once a task hangs, because the TSM file is never un-ref'd).

Full reproducer info for the Flux team:

  1. Create the buckets listed above
    • Metrics1h w/ 1h RP
    • Metrics1d w/ 1d RP
  2. Create a task with this query, scheduled to run every 1m:

    option task = {name: "MetricsDownsample1hTo1d", every: 1m}
    option v = {timeRangeStart: -2m, timeRangeStop: -1m}
    
    from(bucket: "Metrics1h")
        |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
        |> filter(fn: (r) =>
            (r["_measurement"] == "service_stats"))
        |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)
        |> yield(name: "sum")
        |> to(bucket: "Metrics1d", org: "<your-org>")
  3. Run this script to generate data in Metrics1h:

    from influxdb_client import InfluxDBClient, Point, WritePrecision
    from influxdb_client.client.write_api import SYNCHRONOUS
    
    import time
    
    bucket = "Metrics1h"
    org = "<your-org>"
    
    client = InfluxDBClient(url="http://localhost:8086", token="<your-token>", org=org)
    
    write_api = client.write_api(write_options=SYNCHRONOUS)
    
    while True:
        now = int(time.time())
        print(f'Writing points at {now}')
        for i in range(100):
            service = 'service_{}'.format(i)
            p = Point('service_stats')
            p.tag('service', service)
        for idx in range(1000):
                field = 'field_{}'.format(idx)
                p = p.field(field, idx)
            p.time(now, write_precision=WritePrecision.S)
            try:
                res = write_api.write(bucket=bucket, org=org, record=p)
        except Exception:
            # Keep writing even if an individual write fails or times out.
            continue
        time.sleep(1)
  4. Wait until you see this command print some files (the signal that a query has locked up):
    find ~/.influxdbv2/engine -name '*.tmp'
  5. Check http://localhost:8086/debug/pprof/goroutine?debug=1, and see goroutines stuck on:
    5 @ 0x403f565 0x404f625 0x4c6ba85 0x53b5f4f 0x53bfa90 0x59cc8c5 0x59cc02d 0x53b5c34 0x53b7ab0 0x53b589c 0x4c860da 0x40755c1
    #   0x4c6ba84   github.com/influxdata/flux/execute.(*result).Process+0xe4                           /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/result.go:46
    #   0x53b5f4e   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTable+0x1ae         /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:113
    #   0x53bfa8f   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables.func1+0x4f       /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:77
    #   0x59cc8c4   github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).handleRead+0x704          /Users/dan/influxdb/storage/flux/reader.go:812
    #   0x59cc02c   github.com/influxdata/influxdb/v2/storage/flux.(*windowAggregateIterator).Do+0x34c              /Users/dan/influxdb/storage/flux/reader.go:665
    #   0x53b5c33   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).processTables+0x93         /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:76
    #   0x53b7aaf   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*readWindowAggregateSource).run+0x14f   /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:310
    #   0x53b589b   github.com/influxdata/influxdb/v2/query/stdlib/influxdata/influxdb.(*Source).Run+0x33b              /Users/dan/influxdb/query/stdlib/influxdata/influxdb/source.go:56
    #   0x4c860d9   github.com/influxdata/flux/execute.(*executionState).do.func1+0x1d9                     /Users/dan/go/pkg/mod/github.com/influxdata/flux@v0.105.1/execute/executor.go:284

The goroutines are stuck on this select block. The tables channel the engine is trying to write results into is buffered, defined here.
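
To illustrate the shape of that hang with a simplified, hypothetical sketch (not the actual reader code): a producer sending into a buffered channel from inside a select with no default case parks once the buffer fills and the consumer stops draining it, and only cancellation releases it, which matches the Ctrl-C behavior observed above:

    // A simplified sketch (not the actual reader code) of a producer stuck in
    // a select while sending into a full buffered channel whose consumer has
    // stopped draining it.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // Stand-in for the buffered tables channel the storage reader writes
        // result tables into.
        tables := make(chan int, 2)

        released := make(chan struct{})
        go func() {
            for i := 0; ; i++ {
                select {
                case tables <- i:
                    // Delivered a table downstream.
                case <-ctx.Done():
                    close(released)
                    return
                }
            }
        }()

        // The downstream consumer never reads from tables, so once the buffer
        // fills the producer parks in the select above, like the goroutines in
        // the dump.
        time.Sleep(500 * time.Millisecond)
        fmt.Println("producer is blocked; buffered tables:", len(tables))

        cancel() // cancelling the query (e.g. on shutdown) unblocks the producer
        <-released
        fmt.Println("producer released after cancellation")
    }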

danxmoran commented 3 years ago

@WeiliangLuo the Flux team has advised a few work-arounds: