influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Retention service hangs and does not remove old shards #25054

Open gwossum opened 3 months ago

gwossum commented 3 months ago

Under certain conditions, the retention service can hang while waiting for a shard's reference count to drop to zero. While it is hung, the retention service cannot remove any other shards, so expired data accumulates and disk usage can eventually become high.

The attached goroutine trace shows a system exhibiting the issue. The retention service is stuck waiting on the WaitGroup that signals when the references to the shard have dropped to zero. goroutine.txt

davidby-influx commented 3 months ago

Can you create port issues for all the repos and branches, @gwossum?