Open pawanpraka1 opened 4 years ago
@pawanpraka1 Are you suggesting some sort of alerting-related configuration that triggers when the pool reaches x% capacity?
As with every copy-on-write file system, problems appear well before full capacity is reached. Many datasets will fail to allocate new space even while the pool still reports free space.
Alerting levels should be: one at 70% to warn of the upcoming need to extend capacity or to delete datasets/snapshots/bookmarks; one at 80% to signal a current need to do so; and from there in 5% intervals right up to 95%, at which point operations basically cease.
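The thresholds proposed above can be sketched as a small classifier. This is an illustration only, not code from the project; the function name `alert_level` and the message strings are made up here.

```python
from typing import Optional

def alert_level(capacity_percent: float) -> Optional[str]:
    """Map a pool capacity percentage to the proposed alert levels:
    70% = early warning, 80%+ = act now (escalating in 5% steps),
    95%+ = operations basically cease."""
    if capacity_percent >= 95:
        return "critical: operations may cease"
    if capacity_percent >= 80:
        # 80, 85, 90 -> escalating urgency in 5% steps
        step = int((capacity_percent - 80) // 5)
        return f"urgent-{step}: extend capacity or delete datasets/snapshots"
    if capacity_percent >= 70:
        return "warning: capacity extension will be needed soon"
    return None

print(alert_level(72))  # -> warning: capacity extension will be needed soon
print(alert_level(85))  # -> urgent-1: extend capacity or delete datasets/snapshots
print(alert_level(96))  # -> critical: operations may cease
```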
The pool metrics then go together with
Validate whether the node exporter provides metrics for ZFS pools. Based on that information, we can create alert rules. Scoping the validation for v4.2.
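If the validation finds per-pool size/allocation metrics, an alert rule could look like the sketch below. The metric names `node_zfs_pool_allocated_bytes` and `node_zfs_pool_size_bytes` are placeholders, not confirmed node_exporter metric names; substitute whatever the validation actually turns up.

```yaml
groups:
  - name: zfs-pool-capacity
    rules:
      - alert: ZfsPoolNearCapacity
        # Placeholder metric names -- replace with the metrics the
        # node exporter really exposes for pool size/allocation.
        expr: |
          100 * node_zfs_pool_allocated_bytes / node_zfs_pool_size_bytes > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is over 80% full"
```

A second rule with a higher threshold (e.g. 95%) and `severity: critical` would cover the point where operations cease.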
When a ZFS pool is full, the pool may become unavailable or inaccessible. We should avoid getting into this state.
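As a fallback where no exporter metric is available, pool fill level can be read directly with `zpool list`. A minimal sketch, assuming the standard `zpool list -H -o name,capacity` output format; the `classify` helper and the sample values below are invented for illustration:

```shell
#!/bin/sh
# Classify a pool's fill level before it risks becoming inaccessible.
# On a real node the name/capacity pairs would come from:
#   zpool list -H -o name,capacity
classify() {
    # $1 = pool name, $2 = capacity like "78%"
    pct=${2%\%}
    if [ "$pct" -ge 95 ]; then
        echo "CRITICAL: pool $1 at $2 -- pool may become inaccessible"
    elif [ "$pct" -ge 70 ]; then
        echo "WARNING: pool $1 at $2 -- extend capacity or free space"
    else
        echo "OK: pool $1 at $2"
    fi
}

classify tank 78%      # simulated values, not from a live pool
classify backup 96%
```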