nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Calculate Storage actually used vs allocated in OpenShift #315

Open joachimweyl opened 1 year ago

joachimweyl commented 1 year ago

Motivation

We should be able to track storage and trigger a warning at, say, 10% of storage remaining. Technically we can oversubscribe, since volumes are thin provisioned and many projects will never use the full space allocated to them.

Completion Criteria

Calculate storage actually used vs allocated. Alert at 90% used.

Description

Notes

  1. how many bytes of volumes we have actually allocated (e.g., a thin provisioned 80GB volume counts as 80GB allocated)
  2. how many bytes of volumes are actually used (e.g., a thin provisioned 80GB volume may be only using 1MB right now)
  3. how much quota we have handed out
  4. rbd du -p <POOLNAME> should return the value we want
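The comparison in the notes above can be sketched roughly as follows. This is a minimal sketch, not the agreed implementation: it assumes the JSON form of `rbd du` (`--format json`) lists each image with `provisioned_size` and `used_size` fields in bytes, that there are no snapshot entries to double count, and that the pool capacity is known from elsewhere. The function name `summarize_pool`, the sample data, and the 100 GiB capacity are hypothetical.

```python
import json

ALERT_THRESHOLD = 0.90  # "Alert at 90% used" from the completion criteria


def summarize_pool(rbd_du_json: str, pool_capacity_bytes: int) -> dict:
    """Summarize allocated vs. used bytes for one pool.

    Assumes the input looks like the JSON output of `rbd du`, i.e. an
    "images" list whose entries carry "provisioned_size" and "used_size"
    in bytes. Snapshot entries are not handled in this sketch.
    """
    report = json.loads(rbd_du_json)
    images = report.get("images", [])
    provisioned = sum(img["provisioned_size"] for img in images)
    used = sum(img["used_size"] for img in images)
    used_fraction = used / pool_capacity_bytes
    return {
        "provisioned_bytes": provisioned,
        "used_bytes": used,
        # Values above 1.0 mean the pool is oversubscribed.
        "oversubscription": provisioned / pool_capacity_bytes,
        "used_fraction": used_fraction,
        "alert": used_fraction >= ALERT_THRESHOLD,
    }


# Hypothetical sample: one thin provisioned 80 GiB volume using 1 MiB.
SAMPLE = json.dumps({
    "images": [
        {"name": "vol-a", "provisioned_size": 80 * 1024**3, "used_size": 1024**2},
    ]
})

summary = summarize_pool(SAMPLE, pool_capacity_bytes=100 * 1024**3)
print(summary)
```

Handed-out quota (item 3) would come from a different source, such as the allocation system, and is not covered here.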

Completion dates

Desired - early 2024
Required - mid 2024

msdisme commented 2 months ago

Is there a way to track the rate of fill? E.g., how quickly are allocations being used up, so that we can project what we may need? (This is a query only, not a request to implement this part.)
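One possible way to answer the rate-of-fill question, purely as an illustration: fit a straight line to periodic used-bytes samples and extrapolate to the pool capacity. The sketch below assumes such samples are already being collected somewhere (for example by observability); the function name `days_until_full` and the sample numbers are hypothetical, and a linear trend is only a rough model of real usage.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(
    samples: Sequence[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """Estimate days until capacity is reached from (day, used_bytes) samples.

    Uses an ordinary least-squares slope as the fill rate (bytes per day).
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must cover more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    # Extrapolate from the most recent observation.
    _, latest_used = max(samples)
    return (capacity_bytes - latest_used) / slope


# Hypothetical samples: usage grows by 10 units per day toward a capacity of 100.
print(days_until_full([(0, 10), (1, 20), (2, 30)], capacity_bytes=100))
```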

msdisme commented 1 month ago

Bringing back to strategy/roadmapping meeting for priority

msdisme commented 1 week ago

E.g., extension of the image store. @jtriley @waygil - relevant in terms of eliminating service disruption. May be covered by observability.

hpdempsey commented 1 week ago

@schwesig @computate this is already in the plan for Observability to cover. Can you investigate and post a follow-up here?

schwesig commented 1 week ago

ACK

schwesig commented 1 week ago

@joachimweyl with whom can I have a call to clarify this? The 80GB is maybe just an example, but it is close to some GPU memory sizes, so I want to be sure which memory is meant and how you define it. I assume we are only talking about the prod cluster and ColdFront-claimed projects?

joachimweyl commented 1 week ago

@schwesig you can set up a meeting with @jtriley to gather information about storage totals. For usage, I assume observability can gather that in some way. The 80GB discussed in the notes section is only an example. Essentially, it is trying to say that ColdFront might report that someone is allocated 80GB, or even 8TB, but that is not what their systems are currently using. For example, one of those 80GB or 8TB drives might actually be using only 100MB of the space, which leaves 79.9GB or 7.9999TB unused. The idea behind this issue is to find a way to calculate the requested value (ColdFront) and also the amount actually in use (observability).