os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation
Apache License 2.0

Event-Driven Data Ingestion POC: Implement monitoring for GPU Usage #129

Open caldeirav opened 2 years ago

caldeirav commented 2 years ago

We need better monitoring and management of GPU usage for the two sets of GPUs we are planning to use. The task is to look into building a pipeline from the AWS monitoring data and integrating it with Kafka running on the cluster.

This could be a good technical POC for the event-driven data ingestion pattern we have discussed with LSEG, and something a new team member with event-driven architecture knowledge could work on fairly easily.
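As a rough illustration of the pipeline proposed above, here is a minimal sketch of the step that turns monitoring datapoints into Kafka-ready events. It assumes datapoints shaped like the `Datapoints` entries returned by CloudWatch `get_metric_statistics` (`Timestamp`, `Average`, `Unit`), and that a GPU metric is actually being published to CloudWatch, which is not the case by default. The function names, the event field names, and the `send` callable (a stand-in for a real Kafka producer) are hypothetical and not taken from this repository.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Iterable, Tuple


def to_kafka_record(datapoint: dict, instance_id: str,
                    metric_name: str = "gpu_utilization") -> Tuple[bytes, bytes]:
    """Turn one monitoring datapoint into a (key, value) pair of bytes.

    The event schema below is an assumption made for illustration only.
    """
    ts = datapoint["Timestamp"]
    if isinstance(ts, datetime):
        # Treat naive timestamps as UTC, then serialize as ISO 8601.
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        ts = ts.astimezone(timezone.utc).isoformat()
    event = {
        "instance_id": instance_id,
        "metric": metric_name,
        "timestamp": ts,
        "value": float(datapoint["Average"]),
        "unit": datapoint.get("Unit", "Percent"),
    }
    # Keying by instance keeps each instance's events in one partition.
    key = instance_id.encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


def publish(datapoints: Iterable[dict], instance_id: str,
            send: Callable[[bytes, bytes], None]) -> int:
    """Send datapoints in timestamp order; return how many were sent.

    `send` is any callable taking (key, value), for example a thin
    wrapper around a Kafka producer's send method.
    """
    count = 0
    for dp in sorted(datapoints, key=lambda d: d["Timestamp"]):
        key, value = to_kafka_record(dp, instance_id)
        send(key, value)
        count += 1
    return count
```

Keeping the producer behind a plain callable means the conversion logic can be tried out without a broker; in a real deployment `send` would wrap whichever Kafka client the cluster uses.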

HeatherAck commented 1 year ago

Mikhail is working on this.

HeatherAck commented 1 year ago

Still in progress: we still need to implement the GPU dashboard, but we won't use the Kafka event-driven approach.

HeatherAck commented 1 year ago

ETA 30-Nov

HeatherAck commented 1 year ago

@redmikhail to implement the week of 12-Dec

HeatherAck commented 1 year ago

In progress: resolving some issues where CL1 and CL2 are in different states, so that the two can be kept in sync. PRs will need to be approved (Ryan is here until the 23rd; Eric is here until the 22nd).

HeatherAck commented 1 year ago

Installed 2 dashboards on CL1, but fixes are still required, and they need to be added to Operate First before deploying to CL2. Still need to update NVIDI

HeatherAck commented 1 year ago

Link to dashboard: https://grafana-opf-monitoring.apps.odh-cl1.apps.os-climate.org/d/r9x7iJMVz/gpu-utilization-dashboard?orgId=1&refresh=15s