Open micya opened 1 year ago
Since we ultimately need to monitor across a range of different platforms, we will need a push-based system (as opposed to a pull/scrape-based system like raw Prometheus).
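To make the push-based idea concrete, here is a minimal sketch of the collector side, assuming each monitored component periodically pushes a heartbeat and the collector flags anything that has gone quiet. The class name, the staleness threshold, and the source names are all hypothetical and not taken from any existing Orcasound code.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatRegistry:
    """Collector for a push-based monitor: sources report in, nothing is scraped."""

    max_age_seconds: float = 300.0  # assumed staleness threshold
    last_seen: dict[str, float] = field(default_factory=dict)

    def push(self, source: str, now: float | None = None) -> None:
        """Record a heartbeat pushed by a source (any platform can call this)."""
        self.last_seen[source] = time.time() if now is None else now

    def stale_sources(self, now: float | None = None) -> list[str]:
        """Return sources whose latest heartbeat is older than the threshold."""
        current = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if current - seen > self.max_age_seconds
        )


# Usage with explicit timestamps so the result is deterministic.
registry = HeartbeatRegistry(max_age_seconds=60)
registry.push("node-a", now=0.0)    # hypothetical source names
registry.push("node-b", now=100.0)
print(registry.stale_sources(now=120.0))  # ['node-a']
```

The point of the design is that the collector never needs network access to the monitored platforms; each one only needs to be able to reach the collector.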
Hey @micya, noticed the Canadian Integrated Ocean Observing System has an uptime monitor that is based on open source code: https://github.com/upptime/upptime. It might not be able to help with the instances, but could help ensure we know when any of these sites are not available:
Hey @micya -- Just noting a couple recent thoughts on possible tools, integrations, and/or data sources for an over-arching dashboard (i.e. maybe for not only the Azure-based realtime inference system, but the whole emerging ecosystem of Orcasound apps, APIs, and data layers):
- orcanode code: we used to use logDNA, now Mezmo, for monitoring processes and errors on each streaming computer.
- orcasound.net Wordpress site.
- acoustic-sandbox bucket where we will store labeled data, so maybe a dashboard feature could be quantifying the current, growing size of our labeled data sets (e.g. 13,435 SRKW call labels, with 13% validated to call type)?
- Line chart for Cosmos DB read/write metrics
A sub-feature of a Cosmos DB read line chart that I would find interesting:
Number of API requests from "outsiders" -- a possible metric for measuring the value of our open labeled data to external collaborators, e.g. ML developers or bioacousticians.
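One way the "outsiders" metric could be computed is sketched below, assuming request logs can be exported as records that carry some caller identifier. The `caller` field and the set of internal caller names are hypothetical; the real logs may identify callers differently (API key, origin header, IP range).

```python
from collections import Counter

# Hypothetical identifiers for our own apps; anything else counts as external.
INTERNAL_CALLERS = frozenset({"orcasound-web", "moderator-portal"})


def count_requests_by_origin(records, internal_callers=INTERNAL_CALLERS):
    """Split request records into internal vs. external counts.

    Each record is assumed to be a dict with an optional "caller" key.
    Records without a known internal caller are counted as external.
    """
    counts = Counter({"internal": 0, "external": 0})
    for record in records:
        key = "internal" if record.get("caller") in internal_callers else "external"
        counts[key] += 1
    return dict(counts)


sample = [{"caller": "orcasound-web"}, {"caller": "ml-dev-1"}, {}]
print(count_requests_by_origin(sample))  # {'internal': 1, 'external': 2}
```

Counting unidentified callers as external is a deliberate simplification here; it overstates outside usage if our own services do not tag their requests.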
We (@xilin22 and I) looked into setting up Prometheus and Grafana for a health dashboard, but determined that Azure Managed Grafana doesn't allow individuals with personal accounts to access the dashboard; a work or school account is required. (See following error:)
There's a feedback request for this feature, but it doesn't seem as though the Grafana team is looking to implement this any time soon.
We are now looking into using Azure Workbooks for data visualization instead, which is newer and may solve some of the pain points that were called out in 2022.
As for alerting, we can add more Azure Functions to monitor service and resource health, since Azure Managed Grafana does not allow personal accounts to log in to an Azure Managed Grafana instance.
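As a rough illustration of the kind of check such a function could run, here is a sketch in plain Python using only the standard library. It assumes the services expose HTTP endpoints and that a 2xx response means healthy; the Azure Functions trigger and bindings are left out, and the function names are hypothetical.

```python
import urllib.error
import urllib.request


def check_endpoint(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (is_healthy, detail) for one HTTP endpoint.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, timeouts, refused connections, and 4xx/5xx
        # responses (urlopen raises HTTPError, a URLError subclass, for those).
        return False, f"unreachable: {exc}"
    return 200 <= status < 300, f"HTTP {status}"


def unhealthy_endpoints(urls, check=check_endpoint):
    """Return {url: detail} for every endpoint that failed its check."""
    failures = {}
    for url in urls:
        healthy, detail = check(url)
        if not healthy:
            failures[url] = detail
    return failures
```

A timer-triggered function could call `unhealthy_endpoints` with the list of service URLs and send a notification whenever the returned dict is non-empty.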
@micya @scottveirs We may be able to get Azure Managed Grafana to work if we create our own organizational domain. It might be worth a shot if there is little to no cost in creating one. Maybe then Azure won't view it as a personal account.
We already have an organization. If you create a user in our AAD tenant, that should work. Though we would then need to track the username/password for the new user.
That makes sense. I don't have permissions to create one. Maybe either @micya or @scottveirs can create one and send me the credentials?
@xilin22 - granted "User Administrator" on AAD tenant. Let me know if that doesn't work.
Historically, troubleshooting inference system/notification system failures involved manual steps to identify the failures. A past hackathon focused on utilizing Azure Dashboards to surface some metrics from Log Analytics. However, Azure Dashboards is difficult for non-technical observers to use.
I'd like to look into setting up something separate from Azure for monitoring purposes. It can either be a self-developed application or an existing monitoring solution (Prometheus?). It should show at minimum: