nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0

Azure: implement sending memory metrics via diagnostic extension #2022

Closed francescolavra closed 3 months ago

francescolavra commented 3 months ago

This change set enhances the cloud_init klib by implementing an Azure VM agent, which fixes the "virtual machine agent status is not ready" warning that is currently displayed for Nanos instances in the Azure portal. It also adds a new "azure" klib that implements an Azure extension similar to the Linux Diagnostic extension.

The current implementation supports sending 4 types of memory metrics: available memory and used memory, each reported both as a number of bytes and as a percentage of total memory. The azure klib is configured in the manifest options via an "azure" tuple; the diagnostic functionality in this klib is enabled and configured by inserting a "diagnostics" tuple with the attributes shown in the example below.

Example snippet of Ops configuration file:

"ManifestPassthrough": {
  "azure": {
    "diagnostics": {
      "storage_account": "mystorageaccount",
      "storage_account_sas": "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-05-22T14:50:28Z&st=2024-05-12T06:50:28Z&spr=https&sig=xxyyzz",
      "metrics": {"sample_interval": "15","transfer_interval": "60"}
    }
  }
}
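To illustrate the four metric types mentioned above, here is a minimal Python sketch that derives them from two inputs. It is not the klib code: the function name, the dictionary keys, and the assumption that used memory equals total memory minus available memory are all hypothetical and chosen only for illustration.

```python
def memory_metrics(total_bytes, available_bytes):
    """Derive the four memory metrics from total and available memory.

    Assumption (not stated in the PR): used = total - available.
    The key names below are hypothetical, not the names used by the klib.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= available_bytes <= total_bytes:
        raise ValueError("available_bytes must be between 0 and total_bytes")
    used_bytes = total_bytes - available_bytes
    return {
        "available_bytes": available_bytes,
        "used_bytes": used_bytes,
        "available_percent": 100.0 * available_bytes / total_bytes,
        "used_percent": 100.0 * used_bytes / total_bytes,
    }


if __name__ == "__main__":
    # Example: a 1 GiB instance with 256 MiB available.
    print(memory_metrics(1024 * 1024 * 1024, 256 * 1024 * 1024))
```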

Aggregated memory metrics data consist of the number of samples, the minimum, maximum, last, and average values, and the sum of all values; these data are inserted into an Azure storage table (one entity per set of aggregated data). The table name has the format "WADMetricsxxxxP10DV2Syyyymmdd", where xxxx is the transfer interval expressed as an ISO 8601 duration, and yyyymmdd is the start date of the 10-day interval to which the metrics refer (thus, a new table is created every 10 days). For example, a table named "WADMetricsPT1MP10DV2S20240503" contains metrics data aggregated every minute ("PT1M" is the ISO 8601 representation of a 1-minute period) generated for a 10-day period starting on May 3, 2024.
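The aggregation and table naming described above can be sketched in Python as follows. This is an illustrative sketch, not the klib implementation: the function names and dictionary keys are hypothetical, the transfer interval is assumed to be expressed in seconds, and the start date of the 10-day period is taken as an input because how that date is aligned is not covered here.

```python
from datetime import date


def aggregate(samples):
    """Summarize the samples collected during one transfer interval."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = sum(samples)
    return {
        "count": len(samples),
        "minimum": min(samples),
        "maximum": max(samples),
        "last": samples[-1],
        "average": total / len(samples),
        "total": total,
    }


def iso8601_duration(seconds):
    """Format a positive whole number of seconds as an ISO 8601 duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = [
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (secs, "S"))
        if value
    ]
    return "PT" + "".join(parts)


def table_name(transfer_interval_s, period_start):
    """Build the table name for a transfer interval and a period start date.

    Assumption: transfer_interval_s is in seconds, and period_start is the
    first day of the 10-day period (supplied by the caller).
    """
    duration = iso8601_duration(transfer_interval_s)
    return f"WADMetrics{duration}P10DV2S{period_start:%Y%m%d}"


if __name__ == "__main__":
    print(aggregate([4, 2, 6, 8]))
    print(table_name(60, date(2024, 5, 3)))
```

If both intervals in the example configuration are in seconds, a sample interval of 15 and a transfer interval of 60 would give four samples per aggregated entity.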

By default, the Azure portal does not display these metrics in its charts. For the metrics to be available in the portal, the Linux Diagnostic extension must be enabled and configured on a running instance (this can be done in the "Diagnostic settings" section of the portal) to match the settings in the Nanos manifest options. More specifically, the storage account and the metric aggregation interval specified in the Azure diagnostic settings must match those specified in the manifest options. Note: the Azure VM agent implemented in the cloud_init klib responds to requests to enable and configure the diagnostic extension, but does not actually apply the extension settings specified in the requests; instead, it always applies the settings from the manifest.

Closes https://github.com/nanovms/nanos/issues/2014