splunk / splunk-add-on-microsoft-azure

Splunk Add-on for Microsoft Azure
Apache License 2.0

Azure Metric data stops being retrieved due to invalid metric list #45

Closed AndrewTrobec closed 1 year ago

AndrewTrobec commented 1 year ago

Hello,

I am having an issue with synchronizing Azure Metric data. I am using TA version 4.0.3 on Splunk Enterprise 8.2.9 running as a HF. Subscriptions that have been syncing data successfully suddenly stop for no apparent reason. With the TA in DEBUG mode, I can only determine that at a certain point the connector starts failing. Here is an example where, within one 5-minute iteration, a call to the same subscription has been split from one call into two and now returns status 400 instead of 200:

image

I managed to replicate the same failing call in Postman, where I can see that invalid metric names in the call cause the endpoint to return an error:

image

After I removed the metrics tempdb_data_size, tempdb_log_size, tempdb_log_used_percent, sqlserver_process_core_percent, and sqlserver_process_memory_percent from the failing calls, the calls succeeded in Postman.

Since I only configure the namespaces I want and not the metric list, how can I debug this issue?

Thanks!

Andrew

JasonConger commented 1 year ago

The add-on enumerates the available metrics for the namespace and caches them in the KV Store. This cache is updated every 30 days by default (code link below). https://github.com/splunk/splunk-add-on-microsoft-azure/blob/cf788fca119d993c0e124af8e4fb29bd587fcea3/package/bin/ta_azure_utils/metrics.py#L99
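
To illustrate why a stale or unwanted metric list can persist for weeks, here is a minimal sketch of an age-based cache check. The names `cache_is_stale`, `last_refreshed_epoch`, and `CACHE_MAX_AGE_DAYS` are hypothetical and are not taken from the add-on; how the add-on actually stores and compares the timestamp is in the linked metrics.py.

```python
import time

# Assumption: 30 days mirrors the default refresh interval mentioned above.
CACHE_MAX_AGE_DAYS = 30
SECONDS_PER_DAY = 86400


def cache_is_stale(last_refreshed_epoch, now_epoch=None,
                   max_age_days=CACHE_MAX_AGE_DAYS):
    """Return True if the cached metric list is older than max_age_days.

    Both timestamps are seconds since the Unix epoch.
    """
    if now_epoch is None:
        now_epoch = time.time()
    return (now_epoch - last_refreshed_epoch) > max_age_days * SECONDS_PER_DAY
```

Under this kind of scheme, a metric name that enters the cached list stays in every request until the next refresh, which would be consistent with an input failing on every interval once it starts failing.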

I see that tempdb_data_size, tempdb_log_size, tempdb_log_used_percent, sqlserver_process_core_percent, and sqlserver_process_memory_percent are valid metrics according to https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-supported#microsoftsqlserversdatabases

Doing some more digging, it looks like the above metrics are considered "Advanced metrics" and are dependent on databases using vCore models => https://learn.microsoft.com/en-us/azure/azure-sql/database/metrics-diagnostic-telemetry-logging-streaming-export-configure?view=azuresql&tabs=azure-portal#advanced-metrics . This seems to be the root cause of the error. I'm unsure when these "Advanced metrics" showed up in the metric definitions from the Microsoft REST API, but they seem to be breaking the input.
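
Since the metric definitions apparently list these names even for resources that reject them, one way to find the offending names for a given resource is to probe each name on its own, much like the Postman test above. The sketch below is not part of the add-on: `find_rejected_metrics` and `probe` are hypothetical names, and `probe` stands in for whatever function issues the metrics request for a list of names and returns the HTTP status code.

```python
def find_rejected_metrics(metric_names, probe):
    """Return the metric names that the endpoint rejects with HTTP 400.

    probe is a callable that takes a list of metric names, requests them
    for one resource, and returns the HTTP status code of the response.
    """
    rejected = []
    for name in metric_names:
        # Request one metric at a time so a 400 can be tied to one name.
        if probe([name]) == 400:
            rejected.append(name)
    return rejected
```

In real use, `probe` would send an authenticated GET to the same metrics URL the add-on logs in DEBUG mode, with `metricnames` set to the single name being tested. This costs one request per metric, so it is better suited to one-off debugging than to running on every collection interval.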

As an immediate workaround for this particular issue, an exclusion list can be added in the code. An enhancement will be added to the add-on to support excluded metrics via the UI.

Starting on line 184 of metrics.py, add an exclusion list for the metrics causing the issue (note the # Begin exclusion code and # End exclusion code markers):

    # Index the preferred metrics
    for metric_list in metric_list_preferred:
        # Begin exclusion code
        exclude_metrics = ['tempdb_data_size','tempdb_log_size','tempdb_log_used_percent','sqlserver_process_core_percent','sqlserver_process_memory_percent']
        metric_list = [metric for metric in metric_list if metric not in exclude_metrics]
        # End exclusion code
        metric_url = management_base_url + "%s/providers/microsoft.insights/metrics?api-version=2018-01-01&timespan=%s&interval=%s&aggregation=%s&metricnames=%s" % \
            (resource_obj["resource_id"], metric_timespan, preferred_time_aggregation, ",".join(requested_metric_statistics), ",".join(metric_list))
        helper.log_debug("_Splunk_ input_name=%s Preferred metric URL: %s" % (input_name, metric_url))
        _index_metrics(helper, ew, access_token, resource_obj, metric_url, requested_metric_statistics, metric_aggregations)

    # Index the alternate metrics
    for metric_list in metric_list_alternate:
        # Begin exclusion code
        exclude_metrics = ['tempdb_data_size','tempdb_log_size','tempdb_log_used_percent','sqlserver_process_core_percent','sqlserver_process_memory_percent']
        metric_list = [metric for metric in metric_list if metric not in exclude_metrics]
        # End exclusion code
        metric_url = management_base_url + "%s/providers/microsoft.insights/metrics?api-version=2018-01-01&timespan=%s&aggregation=%s&metricnames=%s" % \
            (resource_obj["resource_id"], metric_timespan, ",".join(requested_metric_statistics), ",".join(metric_list))
        helper.log_debug("_Splunk_ input_name=%s Alternate metric URL: %s" % (input_name, metric_url))
        _index_metrics(helper, ew, access_token, resource_obj, metric_url, requested_metric_statistics, metric_aggregations)
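
The same filtering can also be pulled into a single helper so the exclusion list is defined once. This is only a sketch under stated assumptions: `EXCLUDED_METRICS` and `filter_metric_batches` are hypothetical names, not part of the add-on. It also skips any batch that ends up empty after filtering, on the assumption that a request with an empty `metricnames` value is not what you want to send.

```python
# Metric names from this issue that the endpoint rejected.
EXCLUDED_METRICS = frozenset([
    "tempdb_data_size",
    "tempdb_log_size",
    "tempdb_log_used_percent",
    "sqlserver_process_core_percent",
    "sqlserver_process_memory_percent",
])


def filter_metric_batches(metric_batches, excluded=EXCLUDED_METRICS):
    """Yield each batch of metric names with excluded names removed.

    Batches that contain only excluded names are dropped entirely.
    """
    for batch in metric_batches:
        kept = [name for name in batch if name not in excluded]
        if kept:
            yield kept
```

With this helper, each loop could iterate over `filter_metric_batches(metric_list_preferred)` or `filter_metric_batches(metric_list_alternate)` instead of repeating the exclusion code in both places.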

AndrewTrobec commented 1 year ago

@JasonConger Thank you so much for providing this! Also, sorry for taking so long to update: we were devising a mechanism that overrides the metric list based on your suggestion above, and then there was the Easter break. Happy to say all is working!