netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.

[Feat]: introduce some form of "Alert CTR Score" #760

Open andrewm4894 opened 1 year ago

andrewm4894 commented 1 year ago

Problem

Given all the information we have in an alert notification, can we use ML to build a model to predict which alerts are more or less likely to solicit some action from the user?

If we could, then this "alert ctr score" or "alert rank" score could be used by users to route or help prioritize alerts.

This builds on initial research work done in this GH discussion.

Following this work, we are now at the point where we could train an initial model on the last 6 months or so of alert data, build a prediction endpoint, and then expose or make available this probability(alert_click) score in NC alert templates etc.

Description

This would be an API endpoint that takes in the text of an alert and returns a score from 0-100% for the probability of a click being observed on that alert, given all the training data.

For example a ctr prob of:

Importance

really want

Value proposition

  1. additional "context" to each alert. Think of it as saying "based on what we observe across the netdata community as a whole, this alert would on average tend to have a higher click or response rate, so maybe you might want to take that into account".

Proposed implementation

andrewm4894 commented 1 year ago

Adding some updates here, as we have now had enough months of holdout data for solid validation. We have trained and re-trained the model each month, and its performance is consistent and quite well calibrated.

Inputs

Inputs are a 50/50 split of alerts with clicks (1's, or positive examples) and without clicks (0's, or negative examples). Here is an example input:

{
  "text": "name is web_log_1m_successful .\nstatus from clear to warning .\nvalue is 90 % .\nchart is web_log_nginx.requests_by_type .\nfamily is requests .\nclassification is workload .\nrole is webmaster .\nwarning count is 2 .\ncritical count is 0 .\nduration hours is lte2 .\nnonclear duration hours is zero .",
  "label": 0
}

text is a templated text representation of everything we want the model to know about the alert:

name is web_log_1m_successful . 
status from clear to warning . 
value is 90 % . 
chart is web_log_nginx.requests_by_type . 
family is requests . 
classification is workload . 
role is webmaster . 
warning count is 2 . 
critical count is 0 . 
duration hours is lte2 . 
nonclear duration hours is zero .

so the template here looks like this:

name is <alert-name> . 
status from <alert-status-prev> to <alert-status-current> . 
value is <alert-value> <alert-units> . 
chart is <chart-name> . 
family is <chart-family> . 
classification is <chart-workload> . 
role is <alert-role> . 
warning count is <warning-count-active> . 
critical count is <critical-count-active> . 
duration hours is <alert-status-duration> . 
nonclear duration hours is <non-clear-status-duration-hours> .

label is the 0 (no click) or 1 (click) outcome that we would like to model and predict.

Given these inputs we can treat this as a classification problem: given the text, we want to learn a model that produces useful prob(label=1) probabilities.
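As a rough illustration (the field names below are assumptions for the sketch, not the actual NC schema), the templated text could be rendered from an alert record with something like this:

def alert_to_text(alert: dict) -> str:
    """Render an alert record into the templated sentence format above."""
    parts = [
        f"name is {alert['name']} .",
        f"status from {alert['status_prev']} to {alert['status_current']} .",
        f"value is {alert['value']} {alert['units']} .",
        f"chart is {alert['chart']} .",
        f"family is {alert['family']} .",
        f"classification is {alert['classification']} .",
        f"role is {alert['role']} .",
        f"warning count is {alert['warning_count']} .",
        f"critical count is {alert['critical_count']} .",
        f"duration hours is {alert['duration_hours']} .",
        f"nonclear duration hours is {alert['nonclear_duration_hours']} .",
    ]
    return "\n".join(parts)

example = {
    "name": "web_log_1m_successful",
    "status_prev": "clear", "status_current": "warning",
    "value": 90, "units": "%",
    "chart": "web_log_nginx.requests_by_type",
    "family": "requests", "classification": "workload", "role": "webmaster",
    "warning_count": 2, "critical_count": 0,
    "duration_hours": "lte2", "nonclear_duration_hours": "zero",
}

print(alert_to_text(example))  # reproduces the example "text" shown above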

Models

We have trained 3 models: the candidate alert-rank model, a random baseline model, and a simple baseline model.

As expected, the candidate alert-rank model has pretty good traditional performance metrics (we will discuss the metrics we actually care about later, but it is good to see things working as you would expect):

test classification report
              precision    recall  f1-score   support

           0       0.73      0.74      0.73     21919
           1       0.68      0.67      0.68     18476

    accuracy                           0.71     40395
   macro avg       0.70      0.70      0.70     40395
weighted avg       0.71      0.71      0.71     40395

test accuracy_score=0.7069
test precision_score=0.6827
test recall_score=0.6712
test f1_score=0.6769
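For reference, metrics in this format could be produced with scikit-learn along the following lines (a minimal sketch; y_test and y_pred here are placeholders standing in for the real holdout labels and predictions):

from sklearn.metrics import (
    accuracy_score, classification_report, f1_score, precision_score, recall_score,
)

y_test = [0, 0, 1, 1, 0, 1]  # placeholder ground-truth labels (clicked or not)
y_pred = [0, 1, 1, 1, 0, 0]  # placeholder model predictions

print("test classification report")
print(classification_report(y_test, y_pred))
print(f"test accuracy_score={accuracy_score(y_test, y_pred):.4f}")
print(f"test precision_score={precision_score(y_test, y_pred):.4f}")
print(f"test recall_score={recall_score(y_test, y_pred):.4f}")
print(f"test f1_score={f1_score(y_test, y_pred):.4f}")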

The random model, as expected, shows no useful performance:

test classification report random
              precision    recall  f1-score   support

           0       0.54      1.00      0.70     21919
           1       0.00      0.00      0.00     18476

    accuracy                           0.54     40395
   macro avg       0.27      0.50      0.35     40395
weighted avg       0.29      0.54      0.38     40395

test random accuracy_score=0.5426
test random precision_score=0.0
test random recall_score=0.0
test random f1_score=0.0

The simple model, meanwhile, does seem to show some useful performance (albeit with a very different balance between precision and recall):

test classification report simple
              precision    recall  f1-score   support

           0       0.58      0.96      0.72     21919
           1       0.77      0.16      0.27     18476

    accuracy                           0.59     40395
   macro avg       0.67      0.56      0.49     40395
weighted avg       0.66      0.59      0.51     40395

test simple accuracy_score=0.5938
test simple precision_score=0.7655
test simple recall_score=0.1615
test simple f1_score=0.2667

Holdout Validation

Most important here is how well the trained model ranks the alerts each day. To measure this, we have been running the model in batch mode on all alerts every day for the most recent month to date, which has not been included in the model's training data.

Since we balance the training data as a 50/50 split but obviously don't for the holdout validation data (we just score all alerts to try to measure true impact), we need to look at performance metrics on the validation data a little differently.

Each day we score all alerts from a few days ago (to give enough time for ground truth labels to arrive - i.e. by now each alert has either been clicked or not) and rank all alerts into 10 decile buckets based on the score (prob(click)) from the trained model.

When we do that we end up with a table like below (based on a random sample of holdout data):

prob_ntile  true_mean  true_count  true_sum  prob_true  no_model_mean  uplift_factor
         0   0.000387        7746         3   0.041177       0.004328       0.089483
         1   0.000805        7458         6   0.105089       0.004328       0.185877
         2   0.002105        7600        16   0.217449       0.004328       0.486412
         3   0.003156        7605        24   0.263502       0.004328       0.729138
         4   0.002601        7690        20   0.327520       0.004328       0.600899
         5   0.003991        7516        30   0.415854       0.004328       0.922215
         6   0.002758        7614        21   0.468084       0.004328       0.637242
         7   0.005012        7582        38   0.555019       0.004328       1.157971
         8   0.007733        7759        60   0.644172       0.004328       1.786665
         9   0.014911        7444       111   0.812353       0.004328       3.445199

Here we see that for this random sample of held out alerts, the CTR of those in the top decile of scores is 3.44 times that of the no model benchmark of just averaging over all the alerts and basically ignoring the scores. Not only that, the separation as we go from decile 0 to decile 9 is as we would hope with lower scored deciles having lower (and below 1) uplift factors.
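For illustration, a decile table like the one above could be built with pandas roughly as follows (a sketch: it assumes a dataframe with one row per scored alert and columns "prob" and "true", and it treats prob_true as the mean model score per decile; both column names and that interpretation are assumptions):

import pandas as pd

def decile_uplift(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per scored alert, with columns 'prob' (model score) and 'true' (clicked 0/1)."""
    df = df.copy()
    # bucket alerts into 10 roughly equal-sized groups by model score
    df["prob_ntile"] = pd.qcut(df["prob"], 10, labels=False, duplicates="drop")
    no_model_mean = df["true"].mean()  # overall CTR = the "no model" benchmark
    out = df.groupby("prob_ntile").agg(
        true_mean=("true", "mean"),    # observed CTR within the decile
        true_count=("true", "count"),  # number of alerts in the decile
        true_sum=("true", "sum"),      # number of clicked alerts in the decile
        prob_true=("prob", "mean"),    # mean model score in the decile (assumed meaning)
    )
    out["no_model_mean"] = no_model_mean
    out["uplift_factor"] = out["true_mean"] / out["no_model_mean"]
    return out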

Graphically we see something like this:

[image: uplift factor by score decile, candidate model]

For reference with the random model we see (no trend):

[image: uplift factor by score decile, random model (no trend)]

And for the simple model we do see some decent separation (e.g. top 25% vs bottom 25%), but it's nowhere near as good as the candidate model:

[image: uplift factor by score decile, simple model]

We have a clear ordering here as we go (left to right) from low score deciles to high score deciles.

In fact, the alerts in the top 10% are 38.5 times more likely to solicit a click than those in the bottom 10% (3.445199 / 0.089483). This is insanely good ranking and sorting, and something marketers would kill for if this were a marketing use case (which is the typical setting where approaches like this are used: you have some constraint and need to pick the best performing messages to send, and alert ranking is quite similar).

Design Considerations & Discussion

Deployment

We have used BentoML to containerize and package the HuggingFace inference pipeline.

Product and UX

So, once we deploy, we have an inference service that takes in the templated text for an alert and returns some JSON like this (in the example below the model is very confident that the input alert is much less likely than average to solicit a click):

{"label":"LABEL_0","score":0.9948800802230835}

User understanding

We need to build into the UI a very intuitive explanation of the scores: anything around 40-60% is "about as likely as average" to solicit a click, anything with, say, a 75%+ score is something the model thinks is much more likely than average to solicit a click, and anything at, say, 20% or less is considered less likely than average.

This part is tricky: we picked a balanced 50/50 split for training in part so that we would have a nice 0-100% score for users on the other side, but how we explain and expose this to users is a product challenge. We could explore some sort of category-based approach etc., but I think ideally we would just educate and explain the score with good documentation in the app.
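As a strawman for the category-based idea, a simple mapping from score to user-facing wording could look like the following (the cut-offs come from the thresholds mentioned above, but the exact values and labels are assumptions):

def score_to_category(prob_click: float) -> str:
    """Map a prob(click) score to a user-facing category (cut-offs are assumptions)."""
    if prob_click >= 0.75:
        return "much more likely than average to solicit a click"
    if prob_click <= 0.20:
        return "less likely than average to solicit a click"
    if 0.40 <= prob_click <= 0.60:
        return "about as likely as average to solicit a click"
    return "somewhat above/below average"

print(score_to_category(0.82))  # -> "much more likely than average to solicit a click"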

Product feature

One way to deliver the first product feature based on this would be when a user opens the alerts tab: we could trigger a batch inference for all the active alerts and then add the ranking as a new sortable column in the existing table. This would give users another way to quickly sort their alerts.

[image: alerts tab table with the ranking added as a new sortable column]

This would be a good starting point since it would only send traffic to the inference service when users actually look at their active alerts. Longer term we would of course want to run inference at or near the time the alert itself is generated, so as to enable more advanced dynamic routing logic based on the Alert CTR Score itself.
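A hypothetical sketch of that batch flow (the scoring call and alert structure here are stand-ins, not real NC or inference-service APIs):

def rank_active_alerts(alerts, score_alerts):
    """alerts: list of dicts, each with a pre-templated "text" field.
    score_alerts: hypothetical batch call into the inference service,
    returning one prob(click) per input text."""
    scores = score_alerts([a["text"] for a in alerts])
    for alert, prob_click in zip(alerts, scores):
        alert["alert_ctr_score"] = prob_click  # the new sortable column
    # highest-scored alerts first, i.e. those most likely to warrant attention
    return sorted(alerts, key=lambda a: a["alert_ctr_score"], reverse=True)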

andrewm4894 commented 1 year ago

@shyamvalsan fyi - the write-up of the "Alert Rank" LLM project I said I'd do. Just leaving this here; we can do some calls so I can walk you through it at some stage.