netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.

[Feat]: introduce some form of "Alert CTR Score" #760

Open andrewm4894 opened 1 year ago

andrewm4894 commented 1 year ago

Problem

Given all the information we have in an alert notification, can we use ML to build a model to predict which alerts are more or less likely to solicit some action from the user?

If we could, then this "alert ctr score" or "alert rank" score could be used by users to route or help prioritize alerts.

This builds on initial research work done in this GH discussion.

Following this work, we are now at the point where we could train an initial model on the last 6 months or so of alert data, build a prediction endpoint, and then expose or make available this probability(alert_click) score in NC alert templates etc.

Description

This would be an API endpoint that takes in the text of an alert and returns a score from 0-100% for the probability of a click being observed on that alert, given all the training data.

For example a ctr prob of:

Importance

really want

Value proposition

  1. additional "context" to each alert. Think of it as saying "based on what we observe across the netdata community as a whole, this alert would on average tend to have a higher click or response rate, so maybe you might want to take that into account".

Proposed implementation

andrewm4894 commented 1 year ago

Adding some updates here, as we have now had enough months of holdout data for solid validation. We have trained and re-trained the model each month, and its performance is consistent and quite well calibrated.

Inputs

Inputs are a 50/50 split of alerts with clicks (1's, or positive examples) and without clicks (0's, or negative examples). Here is an example input:

{
  "text": "name is web_log_1m_successful .\nstatus from clear to warning .\nvalue is 90 % .\nchart is web_log_nginx.requests_by_type .\nfamily is requests .\nclassification is workload .\nrole is webmaster .\nwarning count is 2 .\ncritical count is 0 .\nduration hours is lte2 .\nnonclear duration hours is zero .",
  "label": 0
}

text is a templated text representation of everything we want the model to know about the alert:

name is web_log_1m_successful . 
status from clear to warning . 
value is 90 % . 
chart is web_log_nginx.requests_by_type . 
family is requests . 
classification is workload . 
role is webmaster . 
warning count is 2 . 
critical count is 0 . 
duration hours is lte2 . 
nonclear duration hours is zero .

so the template here looks like this:

name is <alert-name> . 
status from <alert-status-prev> to <alert-status-current> . 
value is <alert-value> <alert-units> . 
chart is <chart-name> . 
family is <chart-family> . 
classification is <chart-workload> . 
role is <alert-role> . 
warning count is <warning-count-active> . 
critical count is <critical-count-active> . 
duration hours is <alert-status-duration> . 
nonclear duration hours is <non-clear-status-duration-hours> .

label is the 0 (no click) or 1 (click) outcome that we would like to model and predict.

Given these inputs we can treat this as a classification problem: given the text, we want to learn a model that produces useful prob(label=1) probabilities.
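As a rough illustration (the field names below are assumptions for the sketch, not the actual NC schema), the templated text could be rendered from an alert record with something like this:

def alert_to_text(alert: dict) -> str:
    """Render an alert record into the templated sentence format above."""
    parts = [
        f"name is {alert['name']} .",
        f"status from {alert['status_prev']} to {alert['status_current']} .",
        f"value is {alert['value']} {alert['units']} .",
        f"chart is {alert['chart']} .",
        f"family is {alert['family']} .",
        f"classification is {alert['classification']} .",
        f"role is {alert['role']} .",
        f"warning count is {alert['warning_count']} .",
        f"critical count is {alert['critical_count']} .",
        f"duration hours is {alert['duration_hours']} .",
        f"nonclear duration hours is {alert['nonclear_duration_hours']} .",
    ]
    return "\n".join(parts)

example = {
    "name": "web_log_1m_successful",
    "status_prev": "clear", "status_current": "warning",
    "value": 90, "units": "%",
    "chart": "web_log_nginx.requests_by_type",
    "family": "requests", "classification": "workload", "role": "webmaster",
    "warning_count": 2, "critical_count": 0,
    "duration_hours": "lte2", "nonclear_duration_hours": "zero",
}

print(alert_to_text(example))  # reproduces the example "text" shown above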

Models

We have trained 3 models: the candidate alert-rank model, a random baseline model, and a simple baseline model.

As expected, the candidate alert-rank model has pretty good traditional performance metrics (we will discuss the metrics we actually care about later, but it is good to see things working as you would expect):

test classification report
              precision    recall  f1-score   support

           0       0.73      0.74      0.73     21919
           1       0.68      0.67      0.68     18476

    accuracy                           0.71     40395
   macro avg       0.70      0.70      0.70     40395
weighted avg       0.71      0.71      0.71     40395

test accuracy_score=0.7069
test precision_score=0.6827
test recall_score=0.6712
test f1_score=0.6769
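For reference, metrics in this format could be produced with scikit-learn along the following lines (a minimal sketch; y_test and y_pred here are placeholders standing in for the real holdout labels and predictions):

from sklearn.metrics import (
    accuracy_score, classification_report, f1_score, precision_score, recall_score,
)

y_test = [0, 0, 1, 1, 0, 1]  # placeholder ground-truth labels (clicked or not)
y_pred = [0, 1, 1, 1, 0, 0]  # placeholder model predictions

print("test classification report")
print(classification_report(y_test, y_pred))
print(f"test accuracy_score={accuracy_score(y_test, y_pred):.4f}")
print(f"test precision_score={precision_score(y_test, y_pred):.4f}")
print(f"test recall_score={recall_score(y_test, y_pred):.4f}")
print(f"test f1_score={f1_score(y_test, y_pred):.4f}")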

The random model, as expected, shows no useful performance:

test classification report random
              precision    recall  f1-score   support

           0       0.54      1.00      0.70     21919
           1       0.00      0.00      0.00     18476

    accuracy                           0.54     40395
   macro avg       0.27      0.50      0.35     40395
weighted avg       0.29      0.54      0.38     40395

test random accuracy_score=0.5426
test random precision_score=0.0
test random recall_score=0.0
test random f1_score=0.0

The simple model, meanwhile, does seem to show some useful performance (albeit with a very different balance between precision and recall):

test classification report simple
              precision    recall  f1-score   support

           0       0.58      0.96      0.72     21919
           1       0.77      0.16      0.27     18476

    accuracy                           0.59     40395
   macro avg       0.67      0.56      0.49     40395
weighted avg       0.66      0.59      0.51     40395

test simple accuracy_score=0.5938
test simple precision_score=0.7655
test simple recall_score=0.1615
test simple f1_score=0.2667

Holdout Validation

Most important here is how well the trained model ranks the alerts each day. To measure this, we have been running the model in batch mode on all alerts every day for the most recent month to date, which has not been included in the model's training data.

Since we balance the training data as a 50/50 split but obviously don't for the holdout validation data (we just score all alerts to try to measure true impact), we need to look at performance metrics on the validation data a little differently.

Each day we score all alerts from a few days ago (to give enough time for ground truth labels to arrive - i.e. by now each alert has either been clicked or not) and rank all alerts into 10 decile buckets based on the score (prob(click)) from the trained model.

When we do that we end up with a table like below (based on a random sample of holdout data):

prob_ntile  true_mean  true_count  true_sum  prob_true  no_model_mean  uplift_factor
         0   0.000387        7746         3   0.041177       0.004328       0.089483
         1   0.000805        7458         6   0.105089       0.004328       0.185877
         2   0.002105        7600        16   0.217449       0.004328       0.486412
         3   0.003156        7605        24   0.263502       0.004328       0.729138
         4   0.002601        7690        20   0.327520       0.004328       0.600899
         5   0.003991        7516        30   0.415854       0.004328       0.922215
         6   0.002758        7614        21   0.468084       0.004328       0.637242
         7   0.005012        7582        38   0.555019       0.004328       1.157971
         8   0.007733        7759        60   0.644172       0.004328       1.786665
         9   0.014911        7444       111   0.812353       0.004328       3.445199

Here we see that for this random sample of held out alerts, the CTR of those in the top decile of scores is 3.44 times that of the no model benchmark of just averaging over all the alerts and basically ignoring the scores. Not only that, the separation as we go from decile 0 to decile 9 is as we would hope with lower scored deciles having lower (and below 1) uplift factors.
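For illustration, a decile table like the one above could be built with pandas roughly as follows (a sketch: it assumes a dataframe with one row per scored alert and columns "prob" and "true", and it treats prob_true as the mean model score per decile; both column names and that interpretation are assumptions):

import pandas as pd

def decile_uplift(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per scored alert, with columns 'prob' (model score) and 'true' (clicked 0/1)."""
    df = df.copy()
    # bucket alerts into 10 roughly equal-sized groups by model score
    df["prob_ntile"] = pd.qcut(df["prob"], 10, labels=False, duplicates="drop")
    no_model_mean = df["true"].mean()  # overall CTR = the "no model" benchmark
    out = df.groupby("prob_ntile").agg(
        true_mean=("true", "mean"),    # observed CTR within the decile
        true_count=("true", "count"),  # number of alerts in the decile
        true_sum=("true", "sum"),      # number of clicked alerts in the decile
        prob_true=("prob", "mean"),    # mean model score in the decile (assumed meaning)
    )
    out["no_model_mean"] = no_model_mean
    out["uplift_factor"] = out["true_mean"] / out["no_model_mean"]
    return out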

Graphically we see something like this:

[image: uplift factor by score decile, candidate model]

For reference with the random model we see (no trend):

[image: uplift factor by score decile, random model (no trend)]

And for the simple model we do see some decent separation (e.g. top 25% vs bottom 25%), but it's nowhere near as good as the candidate model:

[image: uplift factor by score decile, simple model]

We have a clear ordering here as we go (left to right) from low score deciles to high score deciles.

In fact, the alerts in the top 10% are 38.5 times more likely to solicit a click than those in the bottom 10% (3.445199 / 0.089483). This is insanely good ranking and sorting, and something marketers would kill for if this were a marketing use case (which is the typical setting where approaches like this are used: you have some constraint and need to pick the best performing messages to send, and alert ranking is quite similar).

Design Considerations & Discussion

Deployment

We have used BentoML to containerize and package the HuggingFace inference pipeline.

Product and UX

So, once we deploy, we have an inference service that takes in the templated text for an alert and returns some JSON like this (in the example below the model is very confident that the input alert is much less likely than average to solicit a click):

{"label":"LABEL_0","score":0.9948800802230835}

User understanding

We need to build into the UI a very intuitive explanation of the scores: anything around 40-60% is "about as likely as average" to solicit a click, anything with, say, a 75%+ score is something the model thinks is much more likely than average to solicit a click, and anything at, say, 20% or less is considered less likely than average.

This part is tricky: we picked a balanced 50/50 split for training in part so that we would have a nice 0-100% score for users on the other side, but how we explain and expose this to users is a product challenge. We could explore some sort of category-based approach etc., but I think ideally we would just educate and explain the score with good documentation in the app.
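As a strawman for the category-based idea, a simple mapping from score to user-facing wording could look like the following (the cut-offs come from the thresholds mentioned above, but the exact values and labels are assumptions):

def score_to_category(prob_click: float) -> str:
    """Map a prob(click) score to a user-facing category (cut-offs are assumptions)."""
    if prob_click >= 0.75:
        return "much more likely than average to solicit a click"
    if prob_click <= 0.20:
        return "less likely than average to solicit a click"
    if 0.40 <= prob_click <= 0.60:
        return "about as likely as average to solicit a click"
    return "somewhat above/below average"

print(score_to_category(0.82))  # -> "much more likely than average to solicit a click"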

Product feature

One way to deliver the first product feature based on this would be when a user opens the alerts tab: we could trigger a batch inference for all the active alerts and then add the ranking as a new sortable column in the existing table. This would give users another way to quickly sort their alerts.

[image: alerts tab table with the ranking added as a new sortable column]

This would be a good starting point since it would only send traffic to the inference service when users actually look at their active alerts. Longer term we would of course want to run inference at or near the time the alert itself is generated, so as to enable more advanced dynamic routing logic based on the Alert CTR Score itself.
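A hypothetical sketch of that batch flow (the scoring call and alert structure here are stand-ins, not real NC or inference-service APIs):

def rank_active_alerts(alerts, score_alerts):
    """alerts: list of dicts, each with a pre-templated "text" field.
    score_alerts: hypothetical batch call into the inference service,
    returning one prob(click) per input text."""
    scores = score_alerts([a["text"] for a in alerts])
    for alert, prob_click in zip(alerts, scores):
        alert["alert_ctr_score"] = prob_click  # the new sortable column
    # highest-scored alerts first, i.e. those most likely to warrant attention
    return sorted(alerts, key=lambda a: a["alert_ctr_score"], reverse=True)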

andrewm4894 commented 1 year ago

@shyamvalsan fyi - the write-up of the "Alert Rank" LLM project I said I'd do. Just leaving this here; we can do some calls so I can walk you through it at some stage.