Adding some updates here on this as we have now had enough months of holdout data for really good validation. We have trained and re-trained the model each month and its performance is consistent and quite well calibrated.
Inputs are a 50/50 split of alerts with clicks (1's, or positive examples) and without clicks (0's, or negative examples). Here is an example input:
```json
{
  "text": "name is web_log_1m_successful .\nstatus from clear to warning .\nvalue is 90 % .\nchart is web_log_nginx.requests_by_type .\nfamily is requests .\nclassification is workload .\nrole is webmaster .\nwarning count is 2 .\ncritical count is 0 .\nduration hours is lte2 .\nnonclear duration hours is zero .",
  "label": 0
}
```
`text` is a templated text representation of everything we want the model to know about the alert:
```
name is web_log_1m_successful .
status from clear to warning .
value is 90 % .
chart is web_log_nginx.requests_by_type .
family is requests .
classification is workload .
role is webmaster .
warning count is 2 .
critical count is 0 .
duration hours is lte2 .
nonclear duration hours is zero .
```
So the template here looks like this:

```
name is <alert-name> .
status from <alert-status-prev> to <alert-status-current> .
value is <alert-value> <alert-units> .
chart is <chart-name> .
family is <chart-family> .
classification is <chart-workload> .
role is <alert-role> .
warning count is <warning-count-active> .
critical count is <critical-count-active> .
duration hours is <alert-status-duration> .
nonclear duration hours is <non-clear-status-duration-hours> .
```
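To make the templating concrete, here is a minimal sketch of how the `text` field could be rendered from an alert's raw fields. This is not the actual pipeline code and the dict keys are hypothetical; it just illustrates the transformation.

```python
# Hypothetical sketch of rendering the templated `text` from an alert's fields.
def alert_to_text(alert: dict) -> str:
    """Render an alert as the templated text the model is trained on."""
    lines = [
        f"name is {alert['name']} .",
        f"status from {alert['status_prev']} to {alert['status_current']} .",
        f"value is {alert['value']} {alert['units']} .",
        f"chart is {alert['chart']} .",
        f"family is {alert['family']} .",
        f"classification is {alert['classification']} .",
        f"role is {alert['role']} .",
        f"warning count is {alert['warning_count']} .",
        f"critical count is {alert['critical_count']} .",
        f"duration hours is {alert['duration_hours']} .",
        f"nonclear duration hours is {alert['nonclear_duration_hours']} .",
    ]
    return "\n".join(lines)

example = {
    "name": "web_log_1m_successful",
    "status_prev": "clear",
    "status_current": "warning",
    "value": "90",
    "units": "%",
    "chart": "web_log_nginx.requests_by_type",
    "family": "requests",
    "classification": "workload",
    "role": "webmaster",
    "warning_count": 2,
    "critical_count": 0,
    "duration_hours": "lte2",
    "nonclear_duration_hours": "zero",
}
print(alert_to_text(example))
```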
`label` is the 0 (no click) or 1 (click) outcome that we would like to model and predict.
Given these inputs we can treat this as a classification problem: given the `text`, we want to learn a model that produces useful `prob(label=1)` probabilities.
We have trained 3 models:

- `alert-rank` (Candidate Model): a `distilbert-base-uncased` text classification model using `AutoModelForSequenceClassification` from HuggingFace transformers (a rough fine-tuning sketch follows this list).
- `alert-rank-simple` (Simple Model Benchmark): a more traditional model using a random forest classifier and typical preprocessing of key features like `alert-name`, `alert-status` etc.
- `alert-rank-random` (Random Benchmark): the same model as `alert-rank` but trained on a random dataset where we have shuffled the `label` such that there should not be anything to learn.
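For reference, here is a minimal sketch of what fine-tuning the candidate model could look like with HuggingFace `transformers` and `datasets`. The file names and hyperparameters are illustrative assumptions, not the actual training configuration:

```python
# Sketch of fine-tuning distilbert-base-uncased on the templated alert text.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# jsonl files with {"text": ..., "label": 0/1} records, as in the example above.
dataset = load_dataset(
    "json", data_files={"train": "alerts_train.jsonl", "test": "alerts_test.jsonl"}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="alert-rank",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```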
As expected, the candidate `alert-rank` model has pretty good traditional performance metrics (we will discuss the metrics we actually care about later, but it is good to see things working as you would expect):
```
test classification report

              precision    recall  f1-score   support

           0       0.73      0.74      0.73     21919
           1       0.68      0.67      0.68     18476

    accuracy                           0.71     40395
   macro avg       0.70      0.70      0.70     40395
weighted avg       0.71      0.71      0.71     40395

test accuracy_score=0.7069
test precision_score=0.6827
test recall_score=0.6712
test f1_score=0.6769
```
The random model, as expected, shows no useful performance:
```
test classification report random

              precision    recall  f1-score   support

           0       0.54      1.00      0.70     21919
           1       0.00      0.00      0.00     18476

    accuracy                           0.54     40395
   macro avg       0.27      0.50      0.35     40395
weighted avg       0.29      0.54      0.38     40395

test random accuracy_score=0.5426
test random precision_score=0.0
test random recall_score=0.0
test random f1_score=0.0
```
The simple model, meanwhile, does seem to show some useful performance (albeit with a very different balance between precision and recall):
```
test classification report simple

              precision    recall  f1-score   support

           0       0.58      0.96      0.72     21919
           1       0.77      0.16      0.27     18476

    accuracy                           0.59     40395
   macro avg       0.67      0.56      0.49     40395
weighted avg       0.66      0.59      0.51     40395

test simple accuracy_score=0.5938
test simple precision_score=0.7655
test simple recall_score=0.1615
test simple f1_score=0.2667
```
Most important here is how well the trained model ranks the alerts each day. In this regard we have been running the model in batch mode on all alerts every day for the most recent month to date, which has not been included in training of the model.

Since we balance the training data as a 50/50 split but obviously don't for the holdout validation data (we just score all alerts to try to measure true impact), we need to look at performance metrics on the validation data a little differently.
Each day we score all alerts from a few days ago (to give enough time for ground truth labels to have arrived, i.e. by now each alert has either been clicked or not) and rank all alerts into 10 decile buckets based on the score (`prob(click)`) from the trained model.

When we do that we end up with a table like the one below (based on a random sample of holdout data):
| true prob ntile | true_mean | true_count | true_sum | prob_true | no model mean | uplift factor |
|---|---|---|---|---|---|---|
| 0 | 0.000387 | 7746 | 3 | 0.041177 | 0.004328 | 0.089483 |
| 1 | 0.000805 | 7458 | 6 | 0.105089 | 0.004328 | 0.185877 |
| 2 | 0.002105 | 7600 | 16 | 0.217449 | 0.004328 | 0.486412 |
| 3 | 0.003156 | 7605 | 24 | 0.263502 | 0.004328 | 0.729138 |
| 4 | 0.002601 | 7690 | 20 | 0.327520 | 0.004328 | 0.600899 |
| 5 | 0.003991 | 7516 | 30 | 0.415854 | 0.004328 | 0.922215 |
| 6 | 0.002758 | 7614 | 21 | 0.468084 | 0.004328 | 0.637242 |
| 7 | 0.005012 | 7582 | 38 | 0.555019 | 0.004328 | 1.157971 |
| 8 | 0.007733 | 7759 | 60 | 0.644172 | 0.004328 | 1.786665 |
| 9 | 0.014911 | 7444 | 111 | 0.812353 | 0.004328 | 3.445199 |
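For reference, the decile/uplift table above could be produced with something like the following pandas sketch. The input column names (`prob` for the model score, `clicked` for the observed label) are hypothetical:

```python
# Rough sketch of the decile / uplift calculation on the holdout alerts.
import pandas as pd

def decile_uplift(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 10 decile buckets based on the model score: 0 = lowest scores, 9 = highest.
    df["prob_ntile"] = pd.qcut(df["prob"], 10, labels=False, duplicates="drop")
    no_model_mean = df["clicked"].mean()  # overall CTR, ignoring the scores
    agg = df.groupby("prob_ntile").agg(
        true_mean=("clicked", "mean"),
        true_count=("clicked", "count"),
        true_sum=("clicked", "sum"),
        prob_true=("prob", "mean"),
    )
    agg["no_model_mean"] = no_model_mean
    agg["uplift_factor"] = agg["true_mean"] / no_model_mean
    return agg
```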
Here we see that, for this random sample of held-out alerts, the CTR of those in the top decile of scores is 3.44 times that of the no-model benchmark of just averaging over all the alerts and ignoring the scores. Not only that, the separation as we go from decile 0 to decile 9 is as we would hope, with lower-scored deciles having lower (and below 1) uplift factors.
Graphically we see something like this:
For reference with the random model we see (no trend):
And for the simple model we do see some decent separation (e.g. top 25% vs bottom 25%) but it's nowhere near as good as the candidate model:
We have a clear ordering here as we go (left to right) from the low-score deciles to the high-score deciles.
In fact, the alerts in the top 10% are 38.5 times more likely to solicit a click than those in the bottom 10%. This is insanely good ranking and sorting, the kind of separation marketers would kill for if this were a marketing use case (which is a typical use case for approaches like this - you have some limitation and need to somehow pick the best-performing messages to send, and alert ranking is quite similar).
A few additional implementation notes:

- We replace `alert-name` and `chart-name` (where they can be user specific) with a special token so that the model cannot overfit on any user-specific features in the `text`. The aim here is to just use the context of what we know about the alert itself to learn the score. This is crucial so that the trained model will be able to generalize and "just work" for new users.
- We round `<alert-value>` appropriately based on the alert units etc. The idea here is to be as efficient as possible with tokens for the LLM. So for example alert values like "94.2 %" for cpu units will just get rounded like "90 %". Basically we are nudging the model a little here to think of an alert value of 94.2% as basically the same as 96.7% etc. In some cases we actually want lower-resolution data or tokens in places, as this has helped a lot with stability and generalization.
- There is much less feature engineering needed for the `alert-rank` model than there would be for the more traditional model if we wanted to try to do additional feature processing to get the `alert-rank-simple` model performance up a bit more.
- We picked the `distilbert-base-uncased` model as it is relatively small and easy to deploy on traditional hardware, should be able to handle the latency and throughput requirements of our app, and is essentially just another fairly simple small microservice to deploy to the backend.

We have used BentoML to containerize and package the inference HuggingFace pipeline.
So, once we deploy, we have an inference service that takes in the templated `text` for an alert and returns some JSON like this (in the example below the model is very confident that the input alert is much less likely than average to solicit a click):

```json
{"label":"LABEL_0","score":0.9948800802230835}
```
We need to build into the UI a very intuitive explanation of the scores, where anything around 40-60% is "about as likely as average" to solicit a click, anything with, say, a 75%+ score is something the model thinks is much more likely than average to solicit a click, and anything at, say, 20% or less is considered less likely than average.
This part is tricky: we picked a balanced 50/50 split for training partly so that we would have a nice 0-100% score for users on the other side. But it is a product challenge to explain and expose this to users. We could explore some sort of category-based approach (see the sketch below), but I think ideally we would just educate and explain the score with good documentation in the app.
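As a rough illustration of what a category-based framing could look like, using the thresholds from the paragraph above (names and cut-offs are illustrative only, not a product decision):

```python
# Hypothetical mapping from the model score to a user-facing category.
def score_to_category(prob_click: float) -> str:
    if prob_click >= 0.75:
        return "much more likely than average to be clicked"
    if prob_click <= 0.20:
        return "less likely than average to be clicked"
    if 0.40 <= prob_click <= 0.60:
        return "about as likely as average to be clicked"
    return "no strong signal either way"
```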
One way to deliver the first product feature based on this would be when a user opens the alerts tab: we could trigger a batch inference for all the active alerts and then add the ranking as a new sortable column in the existing table. This would give users another way to quickly sort their alerts.
This would be a good starting point since it will only trigger traffic to the inference service when users actually look at all of their active alerts. Longer term we would of course want to run inference at or near the time the alert itself is generated, so as to enable more advanced dynamic routing logic based on the Alert CTR Score itself.
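A rough sketch of that batch flow, assuming the `classifier` pipeline from the inference sketch above and a hypothetical list of active alert dicts with a pre-rendered `text` field:

```python
# Score all active alerts in one batch and sort by the Alert CTR Score,
# e.g. to populate a new sortable column in the alerts table.
def rank_active_alerts(active_alerts, classifier):
    texts = [a["text"] for a in active_alerts]
    results = classifier(texts)  # one {"label": ..., "score": ...} per alert
    for alert, res in zip(active_alerts, results):
        score = res["score"] if res["label"] == "LABEL_1" else 1 - res["score"]
        alert["ctr_score"] = score
    # Highest-scored alerts first.
    return sorted(active_alerts, key=lambda a: a["ctr_score"], reverse=True)
```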
@shyamvalsan fyi - the write up of the "Alert Rank" LLM project I said I'd do. Just leaving this here and we can do some calls so I can walk you through it at some stage.
### Problem
Given all the information we have in an alert notification, can we use ML to build a model to predict which alerts are more or less likely to solicit some action from the user?
If we could, then this "alert ctr score" or "alert rank" score could be used by users to route or help prioritize alerts.
This builds on initial research work done in this GH discussion.
Following this work we are now at the point where we could train an initial model based on the last 6 months or so of alert data, build a prediction endpoint, and in some way then expose or make available this `probability(alert_click)` score in NC alert templates etc.

### Description
This would be an API endpoint that can take in the text of an alert and produce back a score from 0-100% for the probability of a click being observed on that alert, given all the training data.
For example a ctr prob of:
### Importance
really want
### Value proposition
### Proposed implementation