online-ml / chantilly

🍦 Deployment tool for online machine learning models
BSD 3-Clause "New" or "Revised" License

Update metrics - possibly missing a case? #24

Closed vsoch closed 2 years ago

vsoch commented 2 years ago

heyo! So I am using the same logic as chantilly after I've hit the learn endpoint and want to update metrics. Here is the basic logic refactored into its own function:

def update_metrics(self, prediction, ground_truth, model_name):
    """
    Given a prediction, update metrics to reflect it.
    """
    # Update the metrics
    metrics = self.db[f"metrics/{model_name}"]

    # At this point prediction is a dict. It might be empty because no training data has been seen
    if not prediction:
        return metrics

    for metric in metrics:

        # If the metric requires labels but the prediction is a dict, then we need to retrieve the
        # predicted label with the highest probability
        if (
            isinstance(metric, ClassificationMetric)
            and metric.requires_labels
            and isinstance(prediction, dict)
        ):
            pred = max(prediction, key=prediction.get)
            metric.update(y_true=ground_truth, y_pred=pred)
        else:
            metric.update(y_true=ground_truth, y_pred=prediction)
    self.db[f"metrics/{model_name}"] = metrics
    return metrics

This works fine for the test.py regression case, but when I was testing a binary model, I found that one of the metrics (I think Mean Squared Error, MSE?) wouldn't enter the first if case because it isn't a classification metric, so it would hit the else branch and fail because, at that point, prediction was either an empty dict or a filled dict (which can't be passed to that function). When I inspected it I'd see something like:

{}
# or
{1: 1.0}

So I think what was happening is that the first binary model prediction returns an empty dict because the model doesn't know anything yet. And then after that I think the dict has keys that are ground truths and values that are the predictions? So first I added this:

# In some cases the prediction is a dict but not a ClassificationMetric
elif isinstance(prediction, dict):
    y_pred = prediction[ground_truth]
    metric.update(y_true=ground_truth, y_pred=y_pred)
else:
    metric.update(y_true=ground_truth, y_pred=prediction)
self.db[f"metrics/{model_name}"] = metrics

And that worked for my debugging session while I had a model already established, but when I restarted from scratch I got a KeyError - it turns out the ground truth isn't always a key in the prediction dict. So I changed it to:

# In some cases the prediction is a dict but not a ClassificationMetric
elif isinstance(prediction, dict):
    y_pred = prediction.get(ground_truth)
    # The ground truth value isn't always known
    if y_pred:
        metric.update(y_true=ground_truth, y_pred=y_pred)
else:
    metric.update(y_true=ground_truth, y_pred=prediction)
self.db[f"metrics/{model_name}"] = metrics

This felt a little funky to me, so I wanted to double-check the correct way to go about this. This is a binary model as defined here in chantilly. I suspect this will be tweaked further when I update to allow learning without a true label (e.g., the unsupervised cases I asked about in your talk!)

And then a follow up question - are there any river datasets / examples for multiclass? Thank you!

MaxHalford commented 2 years ago

Hey 👋

So one thing is that you shouldn't be using regression metrics in conjunction with classification models. They're incompatible by design. Each metric has a works_with method to help you check if a model is compatible with a metric.
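
For example (writing this from memory, so double-check the exact outputs):

from river import linear_model, metrics

model = linear_model.LogisticRegression()

# A classification metric is happy with a classifier...
print(metrics.Accuracy().works_with(model))  # True

# ...while a regression metric such as MAE is not
print(metrics.MAE().works_with(model))  # False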

And then a follow up question - are there any river datasets / examples for multiclass? Thank you!

Yes, I suggest checking out the multiclass module :)

vsoch commented 2 years ago

Huh, that must be a bug then, because I do create the model as a binary type and not regression. I can look into this one evening this week / this weekend. I refactored the design a bit so the metrics are namespaced with the model name, so as long as the original model's flavor is captured when it's added, they should follow suit - but I must have a bug.

Yes, I suggest checking out the multiclass module :)

Gah now I just feel silly, I was searching the datasets looking for hints of multiclass... sorry nothing to see here!

I'll close this issue after I investigate the potential bug - I'm still developing a lot pretty quickly and things will be more stable after I add the testing suite.

MaxHalford commented 2 years ago

Gah now I just feel silly

Don't! Even I get lost in all the River modules we have 😅

I'll close this issue after I investigate the potential bug - I'm still developing a lot pretty quickly and things will be more stable after I add the testing suite.

Sounds good! The work you're doing is really cool.

vsoch commented 2 years ago

Yep, it was indeed a stupid bug - I forgot a format string "f", so it was literally using the key "flavor/{name}". So this:

def init_metrics(name: str):
    db = get_db()
    try:
-        flavor = db["flavor/{name}"]
+        flavor = db[f"flavor/{name}"]
    except KeyError:
        raise exceptions.FlavorNotSet
    db[f"metrics/{name}"] = flavor.default_metrics()
vsoch commented 2 years ago

Thanks again for the help! I'm definitely making good progress - it's more fun and addictive than "real" work haha. :)

MaxHalford commented 2 years ago

It's cool that you're using the shelve library too, I like it a lot.

it's more fun and addictive than "real" work haha

Story of my life haha.

MaxHalford commented 2 years ago

Just thinking out loud: one of the things I would like this kind of platform to support is non-HTTP traffic. For instance, being able to handle WebSockets and/or SSE would be cool. Indeed, I think that persistent connections play nicely with online learning and sensors etc. I had my eyes on using FastAPI for this reason. Anyway, just some food for thought! I'm sure the same can be done with Django.

vsoch commented 2 years ago

Django can handle WebSockets, but FastAPI is definitely faster than Django! And I agree that would be neat. The way I've designed Django River ML is to have:

  1. a pretty generic internal client that I could easily plug into another application backend (e.g., FastAPI), with only tweaks to the setup, middleware, and how the API endpoints are created,
  2. a specification for the endpoints (I haven't written it out fully - it's in a markdown draft not committed to version control yet, but it will be soon), and
  3. a common terminal client to make interaction easy (see https://github.com/vsoch/riverapi - sorry, no pretty docs there yet but coming soon!)

And these things will make it easy to plug into whatever other Python frameworks we would be interested in! And I'd be happy to make one in FastAPI too, although I'd like to finish Django River ML first because I'm pretty excited to test it out for my use case :)

vsoch commented 2 years ago

And that reminds me, I was starting to brainstorm what front-end views we might provide for basic functionality. It's a plugin, so we can't take over the entire design of an app, but I think some basic demo or views (even to include elsewhere) would be neat. I'm going to sleep now but something to think about - let me know what ideas you have!

MaxHalford commented 2 years ago

a pretty generic internal client that I could easily plug into another application backend (e.g., FastAPI), with only tweaks to the setup, middleware, and how the API endpoints are created

Exactly! In fact it should be able to do this online learning dance without any connection to the internet. For instance, when running a model on a closed-off device where the loop is self-contained. What matters here is to create the right abstractions via interfaces.

a common terminal client to make interaction easy (see https://github.com/vsoch/riverapi - sorry, no pretty docs there yet but coming soon!)

Agreed! You need a client to interact with the "platform".

And I'd be happy to make one in FastAPI too, although I'd like to finish Django River ML first because I'm pretty excited to test it out for my use case :)

Enjoy :)

And that reminds me, I was starting to brainstorm what front-end views we might provide for basic functionality. It's a plugin, so we can't take over the entire design of an app, but I think some basic demo or views (even to include elsewhere) would be neat.

Yes I agree with that. My experience here is that the interface should be read-only: you can't use the interface to add/remove models or what not. It's just a view.

vsoch commented 2 years ago

A follow-up question! I'm testing the multi-class example; here is my model:

In [18]: model
Out[18]: 
Pipeline (
  StandardScaler (
    with_std=True
  ),
  OneVsOneClassifier (
    classifier=LogisticRegression (
      optimizer=SGD (
        lr=Constant (
          learning_rate=0.01
        )
      )
      loss=Log (
        weight_pos=1.
        weight_neg=1.
      )
      l2=0.
      intercept_init=0.
      intercept_lr=Constant (
        learning_rate=0.01
      )
      clip_gradient=1e+12
      initializer=Zeros ()
    )
  )
)

And it has the function (tab completion shows it), but I suspect it just raises NotImplementedError:

model.predict_
                        predict_many()       predict_proba_many()
                        predict_one()        predict_proba_one() 

So the question for the server - the standard case is always predicting one, but this doesn't seem to work here for this multiclass flavor:

~/anaconda3/envs/river/lib/python3.10/site-packages/river/base/classifier.py in predict_proba_one(self, x)
     49         # method that each classifier has to implement. Instead, we raise an exception to indicate
     50         # that a classifier does not support predict_proba_one.
---> 51         raise NotImplementedError
     52 
     53     def predict_one(self, x: dict) -> base.typing.ClfTarget:

NotImplementedError:

I think this happened because we are checking for the functions as attributes on the model, but since the abstract class implements it, it is technically there

def check_model(self, model):
    for method in ("learn_one", "predict_proba_one"):
        if not hasattr(model, method):
            return False, f"The model does not implement {method}."
    return True, None

but not actually implemented there (looks like we have predict_one and learn_one?) https://github.com/online-ml/river/blob/ec1cf318310add301afe12160cebc66eaebcec2c/river/multiclass/ovo.py#L74-L97

Anyhoo - so my questions are:

  1. How should we handle this check method?
  2. What prediction/learn functions should the multiclass flavor use?

I'm also noticing that these datasets are listed under multiclass but labeled as binary, with a note that "a multiclass should work too!"

(screenshot of the dataset documentation)

That alone might be the issue - that we aren't expected to use multiclass? But I was hoping to find a dataset / example that can use it (and I would like the server to support it!)

MaxHalford commented 2 years ago

I think this happened because we are checking for the functions as attributes on the model, but since the abstract class implements it, it is technically there

This is specific to that multi-class model; it doesn't support outputting probabilities. It implements predict_one, but not predict_proba_one. This is very much an edge-case, because you should expect every classifier (regardless of binary or multi-class) to implement predict_proba_one.

How should we handle this check method?

I'm not sure how to check whether a method is implemented, given the way River's base classes are currently set up.

What prediction/learn functions should the multiclass flavor use?

It should always be predict_proba_one and learn_one. But I would try/except the predict_proba_one call and fall back to predict_one if the former raises NotImplementedError.

vsoch commented 2 years ago

I'll try that! And eventually there should be some clarity for the user about which models will work and which won't.

vsoch commented 2 years ago

I tried changing to predict_one to avoid the implementation error (that worked!), but the ground truth and prediction I get back are both strings, and when that hits the update_metrics function it falls into the last if/else:

In [5]: prediction
Out[5]: 'path'

In [6]: ground_truth
Out[6]: 'sky'

and results in this error in river:

In [4]:                 metric.update(y_true=ground_truth, y_pred=prediction)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/Desktop/Code/django-river-ml/django_river_ml/client.py in <module>
----> 1 metric.update(y_true=ground_truth, y_pred=prediction)

~/anaconda3/envs/river/lib/python3.10/site-packages/river/metrics/base.py in update(self, y_true, y_pred, sample_weight)
    424 
    425     def update(self, y_true, y_pred, sample_weight=1.0):
--> 426         self._mean.update(x=self._eval(y_true, y_pred), w=sample_weight)
    427         return self
    428 

~/anaconda3/envs/river/lib/python3.10/site-packages/river/metrics/cross_entropy.py in _eval(self, y_true, y_pred)
     46 
     47     def _eval(self, y_true, y_pred):
---> 48         return optim.losses.CrossEntropy()(y_true, y_pred)

~/anaconda3/envs/river/lib/python3.10/site-packages/river/optim/losses.py in __call__(self, y_true, y_pred)
    246         total = 0
    247 
--> 248         for label, proba in y_pred.items():
    249             if y_true == label:
    250                 total += self.class_weight.get(label, 1.0) * math.log(

AttributeError: 'str' object has no attribute 'items'

Might it be the case that predict_one is returning the wrong format? Or are we in another case of a model being incorrectly matched with the metrics? For metrics I have:

[Accuracy: 0.00%,
 CrossEntropy: 0.,
 MacroPrecision: 0.,
 MacroRecall: 0.,
 MacroF1: 0.,
 MicroPrecision: 0.,
 MicroRecall: 0.,
 MicroF1: 0.]

model flavor:

Pipeline (
  StandardScaler (
    with_std=True
  ),
  OneVsOneClassifier (
    classifier=LogisticRegression (
      optimizer=SGD (
        lr=Constant (
          learning_rate=0.01
        )
      )
      loss=Log (
        weight_pos=1.
        weight_neg=1.
      )
      l2=0.
      intercept_init=0.
      intercept_lr=Constant (
        learning_rate=0.01
      )
      clip_gradient=1e+12
      initializer=Zeros ()
    )
  )
)

I've never used logistic regression in a multiclass case - I'm used to getting a value between 0 and 1 and applying some threshold for two classes. For reference I was looking here: https://riverml.xyz/latest/api/multiclass/OneVsOneClassifier/, which groups it under multiclass (hence why I'm probably incorrectly using it here!)

And I'm thinking about the design of model "flavors" - if it turns out to be hard to generalize models into these flavors, it might make sense to have a direct lookup of "for model X use this base class, metrics, etc.", but maybe we can still get it working with the more general flavors already here.

Thanks for your help! I should be able to make some more time this weekend to work on river - had a busy end of the week!

MaxHalford commented 2 years ago

Mmm I'm not sure I fully understand what you're doing.

Classification metrics expect labels or dictionaries with a probability for each label. Each classification metric has a requires_labels property indicating which of the two it needs. Does that help?
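
For instance, roughly (again from memory, so double-check the exact behaviour):

from river import metrics

# Label-based metrics expect a single label for y_pred, e.g. 'sky'
print(metrics.Accuracy().requires_labels)      # True

# Probability-based metrics expect a dict with a probability per label,
# e.g. {'sky': 0.7, 'path': 0.3}
print(metrics.CrossEntropy().requires_labels)  # False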

vsoch commented 2 years ago

Sorry for not being clear - let me try to walk through it step by step.

  1. So I start with the wrapped LogisticRegression example that I linked above. It's under "multiclass" and I chose it to test that endpoint.
  2. This results in the MultiClass model, and the following metrics:
[Accuracy: 0.00%,
 CrossEntropy: 0.,
 MacroPrecision: 0.,
 MacroRecall: 0.,
 MacroF1: 0.,
 MicroPrecision: 0.,
 MicroRecall: 0.,
 MicroF1: 0.]

and this is correct - it at least matches what Chantilly does here

  3. When we are updating metrics (this part of Chantilly), we hit the CrossEntropy metric.

So now we have:

In [26]: ground_truth
Out[26]: 'sky'

In [27]: prediction
Out[27]: 'path'

And calling this method:

 metric.update(y_true=ground_truth, y_pred=prediction)

results in the error above - the prediction is a string and not a dict.

If we trace back up to where the original error was, to recall our previous discussion, the flavor is multiclass (since the model comes from the multiclass examples):

flavor
Out[30]: <django_river_ml.flavors.MultiClassFlavor at 0x7fa1b8e11900>

so the default pred_func is predict_proba_one, which fails with NotImplementedError. So I fall back to predict_one, and that is why we get a string back.

So I think what I'm hearing is that I'm not allowed to provide a str to this metric. So here are some options for moving forward:

  1. This model is in fact not multiclass and I shouldn't be using it in that context, period - in which case my suggestion for the implementation is that we do a better mapping of model -> model type and remove the need for the user to specify it (and possibly get it wrong)
  2. This case is possible, but the label I get back needs to be reformatted into a different structure for it to work
  3. This case is possible, but it should not be allowed to update a metric given that a str is returned

In any case, I do think this feels a little buggy and we should get to the bottom of it - let me know what questions you have!

MaxHalford commented 2 years ago

Ok I think I understand. Thanks for the details.

Your problem is that OneVsOneClassifier doesn't implement predict_proba_one. So you can't produce probabilities. This means that the metrics which require probability estimates, namely CrossEntropy, should not be updated, period.

How I would handle this:

  1. Try to call predict_proba_one
  2. Catch NotImplementedError
  3. Fall back to predict_one
  4. Loop over the metrics
  5. Skip cases where metric.requires_labels == False

How does that sound?
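
Something like this, as a rough sketch (from memory, so treat it as pseudo-code; model, features, ground_truth, and metrics are whatever you already have in your update path):

from river.metrics.base import ClassificationMetric

try:
    # Preferred: a dict with a probability per label, e.g. {'sky': 0.7, 'path': 0.3}
    prediction = model.predict_proba_one(features)
except NotImplementedError:
    # Fallback for models like OneVsOneClassifier: a plain label, e.g. 'sky'
    prediction = model.predict_one(features)

# (as in your update_metrics snippet, bail out early here if prediction is empty)

for metric in metrics:
    if isinstance(metric, ClassificationMetric):
        if metric.requires_labels:
            # Label-based metric: reduce probabilities to the most likely label
            y_pred = (
                max(prediction, key=prediction.get)
                if isinstance(prediction, dict)
                else prediction
            )
            metric.update(y_true=ground_truth, y_pred=y_pred)
        elif isinstance(prediction, dict):
            # Probability-based metric (e.g. CrossEntropy): only update when we
            # actually have probabilities; skip when we fell back to predict_one
            metric.update(y_true=ground_truth, y_pred=prediction)
    else:
        metric.update(y_true=ground_truth, y_pred=prediction)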

vsoch commented 2 years ago

Okay, that worked! For a regression model with mean absolute error (MAE) there simply wasn't a requires_labels attribute, and it hit the last statement because it's not a ClassificationMetric (this is a regression model). So instead of checking for that attribute, I'm going to do the dumb Pythonic thing and just try/except: if the metric can be updated, great; if not, it's probably not intended for this case, or (if it's a bug) we'll find it eventually :)
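
Roughly what I have in mind (just a sketch):

for metric in metrics:
    try:
        metric.update(y_true=ground_truth, y_pred=prediction)
    except Exception:
        # The metric and the prediction type don't match (or it's a bug we'll
        # catch later) - just skip it
        continue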

I'm done adding the multiclass support; next I'm going to do the new label endpoint, write some tests, and then open a PR - I'll post here when it's in!

vsoch commented 2 years ago

Okay, tiny progress! https://github.com/vsoch/django-river-ml/pull/7 This is great - we have label and a multiclass example, and I think next I'm going to take a shot at a KMeans model (or unsupervised). Thanks again for your help - I suspect it's late in your time zone, so I won't ping further today!

MaxHalford commented 2 years ago

Looks cool! Keep it up :)

(or unsupervised)

Yes I believe there is a lot of interest in supporting anomaly detection.