online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License
5.07k stars 544 forks source link

Question about centroid changes #891

Closed vsoch closed 2 years ago

vsoch commented 2 years ago

heyo! So this might be better suited for chantilly, but it's generally about river so I hope it's okay to put here. I'm creating a server that has a cluster model on the backend to handle clustering types of errors. I was planning on using a cluster model and then assigning new errors (saved as objects in the database) to clusters, but I realize if I have a preference for models with a changing number of centers then my assignments would also likely change over time (and not be valid). This makes the idea of saving the cluster id with the error object not a great idea. But then I was thinking - as long as we have some kind of ability for a model to output changes (e.g.,:

Then it could be feasible to act on those events, and in the case of the above:

So my question - how do we integrate this functionality into river, at least exposing enough for a wrapper of some kind to assess changes? Or is it just a bad design idea on my part?

vsoch commented 2 years ago

Here's an idea - what if I could use a KNN model alongside this one, and then for the sliding window here https://github.com/online-ml/river/blob/9ea7707636e825a374cf053d149e3946c746ce2d/river/neighbors/base_neighbors.py#L101-L102 keep a vector of identifiers (of my own choosing) that I could relate back to my own objects or data, and then have a flavor of KNN that returns those full centroids / identifiers instead? It looks like right now the KNN models here always attempt to classify and none of them actually give back a group of meaningful centroids (e.g., here are data points used in the window that you provided earlier, do as you please with them!) Would this be of interest to others? (it's of interest to me!)

vsoch commented 2 years ago

And follow up question (I implemented the model above, am testing) can we make knn flexible to different numbers of features? E.g., here:

if self._n_features != get_dimensions(x)[1]:
            raise ValueError(
                "Inconsistent number of features in X: {}, previously observed {}.".format(
                    get_dimensions(x)[1], self._n_features
                )
            )
smastelini commented 2 years ago

Hi @vsoch, indeed the current KNN have some limitations. A direct solution for different sets of features is to always rely on exhaustive search.

I'm currently working on ways of improve online knn as part of my research internship. Hopefully I should have results soon.

MaxHalford commented 2 years ago

Hey @vsoch! There are good questions. I kinda always knew in the back of my mind that our API for clustering doesn't answer practical use cases. Your questions make this very apparent.

So my question - how do we integrate this functionality into river, at least exposing enough for a wrapper of some kind to assess changes? Or is it just a bad design idea on my part?

I actually don't know. The lazy answer is that you should reclassify every observation if you're using a clustering model with a changing number of clusters. In fact, even if the number of clusters stays the same, you have to reclassify observations because the clusters will have moved. I'll let you think a bit more about this. If you feel like you still need the functionality you require, then we can think about something.

keep a vector of identifiers (of my own choosing) that I could relate back to my own objects or data, and then have a flavor of KNN that returns those full centroids / identifiers instead? It looks like right now the KNN models here always attempt to classify and none of them actually give back a group of meaningful centroids (e.g., here are data points used in the window that you provided earlier, do as you please with them!) Would this be of interest to others? (it's of interest to me!)

Sorry but I'm not following this part.

And follow up question (I implemented the model above, am testing) can we make knn flexible to different numbers of features?

Indeed this is a current limitation. It's a bit silly really because an earlier version of the KNN models did support this. @smastelini will cook something up at some point.

vsoch commented 2 years ago

Indeed this is a current limitation. It's a bit silly really because an earlier version of the KNN models did support this. @smastelini will cook something up at some point.

Is there something I can try, sooner than in the time of an internship, that would work? E.g., I was thinking just adding a boolean to the base class to allow / not allow, and then looking at the intersection of features. I actually already implemented the vector of identifiers (and I think it should work) but I hit the issue above, doh.

keep a vector of identifiers (of my own choosing) that I could relate back to my own objects or data, and then have a flavor of KNN that returns those full centroids / identifiers instead? It looks like right now the KNN models here always attempt to classify and none of them actually give back a group of meaningful centroids (e.g., here are data points used in the window that you provided earlier, do as you please with them!) Would this be of interest to others? (it's of interest to me!)

So for example let's say I have 3 data points that I use for learning. Each of the original data points has an entry in my data base with the full original context of the error (a package, version, build step, and the full log). And then a user triggers an error, it is parsed by a client into features, and sent to the server to run predict. Assume we've changed KNN predict to return not a predictoin but the actual neighbors and identifiers we stored earlier.

Given a neighbor size of 3 (yes the entire set lo) I want the server to tell the user back "Here are 3 errors that look very much like yours!) With the current implementation (at least for other cluster method), I only get back a few words that are features of the original ones. With the current KNN approaches, it tries to give me a classification. What I want back is more like:

Then I can easily look up error-1, error-2, and error-3 in my database to get the full context of the errors, logs, pretty much anything I want and explicitly show that to the user. So in scope of river, you don't need to worry about the database, we just need a means to be able to additionally tag a point with a user-provided identifier to get back later with a set of neighbors (instead of a mean or prediction). I actually think I have this implemented but I'm stuck on the having different features bit. @smastelini your research internship sounds awesome, but I'm hoping we can discuss a basic workaround / solution that I can use sooner (tonight, soon? lol). Thank you to you both!

vsoch commented 2 years ago

I actually don't know. The lazy answer is that you should reclassify every observation if you're using a clustering model with a changing number of clusters. In fact, even if the number of clusters stays the same, you have to reclassify observations because the clusters will have moved. I'll let you think a bit more about this. If you feel like you still need the functionality you require, then we can think about something.

Oops sorry forgot to respond to this! So if we have to re-classify, I think that would (in a tiny way) defeat the purpose of using online ML in the first place. If I need to reclassify with some frequency, I could just use a batch method instead, because I need to setup some worker to do that at some frequency anyway.

And missing context - I decided to use the clustering for just visualization, and I'm working on a KNN model that, as long as I can store these identifiers (and allow a changing features set), I can use for prediction to give back an actual set of known errors in the database!

smastelini commented 2 years ago

Hi @vsoch, a quick solution for the knn problem is to replace the current version (inherited from skmultiflow) by the one from creme. Feel free to bring it back, at least from my side. I would do that at some point exactly to ensure dynamic feature sets.

I'll read the remaining of your concerns calmly, as I am not following everything too. One thing to always keep in mind though, is that clustering is primarily meant to do descriptive rather than inductive learning. These are two different beasts. So, in principle, moving centroids, label changing and whatnot are just part of the game.

smastelini commented 2 years ago

Also, most of the clustering methods rely on micro-clusters. In other words, they kind of run a batch-based clustering from time to time. This could either be a disadvantage because of delayed cluster updates, or an opportunity to align the re-clustering intervals with the classification you need to do. But again, I still didn't get the whole picture. After I finish some immediate tasks I have to do, I'll get back on this issue.

vsoch commented 2 years ago

Thanks @smastelini I'll try that today! And looking forward to seeing what you are working on.

vsoch commented 2 years ago

heyo! So I made some tiny progress today, and likely nothing substantial to contribute to river, but I wanted to share in case it inspired any thinking.

Summary of Problem

To re-iterate the problem, we want some kind of KNN setup that allows for returning the full set of points (not some prediction that summarizes the neighbors) and also can return the original identifiers for the datum. The reason for that is more in some kind of server you want to be able to look up full metadata and tell the person that submit a new piece of data "here is what we know about others that are similar." But a clustering algorithm isn't appropriate because it's giving us more understanding than classification, and it would be especially challenging for algorithms that have changing clusters and/or micro-clusters! So we look to KNN, which is finding nearest neighbors - they are real points we can say something about.

We cannot do this in the current river KNN implementations because, while keeping the identifiers might be possible, the window is setup to expect the same number of features! It's not that this is bad, but for a streaming problem where you just don't know what you are going to get, it will break quickly (I hit this error on my N=2 data point!).

Starting with Creme

So I started with the creme knn per suggestion of @smastelini (thank you!), and the most notable difference about this implementation is that it's much simpler - instead of the vector approach it simply keeps a collections.deque, or list of lists of things, where each thing might look like:

(x, y)

and the deque bit is a data structure from collections that let's us define a max length (in our case, the window size) and up to that length we keep things, when we hit it we pop off one to make room for the new one. So using this data structure in the context of creme might look like:

And this gives us a great start to tweak for the purposes I outlined needed above. Let's talk through that next.

Adding identifiers

My original attempt to add identifiers to river meant keeping a separate vector that would be in line with the features and labels, e.g.:

self._X[self._next_insert, :] = x
self._y[self._next_insert] = y
self._ids[self._next_insert] = "the-identifier"

but the super cool thing about the collections.deque is that we can... just add another dimension! So the update:

- self.window.append((x, y))
+ self.window.append((x, y, identifier))

Returning points

And then instead of trying to calculate a prediction, I just return the neighbors verbatim.

def predict_one(self, x: dict):
    """Predict the label of a set of features `x`.
    Parameters:
        x: A dictionary of features.
    Returns:
        The neighbors
    """
    return self.predict_proba_one(x)

def predict_proba_one(self, x):
    """
    This is modified to just return the nearest, not try to calculate
    a prediction because we just want the points back.
    """
    return self._nn.find_nearest(x=x, k=self.n_neighbors)

Filtering the Window

Okay so that entire thing was pretty quick! The original design of KNN in creme made it really easy to add the identifier as an extra dimension. But I ran into trouble quickly because I ran it the first time, saved the nearest 5 neighbors of each point to file, and then took a look. :scream:

I had gotten back neighbors that were frankly, kind of bad. I inspected things a bit closer and realized that I had two issues - first, the size of the window would determine the "memory" of my algorithm and thus if I pushed stuff out of the window, I'd no longer have the point. Given the default of 50 and my errors N=~250K obviously I was getting back nonsense. So I could make that larger. But the larger (no pun intended) of the two issues was that given that I had 100 of the same error in a row, my window was getting cluttered with redundant points. So what I decided to do is add a filter, or a decision point about whether to actually add a new point to the window or not. Specifically, I ONLY add a point given that there is no other point under some similarity threshold (e.g., I chose 0.05 since perfectly similar would be 0 and I want to allow for the tiniest bit of differences). I imagine you could play around with that value a bit (I did not). Once I did this, and had a larger window size, my matches looked decent (you can see lists of errors and matches with k=5 here). So after this, we have a nice implementation of KNN that keeps a memory of its data, and is able to intelligently know when to add new points.

This makes me think we could also have a similar intelligent way to remove - maybe find or keep track of two with the smallest similarity and remove that one - I will try this out!

Tiny optimization

The last tiny optimization I did was removing the collections.dequeue in favor of a list that I'd manage myself, and adding a set alongside to keep a quick lookup of errors (the x features joined by space that basically re-creates the error post tokenization) that are currently in the list. That way, if I can find a direct match (lookup time O(1)) I can skip calculating the distances. That saved me a whopping 6 seconds, lol, although to be fair that included the learning and predict, and really I should have just timed predict. But I'm happy with my 6 seconds.

Anyhoo the knn class is here and it's used here. Note that these scripts aren't anything server related, just a testing repository, hence the weirdo class with all the different models I've been trying!

It's not ideal to have a custom model that uses creme, but if it's a good start I don't mind adding to the server I'm developing, and maybe river can support something similar down the line? And if there are facets of that you'd want / like help contributing to river, please let me know! I do think these cases are fun to work on because probably a lot of what people will want for a production server might differ from the original design (which is totally okay!) It's fun to think about. As always, thanks for the fun work today, and for making these libraries!

smastelini commented 2 years ago

Great @vsoch! Feel free to bring (adapt, really) the creme version of KNN to River. I think it is straightforward to do, and it would be a nice contribution!

By the time of the merge between creme and skmultiflow, I, as a skmultiflow maintainer, started porting KNN as the first algorithm to be a test case. By that time, I ran some benchmarks, and the skmultiflow version was a little faster than the creme one. As you saw, increasing the window size makes the linear search... well, slower as we would expect. Scikit-multiflow had some tricks to reduce the impact of that. But, at that time, I had no idea of the impact of changing feature sets in real-world applications.

I could only see "computational performance" in front of me, and to be fair, that is my focus in my PhD thesis. While I still believe computational performance is crucial, I know that one of the most wonderful things about River is its flexibility and robustness facing weird real-world situations.

Hence, I think it is totally fine to bring the most straightforward implementation of k-NN back:

πŸ˜„ We gain flexibility and clarity in the code

😞 KNN might become slightly slower (but we have an overhead to convert the dictionaries to arrays and vice-versa, so who knows?)

No pressure, but feel free to create a PR to update k-NN (I mean, in its traditional learn and predict behavior).

I know it might take some time, but I am indeed "cooking" something promising for k-NN. Let's see how that goes!

vsoch commented 2 years ago

Sounds good @smastelini ! So would you imagine a set of classes like:

I definitely think we shouldn't replace the current implementation, I'm sure the performance bit is good for some cases! But what would you suggest to add this back?

smastelini commented 2 years ago

I think that the buffer flavor you mention can be achieved via a collections.deque + a linear search. IMHO, unless we can provide something more efficient than a linear search, I think it's better to let the user implement this kind of solution and store what they want in the way they want.

For instance, please check structures like sklearn's KDTree for API details of something fancier and suited for batch-based searches. My goal is to provide something like that in the future for generic use cases. But wrapping a deque with query-esque methods sounds like overkill to me; at least from the point of view of adding such a solution to River.

Some time ago I tried my luck with LSH, but it was too hard to configure on the fly.

Now, concerning k-NN, I would simply replace the current implementation. I personally think that robustness and flexibility are worth more than a few milliseconds here and there. Any thoughts, @MaxHalford and @jacobmontiel?

smastelini commented 2 years ago

KNN and Streaming Random Patches have been in my TODO list for quite some time now. They are the two ML models that currently don't support varying feature sets (as far as I remember). At least, those are the ones that keep me awake at night.

MaxHalford commented 2 years ago

I'm in favor of replacing the current implementation of KNN, I'm doubtful as to the performance gains. I would just do something simple (deque + exhaustive search).

vsoch commented 2 years ago

Just was about to comment!-I can give it a shot! I've tested it here and I wound up removing deque in favor of just managing a list and set on my own - the reason was (as I mentioned above) I decided it didn't make sense to add redundant points, and the model performed a lot better with a more diverse window to choose from.

MaxHalford commented 2 years ago

the reason was (as I mentioned above) I decided it didn't make sense to add redundant points, and the model performed a lot better with a more diverse window to choose from.

That sounds like a parameter to add to the model. Or even a prior step in the pipeline.

vsoch commented 2 years ago

Yep! It's a parameter - a threshold to decide for addition. So when set to 0.0 all points are added, something like 0.05 I found as a good default to not add exactly the same ones (but still pretty similar).

MaxHalford commented 2 years ago

That sounds really cool actually!

vsoch commented 2 years ago

ok this sounds super fun! I'll start working on it soon (this week) and get in a PR to start discussion. Likely I start with simple models and bases, and we can discuss how to tweak (or bring back in features/parameters from the previous implementation).

MaxHalford commented 2 years ago

I'm closing this issue because it (successfully) led to the refactoring of the KNN models.

There is still an issue of assigning IDs to clusters. The more I think about it, the more I think that predict_one is not enough. We also need some (possibly optional) mechanism to list the observations assigned with a given cluster (and therefore we also need a method to list clusters). So something like sklearn's labels_ property each of their clustering algorithm has, the difference being that in our case it might change over time. I think this should be made clear in the API and made consistent across all of our clustering algorithms.

FYI @hoanganhngo610 @Dennis1989