Implement bandits - Githubissues

online-ml / river

🌊 Online machine learning in Python

https://riverml.xyz

BSD 3-Clause "New" or "Revised" License

5.1k stars 552 forks source link

Implement bandits #270

Closed MaxHalford closed 3 years ago

MaxHalford commented 4 years ago

We now have SuccessiveHalvingClassifier and SuccessiveHalvingRegressor in the model_selection to module to perform, well, model selection. This allows doing hyperparameter-tuning by initializing a model with different parameter configuration and running them against each other. All in all it seems to be working pretty well and we should be getting some feedback on it soon. The implementations are handy because they implement the fit_one/predict_one interface and therefore make the whole process transparent to users. In other words you can use them as you would any other model. This design will go a long way and should make things simple in terms of deployment (I'm thinking of you chantilly).

The next step would be to implement multi-armed bandits. In progressive validation, all the remaining models are updated. This is called a "full feedback" situation. Bandits, on the other hand, use partial feedback, because only one model is picked and trained at a time. This is more efficient because it results in less model evaluations, but might also converge more slowly. Most bandit algorithms assume that the performance of the models is constant through time (this is called "stationarity"). However, the performance of each model is bound to change through time because the act of picking modifies the model. Therefore ideally we need to looking into non-stationary bandit algorithms. Here are some more details and references.

Here are the algortithms I would like to see implemented:

Also see this paper. We can probably open separate issues for each implementation. I think that the current best way of proceding is to provide one implementation for regression and one for classification in each case, much like what is done for successive halving. There's also this paper that discusses delayed feedback and how it affects bandits.

etiennekintzler commented 4 years ago

I guess here that the bandits algorithms are to be implemented in the special case of model selection / hyper parameter optimization. Assuming the number of arms to be the number of models from which to select from, it would make sense to compute the loss for a given example using predict_one and the true y value and then use fit_one for the arm/model that has been pulled. However it's just my intuition and I have no rationale nor do I know theoretical guarantees for this ; is there any paper / textbook entry that gives a blueprint on how to apply bandits to model selection / hyper parameter optimization ?

Searching on Google, hyperband algorithm (also referenced in https://creme-ml.github.io/api-reference/model-selection/SuccessiveHalvingClassifier/) seems to be a popular way to apply bandit to the problem. Also hyperband seems to be an extension of successive halving and hence fit nicely in what you already have. However I am not sure whether it has been already implemented in creme since a PR refers to it (https://github.com/creme-ml/creme/pull/180 ) but in the meantime I didn't see an hyperband function in the doc.

Also, using contextual information from the feature could be a nice thing to have ! I saw you refer to the Generalized Linear Bandit that both tackle in the same time non stationnarity and use contextual information. As a first step, maybe LinUCB could be implemented before tackling non stationnarity.

What is the current state of the issue (I saw that @VaysseRobin is mentionned too) ? I would gladly contribute on this however I'd need some guidance / reference about the way to implement bandit for this special issue.

MaxHalford commented 4 years ago

I guess here that the bandits algorithms are to be implemented in the special case of model selection / hyper parameter optimization.

Yes, indeed. I like to frame hyper parameter optimization as a special case of model selection. The expand_param_grid function can be used to this purpose for generating many models from a grid of parameters.

... it would make sense to compute the loss for a given example using predict_one and the true y value and then use fit_one for the arm/model that has been pulled.

Yes, that would be ideal. The issue is that this would require some mechanism for identifying a sample so that the prediction made during predict_one can be matched with a ground truth during fit_one. It also requires storing some information whilst waiting for the ground truth to arrive. An alternative solution is to make a prediction during the fit_one call. This is what we do in the current models of the expert module. This is not optimal in terms of required computation, and maybe that it's suboptimal from a theoretical point of view. However, I think it's fine for the time being, and we can explore a different approach further down the line. For instance, we could imagine a different API where an ID has to be passed during fit_one and predict_one.

... is there any paper / textbook entry that gives a blueprint on how to apply bandits to model selection / hyper parameter optimization?

As far as I can tell: no, there isn't. On the contrary, I believe that this in uncharted territory in terms of research, and therefore there's space for building something useful. In fact, we could kill two birds with one stone, as online model selection could serve as a benchmark for evaluating bandits.

With regards to the Hyperband algorithm, I initially wanted to implement it, but then realised that it requires making multiple passes over the data. There's probably a way to reframe it, but I didn't take the time to do so.

I think that LinUCB is a good start, but all the simple bandit algorithms such as epsilon-greedy are worth looking at too. I prefer quality over quantity, but obviously for the moment we don't know what works best. We may as well implement what we can and we'll benchmark/analyse the performance of each method. @VaysseRobin is busy with his PhD at the moment and doesn't have the time to work on creme. I myself was thinking of working on this over the summer, but I'm more than happy if you take the lead on this. As I said, this is a complex but interesting topic that has the opportunity of being very useful. I have also contacted @YRussac, who has Aurélien Garivier as a PhD advisor, and is thus big into non-stationary bandits. He's a bit busy at the moment but is interested in discussing things more a somewhat theoretical perspective.

etiennekintzler commented 4 years ago

I guess here that the bandits algorithms are to be implemented in the special case of model selection / hyper parameter optimization.

Yes, indeed. I like to frame hyper parameter optimization as a special case of model selection. The expand_param_grid function can be used to this purpose for generating many models from a grid of parameters.

Yep, this is how I see the problem as well.

... it would make sense to compute the loss for a given example using predict_one and the true y value and then use fit_one for the arm/model that has been pulled.

Yes, that would be ideal. The issue is that this would require some mechanism for identifying a sample so that the prediction made during predict_one can be matched with a ground truth during fit_one. It also requires storing some information whilst waiting for the ground truth to arrive. An alternative solution is to make a prediction during the fit_one call. This is what we do in the current models of the expert module. This is not optimal in terms of required computation, and maybe that it's suboptimal from a theoretical point of view. However, I think it's fine for the time being, and we can explore a different approach further down the line. For instance, we could imagine a different API where an ID has to be passed during fit_one and predict_one.

Yes doing something in the same fashion EWARegressor.fit_predict_one would be a good first step. The benefit is that it would handle for the user the complete bandit sequence (pull_arm() --> compute_reward(arm, y) --> update_arm(arm, reward)). In this sequence, the predict_one of the model behind the chosen arm would be used in compute_reward (which would be a negative function of the loss) and the fit_one of the model behind chosen arm used in the update_arm phase. For this last step, note that there are 2 kinds of update/fitting :

the arms parameters (defined by arrays/dict N and Q in Epsilon-Greedy and UCB).
the models (based on fit_one).

With regards to the Hyperband algorithm, I initially wanted to implement it, but then realised that it requires making multiple passes over the data. There's probably a way to reframe it, but I didn't take the time to do so.

OK !

... is there any paper / textbook entry that gives a blueprint on how to apply bandits to model selection / hyper parameter optimization? As far as I can tell: no, there isn't. On the contrary, I believe that this in uncharted territory in terms of research, and therefore there's space for building something useful. In fact, we could kill two birds with one stone, as online model selection could serve as a benchmark for evaluating bandits.

OK, so I guess the empirical benchmark would be the principal guide. There is implementation details I am not sure about: should the fit_one be applied on the model behind the chosen arm OR on all models ?

To train just the model behind the chosen arm seems natural but it has pitfall: we don't know if the less rewarded models are so because they are bad (i.e the model, defined by its hyperparameters, yield less accurate predictions) OR if it's just because they have been trained and less data (since pick less often). This is even more true for non-stationnary problem were models who were updated more recently might be advantaged compared with the models that were updated at the beginning (even if the number of updates were the same for each model).

I think that LinUCB is a good start, but all the simple bandit algorithms such as epsilon-greedy are worth looking at too. I prefer quality over quantity, but obviously for the moment we don't know what works best. We may as well implement what we can and we'll benchmark/analyse the performance of each method.

Yes of course, I just suggest it as a preliminary step before implementing non stationnary generalized linear bandit. I think the order of implementation should follow the algorithms complexity :

multi armed bandit : epsilon greedy, UCB
contextual bandit : linUCB
non stationnary multi armed bandit
non stationnary contextual bandit

@VaysseRobin is busy with his PhD at the moment and doesn't have the time to work on creme. I myself was thinking of working on this over the summer, but I'm more than happy if you take the lead on this. As I said, this is a complex but interesting topic that has the opportunity of being very useful. I have also contacted @YRussac, who has Aurélien Garivier as a PhD advisor, and is thus big into non-stationary bandits. He's a bit busy at the moment but is interested in discussing things more a somewhat theoretical perspective.

OK great if we can have an academic perspective on this ! I can have a go on the epsilon greedy. I can benchmark to see how it compares with SuccessiveHalving* (in terms of chosen model, performance on training set and computation time) then if it look good to you I can do a PR.

Do you want that I report the tests in this issue or elsewhere ?

MaxHalford commented 4 years ago

OK, so I guess the empirical benchmark would be the principal guide. There is implementation details I am not sure about: should the fit_one be applied on the model behind the chosen arm OR on all models ?

Well I would say that only the arm that was picked should be updated. The whole point of using bandits is to reduce the total amount of computation time. In predict_one, we can basically just use the best arm we have seen up to now. Tell if this doesn't make sense to you. We could thus compare a bandit algorithm to an expensive oracle which updates every model and uses the best model during predict_one.

To train just the model behind the chosen arm seems natural but it has pitfall: we don't know if the less rewarded models are so because they are bad (i.e the model, defined by its hyperparameters, yield less accurate predictions) OR if it's just because they have been trained and less data (since pick less often). This is even more true for non-stationnary problem were models who were updated more recently might be advantaged compared with the models that were updated at the beginning (even if the number of updates were the same for each model).

That's the challenge of all bandit algorithms.

OK great if we can have an academic perspective on this ! I can have a go on the epsilon greedy. I can benchmark to see how it compares with SuccessiveHalving* (in terms of chosen model, performance on training set and computation time) then if it look good to you I can do a PR.

Great! I would suggest also implemented the expensive oracle model I mentioned above. It could be called BestClassifier or something like that.

Do you want that I report the tests in this issue or elsewhere ?

You may as well report your findings here :)

etiennekintzler commented 4 years ago

OK, so I guess the empirical benchmark would be the principal guide. There is implementation details I am not sure about: should the fit_one be applied on the model behind the chosen arm OR on all models ?

Well I would say that only the arm that was picked should be updated. The whole point of using bandits is to reduce the total amount of computation time. In predict_one, we can basically just use the best arm we have seen up to now. Tell if this doesn't make sense to you.

Yes updating the model behind the chosen arm is what is the most natural but I am afraid that some model lag behind because they just didn't get enough data to beginning with. This can however be solved using more exploration at the beginning of the process ; for instance higher epsilon for epsilon greedy or using an explore first strategy.

We could thus compare a bandit algorithm to an expensive oracle which updates every model and uses the best model during predict_one.

Yes totally agree. Basically it amounts to compute the regret at each step (+ updating all models).

To train just the model behind the chosen arm seems natural but it has pitfall: we don't know if the less rewarded models are so because they are bad (i.e the model, defined by its hyperparameters, yield less accurate predictions) OR if it's just because they have been trained and less data (since pick less often). This is even more true for non-stationnary problem were models who were updated more recently might be advantaged compared with the models that were updated at the beginning (even if the number of updates were the same for each model).

That's the challenge of all bandit algorithms.

Yes but I would say that the uncertainty over the reward/loss distribution is stronger in our case. If you consider classical multi armed bandit, each arm has its fixed distribution and the reward it provides does not depend on how many time you've pulled it. Of course pulling more time will give better approximation about the distribution of the arm but it's not the point here. In our case, the reward is link to how well we predict, which depends not only on the model structure (hyperparameters) but also on how much data has been fed to the model. It's as if the reward distribution was translated to the right as you give them more data until the model is totally fitted.

OK great if we can have an academic perspective on this ! I can have a go on the epsilon greedy. I can benchmark to see how it compares with SuccessiveHalving* (in terms of chosen model, performance on training set and computation time) then if it look good to you I can do a PR.

Great! I would suggest also implemented the expensive oracle model I mentioned above. It could be called BestClassifier or something like that.

Yes ! The oracle is a great idea, I will do it.

Do you want that I report the tests in this issue or elsewhere ?

You may as well report your findings here :)

OK good for me too !

MaxHalford commented 4 years ago

Yes updating the model behind the chosen arm is what is the most natural but I am afraid that some model lag behind because they just didn't get enough data to beginning with. This can however be solved using more exploration at the beginning of the process ; for instance higher epsilon for epsilon greedy or using an explore first strategy.

Yep, every bandit could have a "warm-up" period where the models are selected in sequential order.

Hm not really, I think it's quite particular to our case. If you consider classical multi armed bandit, each arm has its fixed distribution and the reward it provides does not depend on how many time you've pulled it. Of course pulling more time will give better approximation about the distribution of the arm but it's not the point here. In our case, the reward is link on how well we predict, which depends on the model structure (hyperparameters) AND how much data has been fed to the model behind the arm.

Indeed, you're right, I was too hasty reading your comment. This is a reason why I would like to explore non-stationary bandits, as maybe they might yield better performance.

etiennekintzler commented 4 years ago

Indeed, you're right, I was too hasty reading your comment. This is a reason why I would like to explore non-stationary bandits, as maybe they might yield better performance.

I edited my answer to give more details in the meantime :-)

Regarding non stationarity, does creme have options in models to handle it (time weight decay or something else) ?

MaxHalford commented 4 years ago

Regarding non stationarity, does creme have options in models to handle it (time weight decay or something else) ?

Not explicitly, no. However, it turns out that models that use stochastic gradient descent handle drift naturally. There are other models, such as those in the naive_bayes module, which could handle drift, but don't do at the moment.

etiennekintzler commented 4 years ago

I will use this comment as a reminder for related papers (although most papers are about more complex forms of bandits) :

MaxHalford commented 4 years ago

Just some news on this @brcharron: we're currently deep into the process of merging with scikit-multiflow. I'll get back into this topic once we've made more progress on the merge :)

etiennekintzler commented 4 years ago

I've done some progress on the subject.

I compute the performances of all possible models from a given parameters grid and check where the best model selected by bandit procedure and successive halving procedure fall.

For example, for a given grid parameters and dataset, I'll plot the following graph:

In this very example, the model selected by the e-greedy has higher performance (defined as cumulative reward). However both procedures fail to identify the best model (which performance hovers around 3500).

However the preliminary results are not very distinct, meaning that sometimes the bandit procedure select better model than successive halving, sometimes it's the other way around.

Also e-greedy has the following specificities :

not deterministic procedure, results also vary in this regard.
performance improve if you train all the models (and not just the model corresponding to the best arms) on the first 100 examples. If you don't this the bandit might fail to select good models.
you need to specify a way to compute reward from the prediction loss. The function need to be decreasing of prediction loss but there are a lot of choices possible (ex: exp(-loss), 1/(1+loss), -loss).

I will try to do a table that give the performance percentiles for bandit and successive halving procedures for multiple experiments (different grid parameters and/or datasets) and report it here. If the results are interesting we can discuss about how to integrate it to the current code (before I can work on the PR).

MaxHalford commented 4 years ago

Awesome work @etiennekintzler!

I think this is already great because it shows that successive halving isn't necessarily the best solution for online learning. I definitely believe that when we're done we can think about writing a technical report with all these (your) findings.

performance improve if you train all the models (and not just the model corresponding to the best arms) on the first 100 examples. If you don't this the bandit might fail to select good models.

Indeed I think allowing for a warm-up period is crucial. Not sure if this has a name in the literature but it makes a lot of sense.

If the results are interesting we can discuss about how to integrate it to the current code (before I can work on the PR).

Sounds good! What matters for the moment is to A) produce an model selection API that works B) implement something simple like e-greedy C) set up benchmarks like you just did so that newcomers can participate and propose new models.

C is important in particular because it would allow @YRussac to benchmark his linear bandits.

etiennekintzler commented 4 years ago

Awesome work @etiennekintzler!

haha thank you @MaxHalford but a big part of this work has yet to be done :smile:

I'll try to report here the first results within 1 or 2 months. I'll try to include 2 datasets with parameter grid from size ~ 10 to ~ 100 and multiple runs for each experiment to account for the stochasticity of e-greedy.

I think this is already great because it shows that successive halving isn't necessarily the best solution for online learning.

Yes the good thing about using bandit is that it fit well the online learning paradigm of creme that, for a given example x, the bandit procedure output the best model along with a prediction, then the chosen model is update based on the loss (along with the bandit internals). In this sense, it could run live (not sure if it's the case for successive halving).

However one of the drawback is that the models might be under-trained. Also, when comparing performance of bandit and successive halving procedure in the previous example, I took the oracle performances of the procedures' selected models, not the models actually trained by these 2 procedures (since the goal was to study model selection and not the performance / cumulative reward). Thinking of that, for the bandit I might include in the report the cumulative reward actually achieved by the procedure.

I definitely believe that when we're done we can think about writing a technical report with all these (your) findings.

Yes, if the results are conclusive, I would'd be pleased to help write such report.

performance improve if you train all the models (and not just the model corresponding to the best arms) on the first 100 examples. If you don't this the bandit might fail to select good models.

Indeed I think allowing for a warm-up period is crucial. Not sure if this has a name in the literature but it makes a lot of sense.

This is actually called warm-up :wink:

If the results are interesting we can discuss about how to integrate it to the current code (before I can work on the PR).

Sounds good! What matters for the moment is to A) produce an model selection API that works B) implement something simple like e-greedy C) set up benchmarks like you just did so that newcomers can participate and propose new models.

C is important in particular because it would allow @YRussac to benchmark his linear bandits.

Yes sure !

MaxHalford commented 4 years ago

Yes the good thing about using bandit is that it fit well the online learning paradigm of creme that, for a given example x, the bandit procedure output the best model along with a prediction, then the chosen model is update based on the loss (along with the bandit internals). In this sense, it could run live (not sure if it's the case for successive halving).

Indeed, that's the big selling point for bandits. One thing to note, however, is that in practice labels arrive with a delay. Therefore there has to be some mechanism for remembering which models made which prediction. Or not, maybe that the bandit selection only has to occur the training step, whereas the prediction is always made by the best model. Not sure if I'm being clear. There seems to be some litterature on bandits with delayed feedback, although I'm not sure it's relevant. On a side-note, I've also read that clipper implements some kind of bandit for selecting models (although as usual assume the models are pre-trained).

etiennekintzler commented 4 years ago

I benchmark the different methods (bandits such as epsilon-greedy and UCB, Sucessive Halving) on simulated data (as a sanity check) and check how well they are able to select the best model. The best model is defined as the one that maximize cumulative reward. Oracle is used to get the percentiles of the cumulative reward over the models under selection.

The conclusions for the simulations (detailed further down) are :

SH get the best performance on average : the models selected are in the 90-95% percentile of cumulative reward on average.
Bandits get less impressive performance (than SH) and UCB outperforms epsilon-greedy : models selected are in the 75-80 % percentile on average for UCB and 65-70% for e-greedy. This is for simulations with about 5 000 observations. For simulations with less observations (< 1 000) performance fall to about 60-70% for both bandit models.

Below is a plot of the distribution of the performance for each method across 40 simulated dataset, each one having 5 000 observations. The data generating process is: _y = \sum_j^p \beta_j*xj + \epsilon with p = 10. The different models correspond to different L2 regularization values : [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 2, 3, 4, 5]. The method "rand" selects a random model and is used for control. "ego" is epsilon-greedy. These results are for the basic implementation of bandits (no warm up).

That's a lot of information but the tl;dr is that bandits seem to work (as they perform better than random selection of model) and they need sufficient amount of observations to get nice performance (contrary to SH).

Also, I like the fact that bandits integrate nicely in the online learning framework. For instance the code looks like:

    bandit = EpsilonGreedyOptimizer(models)

    mae = metrics.MAE()

    for (x, y) in enumerate(data):

        chosen_arm, y_pred = bandit.pull_arm(x)

        bandit.update_arm(chosen_arm=chosen_arm, y=y, x=x)

        mae.update(y_pred=y_pred, y_true=y)

The two bandit steps can also be encapsulated in learn_predict_one as the other classes of the expert module. Also, more models could be added on the fly (not tested yet).

By the way I see that the library is to be merged with scikit-multiflow. Should I wait before working on the PR ?

MaxHalford commented 4 years ago

Looks good to me! It's nice that you benchmarked against a totally randomized method to verify that the bandit is actually learning something.

SH is more accurate, but it requires updating each model, whereas bandits only update one at a time, right?

What do you think about always predicting with the best arm, and calling the bandit only for picking which model to update. This is what is done in successive halving. As far as I understand, the only reason why a bandit chooses a model for predicting is that it will update said model when the ground truth arrives. But looking at your code, it looks like you're not using the predicted information in the update_arm method. To recap, the API of a bandit would be:

predict_one: make a prediction with the best arm.
learn_one: update a random following the bandit's strategy (explore/exploit). One could also imagine always updating the best model, and also updating another model via the bandit's selection strategy.

What do you think about this?

The two bandit steps can also be encapsulated in learn_predict_one as the other classes of the expert module.

Actually I've thought about it a bit, and I would rather discourage that API. In the scenarios we're catering to, there's never a case where you need to make prediction whilst having the ground truth at your disposal. In other words there's always a delay between the prediction and the ground truth. Makes sense?

By the way I see that the library is to be merged with scikit-multiflow. Should I wait before working on the PR?

We've just finished the merge (finally!). You can send pull requests to master :). Note that fit_one has been renamed to learn_one.

etiennekintzler commented 4 years ago

Looks good to me! It's nice that you benchmarked against a totally randomized method to verify that the bandit is actually learning something.

Yes, but I'd like to benchmark it on 1 or 2 real datasets.

SH is more accurate, but it requires updating each model, whereas bandits only update one at a time, right?

Bandit update one model at a time yes !

What do you think about always predicting with the best arm, and calling the bandit only for picking which model to update.

Unless you fall on the "random" part of the epsilon greedy (in epsilon % of the time), the arm/models pulled in UCB and e-greedy is always the best one according to a criterion (vector Q).

This is what is done in successive halving. As far as I understand, the only reason why a bandit chooses a model for predicting is that it will update said model when the ground truth arrives. But looking at your code, it looks like you're not using the predicted information in the update_arm method.

The method pull_arm returns both the best arm/model (according to the criterion) along with the prediction of this model (using predict_one). The arm pulled (i.e the best arm) is then passed to the parameter chosen_arm of the method update_arm to update the model behind the arm (using fit_one). Actually, in the code example I gave, the chosen_arm should be renamed best_arm.

To recap, the API of a bandit would be:

predict_one: make a prediction with the best arm.

learn_one: update a random following the bandit's strategy (explore/exploit). One could also imagine always updating the best model, and also updating another model via the bandit's selection strategy.

What do you think about this?

I don't see the difference between the best arm and the arm selected by the bandit strategy, as the bandit strategy is supposed to draw the best arm (for UCB and e-greedy unless epsilon % of the time). Also, the explore/exploit strategy is often embodied in the criterion to maximize (for UCB and LinUCB's contextual bandit). Could you define what is the best arm if it's not the one that results from the bandit's selection strategy?

The two bandit steps can also be encapsulated in learn_predict_one as the other classes of the expert module.

Actually I've thought about it a bit, and I would rather discourage that API. In the scenarios we're catering to, there's never a case where you need to make prediction whilst having the ground truth at your disposal. In other words there's always a delay between the prediction and the ground truth. Makes sense?

I think it's indeed more healthy to distinguish the best arm's selection that yields the prediction from its update. However, there has to be a mechanism to tell which arm to update at time t with the information (x_t, y_t) of that time (even if y_t arrive with delay). In the example code, this information is carried using chosen_arm.

I can make it fit in the framework predict_one and learn_one :

predict_one will pull the best arm and yields its prediction (without returning the best arm contrary to what I did).
learn_one pull the best arm and update it using (x, y).

Note that the best arm is computed twice in this case (since the information is not carried using chosen_arm). Also, if there is randomness in the selection (for instance in e-greedy or Thomson sampling) it's possible that the best arm in the learn_one step is not the same as the one pulled in predict_one.

By the way I see that the library is to be merged with scikit-multiflow. Should I wait before working on the PR?

We've just finished the merge (finally!). You can send pull requests to master :). Note that fit_one has been renamed to learn_one.

Ok, great 👍

MaxHalford commented 4 years ago

Hey @etiennekintzler, how would like to do a quick conference call to align? We might move forward faster than by written way :)

etiennekintzler commented 4 years ago

Sure @MaxHalford, I sent you a mail with my info (at the adress in the code of conduct).

I've begun to work on the PR on a fork ; you can check the following file https://github.com/etiennekintzler/river/blob/bandits/river/expert/bandit.py , it might clarify what I am doing (it's not a definitive version ofc).

etiennekintzler commented 3 years ago

Hey @MaxHalford, I run accross this paper: Online Model Selection: a Rested Bandit Formulation which problematic is really close to what we're trying to do here.

You might want to check this out ; it's mainly theoretical but it still interesting to see how they lay out the problem. Also they said that they are planning to do experimental analysis in future work.

Some extracts :

In contrast, in this paper we are interested in a non-stationary stochastic bandit setting called rested bandits [2, 5, 8, 17, 22, 31]. Here, the feedback/losses received upon pulling arms are not i.i.d. anymore. Instead, the distribution of losses changes as a function of the number of times each arm has been pulled so far. [...] A striking motivation behind assumptions like (1) is the study of online model selection problems with bandit feedback. Here, at each round the only observed feedback is the one associated with the selected arm, where each arm represents a learning device, so that arms could also be referred to as base learners. The online model selection problem when we restrict to base learners which are themselves bandit algorithms has recently received a lot of attention (e.g., [1, 13, 12, 27]). Yet, we would like to emphasize that, in our setting, the base learners could be any generic learning devices (like different neural network architectures) that satisfy Equation (1).

YRussac commented 3 years ago

Hi everyone, The author of this paper is one of my friend and the topic is closely related to what I am doing in my PhD if you want more information about the paper i am happy to discuss it. Enjoy your holidays !

Yoan

Le 25 déc. 2020 à 23:50, Etienne Kintzler notifications@github.com a écrit :

Hey @MaxHalford, I run accross this paper: Online Model Selection: a Rested Bandit Formulation which problematic is really close to what we're trying to do here.

You might want to check this out ; it's mainly theoretical but it still interesting to see how they lay out the problem. Also they said that they are planning to do experimental analysis in future work.

Some extracts :

In contrast, in this paper we are interested in a non-stationary stochastic bandit setting called rested bandits [2, 5, 8, 17, 22, 31]. Here, the feedback/losses received upon pulling arms are not i.i.d. anymore. Instead, the distribution of losses changes as a function of the number of times each arm has been pulled so far. [...] A striking motivation behind assumptions like (1) is the study of online model selection problems with bandit feedback. Here, at each round the only observed feedback is the one associated with the selected arm, where each arm represents a learning device, so that arms could also be referred to as base learners. The online model selection problem when we restrict to base learners which are themselves bandit algorithms has recently received a lot of attention (e.g., [1, 13, 12, 27]). Yet, we would like to emphasize that, in our setting, the base learners could be any generic learning devices (like different neural network architectures) that satisfy Equation (1).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

MaxHalford commented 3 years ago

I just finished reading the paper, it's really good. It's uncanny how this is exactly what we're trying to do.

In an ideal world, we could ask the authors to perform the "experimental analysis" they refer to with River. The same goes for you @YRussac: we would love to see the stuff you've worked on implemented in River. But I don't know what everyone's plans are. The only thing that's clear to me is that we all have an interest in bandits/online model selection. The community as a whole would benefit from some reference implementations and clear benchmarks.

Maybe we should start with that: write some benchmarks to compare bandit strategies. Ideally this could be done in collaboration with the author of the paper and @YRussac. It's a bit far-fetched but why not? This would lead the way in a new direction we would like to pursue with River: propose a set of benchmarks to compare online machine learning algorithms in a principled manner (cc @jacobmontiel @smastelini).

YRussac commented 3 years ago

Hi everyone, I think that this is a great idea. I can try to organise a call with Leonardo this should not be a problem. Let’s discuss this more into details after the Christmas holydays ? Your work is pretty exiting ! Yoan

Le 26 déc. 2020 à 21:58, Max Halford notifications@github.com a écrit :

I just finished reading the paper, it's really good. It's uncanny how this is exactly what we're trying to do.

In an ideal world, we could ask the authors to perform the "experimental analysis" they refer to with River. The same goes for you @YRussac: we would love to see the stuff you've worked on implemented in River. But I don't know what everyone's plans are. The only thing that's clear to me is that we all have an interest in bandits/online model selection. The community as a whole would benefit from some reference implementations and clear benchmarks.

Maybe we should start with that: write some benchmarks to compare bandit strategies. Ideally this could be done in collaboration with the author of the paper and @YRussac. It's a bit far-fetched but why not? This would lead the way in a new direction we would like to pursue with River: propose a set of benchmarks to compare online machine learning algorithms in a principled manner (cc @jacobmontiel @smastelini).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

MaxHalford commented 3 years ago

The first brick has been implemented and merged, big thanks to @etiennekintzler for the effort! I'm closing this now.