Closed kaposnick closed 2 years ago

First of all, thank you a lot for your contributions; they are very valuable for improving my understanding of the original paper. I have a fundamental question regarding the implementation of the NeuralUCB confidence multiplier: how exactly is it derived from the original paper?
Hi Nikos,
Thank you for your interest in this.
Before we start, important disclaimer: this is quite an old work of mine I did for a class project back in the days, and I am not really involved with neural bandits at the moment (although I am doing research on other classes of contextual bandits). As such, this repo may contain errors and inaccuracies.
The general form of the confidence bound suggested in [1] is $\gamma_t \sqrt{g_t^\top A_{t-1}^{-1} g_t}$, where $g_t$ is the gradient of the neural network function w.r.t. the network parameters, $A_{t-1}$ is the Gram (or design) matrix of the regression problem, and $\gamma_t$ is a user-specified hyperparameter (see p. 28 of my slides, available in this repo). The reason why the gradients $g_t$ appear here is detailed in the paper: essentially, a neural network with large enough layers can be closely approximated by a certain kernel regressor (the so-called neural tangent kernel), whose feature maps are given by precisely these gradients.
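For concreteness, here is a minimal PyTorch sketch of that bound; it is not the exact code of this repo, and the names (`net`, `x`, `A_inv`, `gamma_t`) are placeholders. Note that the design matrix is $p \times p$ with $p$ the number of network parameters, so practical implementations often approximate it by its diagonal to keep this tractable.

```python
import torch

def neural_ucb_width(net, x, A_inv, gamma_t):
    """Compute gamma_t * sqrt(g_t^T A_{t-1}^{-1} g_t) for a single context x.

    net:   neural regressor f(.; theta), assumed to output a scalar reward
    x:     context tensor of shape (1, d)
    A_inv: inverse design matrix of shape (p, p), p = #network parameters
    """
    net.zero_grad()
    net(x).backward()  # fills p.grad with g_t = grad_theta f(x; theta)
    g = torch.cat([p.grad.flatten() for p in net.parameters()])
    return gamma_t * torch.sqrt(g @ A_inv @ g)
```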
Now, the authors of [1] suggest a specific tuning for $\gamma_t$, which comes from their theoretical analysis. The resulting expression for $\gamma_t$ is a bit ugly and potentially quite loose, due to the complexity of the theoretical analysis of neural network regression. In particular, it depends on a number of arbitrary constants $C_1, C_2, C_3$ (you know there exist constants for which the regret of the neural bandit algorithm will be small, but you do not really know their values) and on the sub-Gaussian parameter $\nu$ of the reward noise (again, a parameter that is not really known in practice), which makes the expression quite impractical. From a practitioner's point of view, it is quite standard to treat $\gamma_t$ as a hyperparameter to be tuned manually by the user: increasing $\gamma_t$ promotes more aggressive exploration (as opposed to exploitation), and vice versa.
The expression I used in my implementation of NeuralUCB is directly inspired by the linear bandit case, where the theoretical analysis is simpler and thus yields a more accessible confidence bound multiplier: $$\gamma_t = \nu \sqrt{d \log\left(1 + \frac{t L^2}{\lambda d} \right) + 2 \log\left( \frac{1}{\delta}\right)} \,,$$ where $L$ is an upper bound on the norm of the contexts and $\lambda$ is the $L^2$ regularization parameter. (Actually, I realized today that I had forgotten an extra $\sqrt{\lambda} S$ term, where $S$ is an upper bound on the norm of the linear parameter $\theta$; again, this should not be too bad since this is simply a heuristic, but you may want to add it.) I refer you to [2] if you are unfamiliar with the linear bandit case; this paper is at the root of most modern work on contextual bandits and is a must-read if you are interested in this domain. I have replaced the dimension $d$ of the linear parameter with the dimension of the neural network, i.e. the sum of the dimensions of all trainable layers (there may be better ways to do this, for instance using the notion of effective dimension).
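Written out as code, the heuristic multiplier above (including the $\sqrt{\lambda} S$ term I mentioned) is just the following; the argument names are mine and the default values arbitrary:

```python
import numpy as np

def gamma_t(t, d, nu, L=1.0, lam=1.0, delta=0.05, S=1.0):
    """Heuristic confidence multiplier adapted from the linear bandit case [2].

    t:     round index
    d:     parameter dimension (here, total number of network parameters)
    nu:    sub-Gaussian parameter of the reward noise
    L:     upper bound on the norm of the contexts
    lam:   L2 regularization parameter
    delta: failure probability of the confidence bound
    S:     upper bound on the norm of the (linearized) parameter
    """
    return (nu * np.sqrt(d * np.log(1 + t * L**2 / (lam * d))
                         + 2 * np.log(1 / delta))
            + np.sqrt(lam) * S)
```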
Note that the role of $\nu$ is played in the code by self.confidence_scaling_factor. As I mentioned, this parameter is intended to be a hyperparameter tuned manually: feel free to try different values and see whether they provide too much, too little, or (hopefully!) just the right amount of exploration. You may try replacing it with the values recommended by [1], but I suspect their analysis, not being very sharp, will lead to substantial over-exploration.
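To see the effect numerically: the UCB score of an action is the predicted reward plus the confidence width, and the width scales linearly with $\nu$ through $\gamma_t$. The values below are made up purely for illustration:

```python
prediction, width = 0.8, 0.25  # made-up values for f(x) and sqrt(g^T A^{-1} g)
for nu in (0.1, 1.0, 5.0):     # nu plays the role of confidence_scaling_factor
    print(f"nu = {nu}: UCB score = {prediction + nu * width:.2f}")
```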
Finally, I believe more recent work has been done on using neural regressors for contextual bandits, and it may provide better, more practical analyses. However, my feeling is that the theoretical analysis of such bandits is still not very sharp, so if you are interested in practical performance, you may still have to rely on the sort of heuristics I used. If you are interested in the theory, another class of models that sits between linear and neural bandits and is becoming quite well understood is that of generalized linear bandits; see e.g. [3].
Let me know if something is unclear.

Patrick
[1] Zhou, Dongruo, Lihong Li, and Quanquan Gu. "Neural contextual bandits with UCB-based exploration." International Conference on Machine Learning. PMLR, 2020.
[2] Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári. "Improved algorithms for linear stochastic bandits." Advances in Neural Information Processing Systems 24 (2011).
[3] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.
Thank you, Patrick, for the thorough explanation. I had trouble interpreting the variables $C_1, C_2, C_3$ and $\gamma_t$ in the paper, but it makes sense to handle them as hyperparameters.
Cool, I will close the issue, feel free to reach out if you have more questions.