jedwards24 / gittins

R package for calculating Gittins indices
GNU General Public License v3.0

Using bmab_gi_multiple #1

Closed jeff-hykin closed 2 years ago

jeff-hykin commented 4 years ago

Hi James, first, thanks for publishing this repo! As far as I'm aware it is the first and only code implementation capable of calculating the Gittins index.

Three years ago, here on Math Stack Exchange, I asked for help computing the Gittins index. To this day the only answer there is "it's really hard to calculate the Gittins index". Despite my best efforts I'm still unable to solve even the most trivial of examples.

I read the readme.md, but I'm afraid I don't understand how to use bmab_gi_multiple. The discount rate gamma makes sense, and I can probably use trial and error to understand tol and N, but I don't know what alpha_start and beta_start would correspond to in a real-world multi-armed bandit problem, or why the output is in the form of a triangular matrix.

The example I've been trying to solve for three years is this:

I'm at a casino where there are 3 machines: M_1, M_2, M_3. Every machine pays out $1 for a win and $0 for a loss.

- I've played M_1 three times; it has 2 wins, 1 loss.
- I've played M_2 four times; it has 2 wins, 2 losses.
- I've played M_3 two times; it has 1 win, 1 loss.

If I discount future payouts by 50%, what is the Gittins index of M_1?

If you can show me how to use bmab_gi_multiple (or bmab_gi_multiple_ab) to solve this, I would be extremely grateful. I've been wanting to create a Python implementation of this for some time now.



As a side note: I read your paper, which was by far the most understandable out of all the papers I've read on the Gittins index. Sadly though, despite being a graduate student in CS, I'm unable to reproduce the following expression in code:

[Screenshot of the value function equation from the paper.]

I don't know what operation the $\gamma E_Y$ refers to, what the ; notation means in this context, or whether the {}'s refer to some form of set notation or are simply equivalent to parentheses. These are the same syntactical things stopping me from understanding basically every paper discussing the Gittins index.

AI-Ahmed commented 2 years ago

@jeff-hykin I am coming from the future to tell you that I am in the same situation. The problem with the Gittins index in Bernoulli bandits is that the state space s is infinite, which means the index cannot be computed exactly. Additionally, the Gittins index is, in principle, a form of information state search, which means its bandits are characterized by the three factors mentioned above, <S, A, R>. The state space can, however, be truncated using the $\alpha$ and $\beta$ parameters.

The formula provided by @jedwards24 in his paper lacks an explanation of the notation used.

Now, it is very important to mention that @jedwards24 uses an upper bound and a lower bound in his code to compute the GI, which added to our confusion. (1) I would love an explanation of the equation and of how he was able to integrate bmab_kgi with bmab_giplus_value. (2) I would also like an explanation of the calibrate_arm function and what it does.

Thanks,

jedwards24 commented 2 years ago

Hi @jeff-hykin,

Apologies, I was not aware of your post until I was notified of @AI-Ahmed's response. I would have expected to be notified by Github when you posted - I will have to have a look at my notification settings.

The documentation for the package is on the minimal side. When I wrote the package I probably had in mind researchers working in the area and relied on the paper for explanation. Mathematical notation is not as universal or clear as academics think it is, though, so I'm not surprised that you would find it confusing. I've since learnt that one document does not work for multiple audiences.

For your example, the simplest way is to use the bmab_gi_ab() function which gives one Gittins Index (GI) at a time. This form of the multi-armed bandit is Bayesian so depends on prior beliefs on the arm. Without getting into that, the standard approach is to start with a (1,1) prior on the arm which means that the alpha parameter is the number of wins + 1 and the beta parameter is the number of losses + 1. So in your M1, alpha = 3 and beta = 2.
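For example, the bookkeeping for your three machines is just (a minimal sketch, using the (1, 1) prior described above):

```r
# Map observed results to Beta posterior parameters, assuming a (1, 1) prior
wins   <- c(M1 = 2, M2 = 2, M3 = 1)
losses <- c(M1 = 1, M2 = 2, M3 = 1)
alpha <- wins + 1    # e.g. M1: alpha = 3
beta  <- losses + 1  # e.g. M1: beta = 2
```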

The tol argument determines the accuracy of the result and is up to you. GIs can only be computed numerically so they are never exact, but they can be made arbitrarily accurate given enough computing time. In this case, because the discount is steep, it does not take much computation to get a very accurate result. N is also a tradeoff between accuracy and computation time: increasing N will give a better or equally good result. It only becomes a practical issue as the discount rate gets nearer 1. For lower values, such as here, you can choose N big enough that it won't compromise accuracy; N = 100 will be plenty here. So the code to run is:

```r
bmab_gi_ab(3, 2, 0.5, 1e-5, 100)
#> [1] 0.6288994
```

The GIs for M2 and M3 can be found similarly:

```r
bmab_gi_ab(3, 3, 0.5, 1e-5, 100)
#> [1] 0.5257613
bmab_gi_ab(2, 2, 0.5, 1e-5, 100)
#> [1] 0.5358699
```

The bmab_gi_multiple_ab() function just does this for a number of win/loss combinations at once. So if you run bmab_gi_multiple_ab(1, 1, gamma = 0.5, N = 100, num_actions = 10, tol = 1e-5), you will see the GIs from the single-state results above in the output matrix. The matrix contains GIs for all possible win/loss combinations that can be reached within 9 goes on the machine (starting with a (1, 1) prior). I can't remember why it is 9 and not 10.
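For example, something like the following (I'm writing this from memory, so check the printed dimnames for the exact row/column layout rather than relying on my indexing):

```r
gi_mat <- bmab_gi_multiple_ab(1, 1, gamma = 0.5, N = 100,
                              num_actions = 10, tol = 1e-5)
gi_mat
# The entry corresponding to alpha = 3, beta = 2 (your M_1) should agree with
# bmab_gi_ab(3, 2, 0.5, 1e-5, 100), i.e. roughly 0.629.
```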

On your question on the equation from the paper: the curly brackets are just parentheses. The ; just means take the max of the expressions on either side (a comma would look more natural). The $\gamma E_Y$ is gamma (the discount factor) multiplied by the expectation over Y. Y, the outcome of the arm pull, is a random variable. The formula is a general one but can be made simpler in the Bernoulli case since Y is either 0 or 1. Then the $E_Y(...)$ is $Pr(Y=1)V(\Sigma+1, n+1, \gamma, \lambda) + Pr(Y=0)V(\Sigma, n+1, \gamma, \lambda)$. That is, the probability of a win times the value after the win, plus the probability of a loss times the value after the loss.
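In code, that expectation is just a two-term sum. A minimal sketch (not the package's internal code; it assumes the state $(\Sigma, n)$ corresponds to a $Beta(\Sigma, n - \Sigma)$ posterior, so that $Pr(Y=1) = \Sigma/n$, and that V_next() is a stand-in returning the value of the next state):

```r
# Sketch of the E_Y term in the Bernoulli case (not package code).
# V_next(Sigma, n) is a stand-in for the value of the next state.
expect_Y <- function(Sigma, n, V_next) {
  p_win <- Sigma / n  # posterior probability of a win
  p_win * V_next(Sigma + 1, n + 1) + (1 - p_win) * V_next(Sigma, n + 1)
}
```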

I hope that is a bit clearer (although unfortunately very late). It has been a while since I looked at this or anything Gittins related, but maybe I will have a go at adding some documentation and some simplified examples since it seems wrong that there isn't a straightforward explanation of this anywhere.

@AI-Ahmed, your questions could use a longer answer too, but briefly, calibrate_arm successively reduces an interval in which the GI value lies. The lower and upper bound define the starting range of that search. For Bernoulli bandits these limits can be 0 and 1 which will work fine. The advantage to choosing tighter bounds is just to speed computation, although for the Bernoulli bandit it often won't make an important difference. I agree it confuses things, and I could have kept it all hidden, but the package's initial purpose was for research where it was useful to have everything visible and controllable.
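In isolation, the calibration idea looks something like this (a generic bisection sketch rather than the actual calibrate_arm code; higher_than_lambda() is a hypothetical stand-in for the package's value comparison):

```r
# Generic interval-bisection sketch (not the real calibrate_arm).
# higher_than_lambda(lambda) should return TRUE if pulling the arm is preferred
# to retiring on a safe arm paying lambda, i.e. if the GI exceeds lambda.
calibrate_sketch <- function(higher_than_lambda, lb = 0, ub = 1, tol = 1e-5) {
  while (ub - lb > tol) {
    mid <- (lb + ub) / 2
    if (higher_than_lambda(mid)) lb <- mid else ub <- mid
  }
  (lb + ub) / 2  # the GI lies within tol of this value
}
```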

jeff-hykin commented 2 years ago

(1,1) prior on the arm which means that the alpha parameter is the number of wins + 1 and the beta parameter is the number of losses + 1

By sheer coincidence, I discussed priors in-depth for the first time with a professor last week, so I've got somewhat of a feel for them.

Thank you so much for this post @jedwards24 !

Over the years I've become much more familiar with expectation notation, although for max I still didn't realize the semicolon was splitting the values (I was thinking max over an implicit domain of some kind). I still definitely wouldn't have been able to do that example. Imagining the abstract case of an expectation is pretty tough, so your explanation of Y helps a lot.

Having the decimal values for the simple example is priceless. From that, I'm completely confident I can go look at the code and figure out how everything else works. And I can probably read the paper afterwards with an entirely new understanding. I'll be making a Python and JavaScript version since I still run into multi-armed bandit problems all the time in code. And I'm incredibly excited, after 5 years, to finally post the answer to that Math Stack Exchange question!

jeff-hykin commented 2 years ago

there isn't a straightforward explanation of this anywhere

Haha, this is true for the GI in general. I read about the Gittins index in my sophomore year (2015) and the author made it sound as generic and common as a p-value. By senior year I had asked every math and computer science professor at my university, emailed professors at other universities (who did not respond), read at least 10 papers on the subject, and posted that question. This is the first time in my life I've actually seen a Gittins index being calculated. I think I had started to believe that, if even the experts never show examples, maybe it was impossible. I'm so glad to see that it's not.

AI-Ahmed commented 2 years ago

First of all, thanks @jedwards24 for answering me, and thank you @jeff-hykin for contributing and getting back to this again.

I tried to contact you via your academic e-mail, @jedwards24, but I found that it has been closed.

For the sake of clarification, please bear with me for a few minutes. I want to clarify multiple aspects of the Gittins index, and I am sorry if some of it seems basic.

The Gittins index was the big dilemma of the years between 1974 and 1979, after John Gittins released his research on indices and how they can be used in medical trials instead of random allocation.

Many proofs were released in the 30 years that followed; you could say that top scientists from mathematics and computer science departments wrestled with the dilemma of how to use dynamic programming (DP) to solve this problem, until four researchers started to make it happen with different approaches.

Two of them said that working with the Gittins index has to be organized into the form of a table, and this table requires a great deal of computation (Prof. Richard Weber, J. Gittins).

Prof. Richard Weber (1992) said that we can treat this as multiple arms with a discount rate for each trial, adding a probability of switching between the arms if the chosen arm starts to yield lower rewards.

Whittle (1988) argued that we can find a better way, using dynamic programming, to solve the multi-armed bandit problem without this table, by calibrating the arms (which adds a new variable, lambda, together with the known posterior reward [obtained from the mean of alpha and beta]).

Unfortunately, it was a theoretical approach and no one cared about implementing it until 2011.

In 2011, J. Gittins released his book, which contains many different approaches to the exploration vs. exploitation dilemma. A year later, two scientists, Warren B. Powell and Ilya O. Ryzhov, released Chapter 18, "Optimal Learning and Approximate Dynamic Programming", based on Gittins's book, about the Gittins index algorithm. It explains how to create a Gittins index table using the value function by integrating over the reward interval [0, 1] for a number of arms, approximating both the Gittins index and the value function using the offline approach described in Gittins's book.

This approach is widely used among researchers right now; it relies on previously approximated values and randomized arms.

You can find it here: https://github.com/bobbingbobb/dexnet_py3/blob/cccf93319095374b0eefc24b8b6cd40bc23966d2/src/dexnet/learning/discrete_selection_policies.py

After all this information, I prefer your approach, @jedwards24, because it relies on two things I like: (1) it relies on the actual reward distribution [since you are using Bayes' law], and (2) it takes into account the lower bound and the upper bound of each Bernoulli Beta distribution, which helps accomplish what Whittle talked about 35 years ago.

AI-Ahmed commented 2 years ago

Now, after explaining the situation, I want to clarify my problem and how we can use your approach to solve it.

I have stationary/non-stationary Bernoulli environments that provide the agent with 3 updating variables after n_actions is assigned to it: (1) a vector of successes ($\alpha$), (2) a vector of failures ($\beta$), and (3) total_pulls, the total number of pulls the agent has made in the environment [stored in the agent's total_pulls variable]. The Bernoulli outcome works as follows: if the probability of the action is less than the random number, it returns 0 [recorded in the agent's ($\beta$) variable]; otherwise, it returns 1 [recorded in the agent's ($\alpha$) variable].

The agent has a method called get_action, which we use to calculate which arm should be chosen in the Bernoulli setting online, i.e. on each iteration.

I previously implemented epsilon-greedy exploration, UCB1, Thompson sampling, and Bayesian UCB using this setup, but I am still trying to understand the algorithm you have written so that I can apply it within my Bernoulli environment.

Honestly, I am not good enough with R to understand the bmab_giplus_value function. It is the only method that really confuses me.

If you have any time to explain this function so that we can translate it into Python, that would be a great kindness.

I don't understand where the dbeta or continue functions come from, or what integrate represents.

Thank you again for reading all of what I have written. Also, if you have LinkedIn, it would be a pleasure to connect with you there.

jedwards24 commented 2 years ago

Hi @AI-Ahmed,

When I wrote the paper accompanying this code I was aiming to meet journal page limits, which restricted how much detail I could include and how thoroughly I could explain everything. It is absurd that academic publishing still works this way. I did not include detail on the bounds, but it can be found in Chapter 5 of my thesis at https://eprints.lancs.ac.uk/id/eprint/84589/1/2016edwardsphd.pdf. The chapter is an earlier version of the paper but, because space was less limited, it contains some extra information not in the later paper. See 5.2.1 for detail on the upper bound. I needed a bit of time myself to work out what the code does.

The GI+ is an upper bound on the GI (it could be used as an approximation to it). It is the index of the arm if you assume that the true reward of the arm is revealed after the next pull. Like the GI, you need calibration to calculate it, but it is much quicker than the GI calculation.

After the next pull you will know whether to continue with the arm or choose the safe arm. Call the arm's true reward $\theta$ and the reward of the safe arm $\lambda$. Then if $\theta>\lambda$ we will continue to play the arm forever, and otherwise we will play the safe arm forever. We do not know $\theta$ yet, so we have to work with the distribution of the true arm reward. After a success this is $Beta(\Sigma + 1, n - \Sigma)$. The dbeta() function gives the density of a beta distribution, so x * dbeta(x, Sigma + 1, n - Sigma) gives x times the density of $\theta$ at x. Integrating this from lambda to 1 gives the expected (undiscounted) reward for the following play when $\theta>\lambda$. The lambda * pbeta(...) bit gives the expected reward when $\theta<=\lambda$ (simpler to calculate since $\lambda$ is a constant). Together, with appropriate discounting, they give the full expected reward.
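As a standalone snippet, the expected one-step reward described above could be written like this (a sketch only, not the actual bmab_giplus_value function; dbeta(), pbeta(), and integrate() are base R's Beta density, Beta CDF, and numerical integration):

```r
# Sketch: expected (undiscounted) reward of the next decision after a success,
# when theta has a Beta(Sigma + 1, n - Sigma) posterior.
expected_reward_after_success <- function(Sigma, n, lambda) {
  # theta > lambda: keep playing the arm, so the reward is theta itself
  keep <- integrate(function(x) x * dbeta(x, Sigma + 1, n - Sigma),
                    lower = lambda, upper = 1)$value
  # theta <= lambda: switch to the safe arm, so the reward is lambda
  safe <- lambda * pbeta(lambda, Sigma + 1, n - Sigma)
  keep + safe
}
```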

Let me know if you have further questions and good luck in your work.

Regards, James

AI-Ahmed commented 2 years ago

Thank you so much, @jedwards24. I appreciate your hard work and how consistent it is. I will check your thesis. Unfortunately, for now, since I am doing this just for fun, I have implemented the Whittle equation released in 1980, which was covered in Gittins and Weber's book in 2011.

Covering exploration in a PhD or MSc is quite risky since it is one of the major principles behind having an agent that generalizes accurately.

Regards,