Closed puyuan1996 closed 12 months ago
These are all excellent questions.
Action Space
Yes, you are right: we use a discrete action space where an action is an assembly instruction, often taking the form <ASSEMBLY_OP, LOCATION_1, LOCATION_2>. Choosing a sensible number of operations brings the number of instructions to 100-200, which is manageable for AlphaZero. The scalability challenge you're raising is a fair argument, but there are a few ways we can get around that:
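As a rough illustration of how such an action space stays in the 100-200 range, here is a minimal sketch that enumerates <ASSEMBLY_OP, LOCATION_1, LOCATION_2> tuples. The opcodes, location names, and pruning rule are illustrative assumptions, not AlphaDev's actual instruction set or pruning logic.

```python
from itertools import product

# Assumed subset of x86-style ops and register/memory slots; these names
# are hypothetical and only stand in for the real instruction set.
OPS = ["mov", "cmp", "cmovl", "cmovg"]
LOCATIONS = ["P", "Q", "R", "S", "T", "U", "V"]

# Each action is a tuple <ASSEMBLY_OP, LOCATION_1, LOCATION_2>.
# A trivial pruning rule (skip same-operand pairs) already keeps the
# space compact enough for AlphaZero-style search.
ACTIONS = [
    (op, a, b)
    for op, a, b in product(OPS, LOCATIONS, LOCATIONS)
    if a != b
]

print(len(ACTIONS))  # 4 ops * 7 * 6 distinct location pairs = 168
```

With four ops and seven locations this gives 168 actions, squarely in the range mentioned above; the real pruning rules would be semantic (e.g. removing redundant or illegal operand combinations) rather than this purely syntactic one.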
MultiQuery Transformer
If I remember correctly, we did experiment with a classic Transformer as well but didn't notice any improvement in quality (at least in our setup). Given we already had a well-tested implementation of MultiQuery attention, and given it seemed faster, we went with it. My intuition is that, in this setup, there is more headroom in better exploration methods than in better modelling of the state space (even though the two can be quite related).
Correctness and Latency
A key insight is that we reward for latency only after we discover a correct algorithm. This is because giving a latency reward from the start incentivises the agent to discover very fast but useless programs. Some clever re-weighting, or a constrained-RL formulation, could address that. However, the agent seemed to do quite well when given the latency reward only for correct programs.
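A minimal sketch of that two-stage reward, assuming a simple additive form: correctness (fraction of test cases sorted) drives learning first, and latency only contributes once the program is fully correct. The exact weights and the latency transform here are assumptions, not the paper's formulation.

```python
def reward(num_sorted: int, num_total: int, latency_ns: float) -> float:
    """Correctness-gated reward: latency only counts for correct programs."""
    correctness = num_sorted / num_total  # fraction of test cases sorted
    if num_sorted < num_total:
        # Incorrect program: no latency signal at all, so the agent
        # cannot trade correctness for speed.
        return correctness
    # Correct program: add a bonus that grows as latency shrinks.
    return correctness + 1.0 / (1.0 + latency_ns)
```

The gating is the key design choice: an incorrect program never sees the latency term, so "very fast but useless" programs cannot outscore slower correct ones.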
The number of steps depends on the algorithm and its complexity. In fixed-length sorting we can define a very natural reward function (i.e. how many items are currently sorted), which leads to a smooth path towards discovering a correct algorithm. Variable sort is slightly more complex: there is only a single permutation of size 2 which is not sorted (i.e. [2, 1]), but many for larger arrays. To sort that single permutation you need an extra branch, which is difficult to discover; this often led to algorithms that would sort all permutations but one. Instead of treating all permutations of any size equally, we group the permutations by size and reweight the reward based on how many permutations there are of a given size. This makes sure that the agent does not put too much weight on the larger arrays.
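The reweighting idea above can be sketched as follows: normalise each size's contribution by how many permutations that size has, so the single unsorted size-2 case ([2, 1]) carries as much weight as all size-n permutations combined. The grouping and normalisation scheme here is an assumed illustration, not the paper's exact formula.

```python
from math import factorial

def size_weighted_reward(solved_by_size: dict[int, int], max_size: int) -> float:
    """Correctness reward reweighted by array size.

    solved_by_size[n] = number of size-n test permutations sorted correctly.
    Each size contributes equally regardless of how many permutations it
    has, so the lone unsorted size-2 permutation is not drowned out.
    """
    total = 0.0
    for n in range(2, max_size + 1):
        num_perms = factorial(n)  # n! permutations of size n
        total += solved_by_size.get(n, 0) / num_perms
    return total / (max_size - 1)  # average over the sizes considered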
Thanks again for your great questions!
Hello,
Thank you very much for your comprehensive and informative response, which has successfully addressed my numerous concerns. I appreciate your efforts in advancing RL and promoting scientific development!
Best wishes.
Hello,
Thank you very much for your great work. I have 3 questions and hope to get your answers.
As far as I know, complex and large action spaces are a significant challenge for MCTS algorithms. For the sort3 algorithm, from the paper and pseudocode, it seems that you are using the original discrete action space. After action-space pruning, what is the actual action-space size of AlphaDev? Also, for sorting algorithms over larger input arrays, the action space should grow exponentially. How do you mitigate this problem?
Why did you choose the MultiQuery Transformer for encoding assembly algorithms? What are the similarities and differences in performance compared to the classic Transformer?
During the learning process, how do the correctness rewards and latency rewards change with the increase of training steps? If possible, could you share some training details?
Once again, thank you for your interesting and inspiring work.
Best wishes.