Closed puyuan1996 closed 12 months ago
These are all excellent questions.
Action Space
Yes, you are right: we use a discrete action space where an action is an assembly instruction, often taking the form <ASSEMBLY_OP, LOCATION_1, LOCATION_2>. Choosing a sensible number of operations brings the number of instructions to 100-200, which is manageable for AlphaZero. The scalability challenge you're raising is a fair argument, but there are a few ways we can get around that:
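As a rough illustration of how such an action space stays in the 100-200 range, here is a minimal sketch that enumerates <ASSEMBLY_OP, LOCATION_1, LOCATION_2> tuples. The opcodes, location names, and pruning rule are illustrative assumptions, not AlphaDev's actual instruction set or pruning logic.

```python
from itertools import product

# Assumed subset of x86-style ops and register/memory slots; these names
# are hypothetical and only stand in for the real instruction set.
OPS = ["mov", "cmp", "cmovl", "cmovg"]
LOCATIONS = ["P", "Q", "R", "S", "T", "U", "V"]

# Each action is a tuple <ASSEMBLY_OP, LOCATION_1, LOCATION_2>.
# A trivial pruning rule (skip same-operand pairs) already keeps the
# space compact enough for AlphaZero-style search.
ACTIONS = [
    (op, a, b)
    for op, a, b in product(OPS, LOCATIONS, LOCATIONS)
    if a != b
]

print(len(ACTIONS))  # 4 ops * 7 * 6 distinct location pairs = 168
```

With four ops and seven locations this gives 168 actions, squarely in the range mentioned above; the real pruning rules would be semantic (e.g. removing redundant or illegal operand combinations) rather than this purely syntactic one.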
MultiQuery Transformer
If I remember correctly, we did experiment with a classic Transformer as well but didn't notice any improvement in quality (at least in our setup). Given we already had a well-tested implementation of MultiQuery attention, and given it seemed faster, we went with it. My intuition is that, in this setup, there is more headroom in better exploration methods than in better modelling of the state space (even though the two can be quite related).
Correctness and Latency
A key insight is that we reward for latency only after we discover a correct algorithm. This is because giving a latency reward from the start incentivises the agent to discover very fast but useless programs. Some clever re-weighting, or a constrained-RL formulation, could address that. However, the agent seemed to do quite well when given the latency reward only for correct programs.
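A minimal sketch of that two-stage reward, assuming a simple additive form: correctness (fraction of test cases sorted) drives learning first, and latency only contributes once the program is fully correct. The exact weights and the latency transform here are assumptions, not the paper's formulation.

```python
def reward(num_sorted: int, num_total: int, latency_ns: float) -> float:
    """Correctness-gated reward: latency only counts for correct programs."""
    correctness = num_sorted / num_total  # fraction of test cases sorted
    if num_sorted < num_total:
        # Incorrect program: no latency signal at all, so the agent
        # cannot trade correctness for speed.
        return correctness
    # Correct program: add a bonus that grows as latency shrinks.
    return correctness + 1.0 / (1.0 + latency_ns)
```

The gating is the key design choice: an incorrect program never sees the latency term, so "very fast but useless" programs cannot outscore slower correct ones.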
The number of steps depends on the algorithm and its complexity. In fixed-length sorting we can define a very natural reward function (i.e. how many items are currently sorted), which leads to a smooth path towards discovering a correct algorithm. Variable sort is slightly more complex: there is only a single permutation of size 2 which is not sorted (i.e. [2, 1]), but many for larger arrays. To sort that single permutation you need an extra branch, which is difficult to discover; this often led to algorithms that would sort all permutations but one. Instead of treating all permutations of any size equally, we group the permutations by size and reweight the reward based on how many permutations there are of a given size. This makes sure that the agent does not put too much weight on the larger arrays.
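The reweighting idea above can be sketched as follows: normalise each size's contribution by how many permutations that size has, so the single unsorted size-2 case ([2, 1]) carries as much weight as all size-n permutations combined. The grouping and normalisation scheme here is an assumed illustration, not the paper's exact formula.

```python
from math import factorial

def size_weighted_reward(solved_by_size: dict[int, int], max_size: int) -> float:
    """Correctness reward reweighted by array size.

    solved_by_size[n] = number of size-n test permutations sorted correctly.
    Each size contributes equally regardless of how many permutations it
    has, so the lone unsorted size-2 permutation is not drowned out.
    """
    total = 0.0
    for n in range(2, max_size + 1):
        num_perms = factorial(n)  # n! permutations of size n
        total += solved_by_size.get(n, 0) / num_perms
    return total / (max_size - 1)  # average over the sizes considered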
Thanks again for your great questions!
Hello,
Thank you very much for your comprehensive and informative response, which has successfully addressed my numerous concerns. I appreciate your efforts in advancing RL and promoting scientific development!
Best wishes.
Hello,
Thank you very much for your great work. I have 3 questions and hope to get your answers.
As far as I know, complex and large action spaces are a significant challenge for MCTS algorithms. For the sort3 algorithm, from the paper and pseudocode, it seems that you are using the original discrete action space. After action-space pruning, what is the actual action-space size of AlphaDev? Also, for sorting algorithms over larger input arrays, the action space should grow exponentially. How do you mitigate this problem?
Why did you choose the MultiQuery Transformer for encoding assembly algorithms? What are the similarities and differences in performance compared to the classic Transformer?
During the learning process, how do the correctness rewards and latency rewards change with the increase of training steps? If possible, could you share some training details?
Once again, thank you for your interesting and inspiring work.
Best wishes.