google-deepmind / alphadev

Apache License 2.0
674 stars 67 forks source link

Questions about algorithm design #6

Closed puyuan1996 closed 12 months ago

puyuan1996 commented 1 year ago

Hello,

Thank you very much for your great work. I have 3 questions and hope to get your answers.

Once again, thank you for your interesting and inspiring work.

Best wishes.

AndreaMichi commented 1 year ago

These are all excellent questions.

Action Space Yes, you are right, we use discrete action space where an action is an assembly instruction, often taking the form of <ASSEMBLY_OP, LOCATION_1, LOCATION_2>. Choosing a sensible number of operations brings the number of instructions to 100-200 which is sensible for AlphaZero. The scalability challenge you're raising is a fair argument, but there are a few ways we can get around that:

MultiQuery Transfomer

If I remember correctly, we did play with a classic Transformer as well but didn't notice any improvement in quality (at least in our setup). Given we had already a well-tested implementation of MultiQuery and given it seemed faster we went with it. My intuition is that, in this setup, there is more headroom in better exploration methods than in better modelling for the state space (even though the two can be quite related).

Correctness and Latency

A key insight is that we reward for latency only after we discover a correct algorithm. This is because giving a reward for latency from the start incentivises the agent to discover very fast but useless programs. Some clever re-weighting could address that or a formulation with constrained RL. However, the agent seemed to do quite well when given the latency reward only for correct programs.

The number of steps depends on the algorithm and complexity. In fixed sorting we can define a very natural reward function (i.e. how many items are currently sorted) which leads to a smooth path towards discovering a correct algorithm. In Variable Sort, this is slightly more complex as there is only a single permutation of size 2 which is not sorted (i.e. [2, 1]), but many for larger arrays. To sort that single permutation you need an extra branching which is difficult to discover. This often lead to algorithms that would sort all permutations but one. Instead of treating all permutations of any size equally, we group the permutation by size and reweigh the reward based on how many permutations there are of a certain size. This makes sure that the agent puts too much weight on the larger arrays.

Thanks again for your great questions!

puyuan1996 commented 1 year ago

Hello,

Thank you very much for your comprehensive and informative response, which has successfully addressed my numerous concerns. I appreciate your efforts in advancing RL and promoting scientific development!

Best wishes.