Multi-armed Bandits | ghx's - Githubissues

pluto-the-lost / pluto-the-lost.github.io


Multi-armed Bandits | ghx's #7

Open pluto-the-lost opened 2 years ago

pluto-the-lost commented 2 years ago

https://pluto-the-lost.github.io/blogs/2019/07/10/%E5%A4%9A%E8%87%82%E8%B5%8C%E5%8D%9A%E6%9C%BA/#

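Multi-armed Bandits

1.1 Problem Description

The biggest difference between reinforcement learning (RL) and supervised learning (SL) lies in the feedback given for an action: RL gives an evaluation, while SL gives a judgment, or instruction. In other words, RL uses a value function to tell you how good an action is, but it does not tell you whether that action is the best or the worst one available. SL does the opposite: it tells you which action is correct. There are of course settings where evaluation and instruction can be combined to train a model, but here we start with the multi-armed bandit to show RL's "evaluative" character and to introduce some of the most basic RL methods.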
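To make the distinction concrete, here is a minimal sketch of a bandit that exposes both kinds of feedback. Everything in it is an assumption made for illustration and does not come from the original post: the `Bandit` class, its `pull` and `best_arm` methods, the three arm means, and the unit-variance Gaussian reward noise. `pull` plays the role of evaluative feedback (a noisy score for the arm you chose, and nothing about the other arms), while `best_arm` plays the role of the instructive label that a supervised learner would be handed directly.

```python
import random


class Bandit:
    """Hypothetical k-armed bandit with Gaussian rewards (illustrative only)."""

    def __init__(self, means, seed=0):
        self.means = list(means)          # true mean reward of each arm (hidden from the learner)
        self.rng = random.Random(seed)    # private RNG so the run is reproducible

    def pull(self, arm):
        # Evaluative feedback: a noisy reward for the chosen arm only.
        # It says how good this pull was, not whether the arm is the best one.
        return self.rng.gauss(self.means[arm], 1.0)

    def best_arm(self):
        # Instructive feedback: the "correct answer" a supervised label would give.
        # An RL agent never gets to call this.
        return max(range(len(self.means)), key=lambda a: self.means[a])


def estimate_values(bandit, pulls_per_arm=1000):
    """Estimate each arm's value by averaging its sampled rewards."""
    estimates = []
    for arm in range(len(bandit.means)):
        total = sum(bandit.pull(arm) for _ in range(pulls_per_arm))
        estimates.append(total / pulls_per_arm)
    return estimates


bandit = Bandit(means=[0.2, 1.0, -0.5])   # assumed arm means
estimates = estimate_values(bandit)
greedy_arm = max(range(len(estimates)), key=lambda a: estimates[a])

print("estimated values:", [round(v, 2) for v in estimates])
print("arm chosen from evaluations:", greedy_arm)
print("arm an instructive label would name:", bandit.best_arm())
```

With only evaluative feedback, the learner has to try every arm repeatedly and compare the averages before it can guess which arm is best. An instructive signal would name that arm after a single query.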