Human-level control through deep reinforcement learning

:clipboard: Please provide the paper's information.

Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al.
2015-02-26

:page_with_curl: Abstract (original text)

The theory of reinforcement learning provides a normative account [1], deeply rooted in psychological [2] and neuroscientific [3] perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems [4,5], the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms [3]. While reinforcement learning agents have achieved some successes in a variety of domains [6,7,8], their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks [9,10,11] to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games [12]. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

:mag_right: Please introduce the paper.

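An algorithm devised by Google DeepMind that combines reinforcement learning with deep neural networks (DNNs).
Until now, reinforcement learning had been limited to domains with fully observed, low-dimensional state spaces, or to domains that need a lot of handcrafting. This paper proposes the DQN (deep Q-network), which uses several techniques to learn successful policies directly from high-dimensional sensory inputs.
Watching it learn to play Atari 2600 games, a high-dimensional domain, at (or above) human level is really fascinating!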
https://www.youtube.com/watch?v=V1eYniJ0Rnk&feature=youtu.be

:key: Please list the key keywords.

Action-value(Q) function, DQN, CNN

:paperclip: URL

https://www.nature.com/articles/nature14236

:bulb: What is the method?

In reinforcement learning, using a non-linear function approximator such as a neural network for the action-value (Q) function makes training unstable and not robust. The reasons are listed below, together with how the paper addresses each one.

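Data correlation: Because the inputs are collected sequentially over time, consecutive samples are strongly correlated. To address this, the paper uses experience replay: stored transitions are sampled uniformly at random and grouped into mini-batches, which breaks the correlation between samples. A further benefit is that past data can be reused for learning.
The problem with bootstrapping: The MSE is computed from the difference between the target (ground-truth) value and the predicted value, but in DQN the target that the update aims at keeps changing. To prevent this, the target network that produces the targets is held fixed and is updated to match the prediction network only at fixed intervals.
Simplifying the loss function: A Huber loss is used as the loss function, which makes the model robust to outliers and training more stable.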
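
The three fixes above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the names `ReplayBuffer`, `huber_loss`, `td_target`, and `maybe_sync_target` are hypothetical, network parameters are stood in for by a plain dict, and the default values (`delta=1.0`, `gamma=0.99`) are illustrative assumptions.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition once full.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def huber_loss(td_error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it (robust to outliers)."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_err - 0.5 * delta)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Bootstrapped target; next_q_values should come from the frozen target network."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def maybe_sync_target(step, online_params, target_params, sync_every):
    """Return a copy of the online parameters every `sync_every` steps, else keep the old target."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params
```

In a training loop, one would typically call `buffer.add(...)` after every environment step, draw a mini-batch with `buffer.sample(...)`, compute `huber_loss(td_target(...) - q_prediction)` per transition, and pass the step counter through `maybe_sync_target` so the targets stay fixed between syncs.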

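For a human-like agent, raw visual data must be usable directly as network input. To handle the complex game screen, the frames are preprocessed: the 60 fps screen is reduced to 84x84 and converted to grey-scale, then only every k-th frame is kept (frame skipping) and the m most recent kept frames are stacked to form the input.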
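
The preprocessing described above could look roughly like the following. It is a sketch under stated assumptions: frames are nested lists of `(r, g, b)` tuples, the luminance weights and nearest-neighbour downsampling are my choices rather than details taken from these notes, and `FrameStacker` with defaults `k=4`, `m=4` is a hypothetical helper with illustrative values.

```python
from collections import deque


def to_grayscale(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def resize_nearest(image, out_h=84, out_w=84):
    """Nearest-neighbour downsampling to out_h x out_w."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


class FrameStacker:
    """Keeps every k-th frame and stacks the m most recent kept frames."""

    def __init__(self, k=4, m=4):
        self.k = k
        self.frames = deque(maxlen=m)  # oldest kept frame falls out automatically
        self.count = 0

    def push(self, rgb_frame):
        """Process one raw frame; return the current stack once m frames are kept, else None."""
        keep = self.count % self.k == 0
        self.count += 1
        if keep:
            self.frames.append(resize_nearest(to_grayscale(rgb_frame)))
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None
```

Feeding raw emulator frames to `push` one at a time yields `None` until enough frames have been kept, and afterwards a list of m grey-scale 84x84 images that can serve as the network input.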

Rewards are clipped to -1, 0, and 1. Learning performance may drop slightly, but this makes it possible to train on many games with the same learning rate.
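
As a small illustration of the clipping described above, the function below keeps only the sign of the raw game reward. The name `clip_reward` is hypothetical and this is just one straightforward way to write it.

```python
def clip_reward(reward):
    """Map any reward to -1, 0, or 1 by keeping only its sign."""
    if reward > 0:
        return 1
    if reward < 0:
        return -1
    return 0


# A large score gain and a small one look the same to the agent after clipping.
clipped = [clip_reward(r) for r in (200, 1, 0, -5)]  # [1, 1, 0, -1]
```

The trade-off is visible in the example: rewards of 200 and 1 become indistinguishable, but the scale of the error signal is now the same across games.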

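:chart_with_upwards_trend: What were the experiments and their results?

:open_file_folder: What are the future research directions and areas for improvement?

:thumbsup: What is the novelty, and what did you learn from the paper?

:loop: Do you have any questions, or references worth reading next?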