pocokhc / agent57

Implementation code of Agent57 (reinforcement learning) created for a Qiita post.
MIT License

Trying to increase the number of actors for Alien-v0 #4

Open · Gaethje opened 4 years ago

Gaethje commented 4 years ago

I am trying to run Alien-v0 with 16 actors. My code looks like this:

(snip)
from agent.processor import AtariProcessor
(snip)
ENV_NAME = "Alien-v0"
(snip)
def create_parameter(env):
    processor = AtariProcessor()  

(snip)
class MyActor(ActorUser):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.1)

    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=0)
        env.close()
class MyActor1(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.4)

class MyActor2(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.24)

class MyActor3(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.15)

class MyActor4(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.09)

class MyActor5(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.056)

class MyActor6(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.034)

class MyActor7(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.021)

class MyActor8(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.013)
class MyActor9(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.008)

class MyActor10(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.004)

class MyActor11(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.003)

class MyActor12(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.001)
class MyActor13(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.00111)

class MyActor14(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.0006)

class MyActor15(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.0004)

class MyActor16(MyActor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.0002)
(snip)
kwargs["actors"] = [
    MyActor1,  MyActor2,  MyActor3,  MyActor4,
    MyActor5,  MyActor6,  MyActor7,  MyActor8,
    MyActor9,  MyActor10, MyActor11, MyActor12,
    MyActor13, MyActor14, MyActor15, MyActor16,
]

I have run this on my machine, and after about 4 hours the maximum reward is 970. My questions: is this the right way to increase the number of actors? How many actors can I run on my system, and how do I calculate that number? For example, the paper uses 256 actors; how much physical resource do I need to work with 256 actors? Also, what do you think about my selection of epsilon values for each actor? They are chosen according to the per-actor epsilon formula from the Ape-X paper (Distributed Prioritized Experience Replay, Section 4.1, Atari).

Great repository! thanks!

pocokhc commented 4 years ago

Thank you for sharing your training results. It's wonderful that you can run 16 actors!

Is this the right way to increase the number of actors? How many actors can I run on my system?

Yes it is.

As for physical resources, the allocation is one actor per CPU, so following the paper would require 256 CPUs.
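If you want a quick check of how many actors your machine can host under the one-actor-per-CPU rule, you can count the available CPUs (a standalone sketch, not repository code):

import multiprocessing

# One actor per CPU: the CPU count is a practical upper bound on the number of actors.
n_cpus = multiprocessing.cpu_count()
print("CPUs available:", n_cpus)  # e.g. running 16 actors would want at least 16 CPUs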

If you want to do this with this repository, the following should work, but I have not tested it because I don't have an environment to verify it on.

class MyActor(ActorUser):
   (snip)

class MyActor1(MyActor):
   allocate = "/device:CPU:0"

class MyActor2(MyActor):
   allocate = "/device:CPU:1"

class MyActor3(MyActor):
   allocate = "/device:CPU:2"

class MyActor4(MyActor):
   allocate = "/device:CPU:3"

...

Also, a mechanism to set "allocate" dynamically (e.g., in a for loop) is not implemented.
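As a user-side workaround (just a sketch with a placeholder base class, not repository code), the per-CPU subclasses could be generated in a loop with type():

# Hypothetical workaround: build MyActor1..MyActor12 programmatically.
# Replace this placeholder with the real MyActor(ActorUser) base class.
class MyActor:
    pass

actor_classes = []
for i in range(12):
    cls = type("MyActor{}".format(i + 1), (MyActor,), {"allocate": "/device:CPU:{}".format(i)})
    actor_classes.append(cls)

# kwargs["actors"] = actor_classes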

Also, what do you think about my selection of epsilon values for each actor?

This is also proposed in Ape-X (Distributed Prioritized Experience Replay), and each actor determines its epsilon by the formula from that paper.
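For reference, the per-actor epsilon in the Ape-X paper is epsilon_i = epsilon^(1 + alpha * i / (N - 1)) with epsilon = 0.4 and alpha = 7. A minimal standalone sketch of the values this gives for 16 actors (not repository code):

# Ape-X per-actor epsilon schedule: epsilon_i = epsilon ** (1 + alpha * i / (N - 1))
epsilon, alpha, N = 0.4, 7, 16
epsilons = [epsilon ** (1 + alpha * i / (N - 1)) for i in range(N)]
print(epsilons)  # 0.4 for actor 0 down to about 0.00066 for actor 15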

In the repository, this can be done with:

(snip)
from agent.policy import EpsilonGreedyActor
(snip)

class MyActor(ActorUser):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)

(snip)
kwargs["actors"] = []
for i in range(256):
    kwargs["actors"].append(MyActor)

I hope it helps you.

Gaethje commented 4 years ago

Thank you for the reply. One of my PCs has 12 CPUs, and I want to run 12 actors on those 12 CPUs. As you mentioned, the mechanism to set "allocate" dynamically is not implemented. Can you give me some idea of how I could do that, or could you implement it? Thanks.

pocokhc commented 4 years ago

I have implemented it. See commit 1d9f934 for the changes.

The usage is as follows.

Example 1: 12 CPUs, 12 actors

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)
(snip)

kwargs["actors"] = []
for i in range(12):
    kwargs["actors"].append(MyActor)

Example 2: 12 CPUs, 256 actors

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index % 12)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)
(snip)

kwargs["actors"] = []
for i in range(256):
    kwargs["actors"].append(MyActor)

Example 3: 12 CPUs, 12 actors + OriginalActor

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)
(snip)

class OriginalActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:0"

    def getPolicy(self, actor_index, actor_num):
        return AnnealingEpsilonGreedy(
            initial_epsilon=1.0,
            final_epsilon=0.01,
            exploration_steps=1_750_000
        )

    def fit(self, index, agent):
        (snip)
(snip)

kwargs["actors"] = []
for i in range(12):
    kwargs["actors"].append(MyActor)

kwargs["actors"].append(OriginalActor)

Thanks.

Gaethje commented 4 years ago

Thanks! I got two more questions.

  1. When I look at the main function of atari_pong.py, it seems like only DQN is running, not agent57.
    if True:
        run_dqn(enable_train=True)

    What is the reason behind that? Is DQN somehow linked with agent57?

  2. I tried to run agent57 instead of DQN (changing the main function of atari_pong.py).
    if True:
        run_agent57(enable_train=True)

    I tried with 12 actors dedicated to 12 CPUs, but it gets stuck. Why does it get stuck? Does it need more time to produce results?

pocokhc commented 4 years ago

1. What is the reason behind that? Is DQN somehow linked with agent57?

The difference between DQN and agent57 in my repository is whether multiprocessing is used. (The DQN version is a single Actor since it doesn't use multiprocessing.)

I call it DQN, but it is not the naive DQN (https://arxiv.org/pdf/1312.5602.pdf); it implements every method up to Agent57 except distributed training.

The reason for implementing it is that debugging was very difficult with multiprocessing, so the DQN version is mainly intended as an operation check.

2. But it gets stuck. Why does it get stuck?

First of all, please check if the log is output by the following procedure.

  1. Change the log output interval. The log_interval argument of run_gym_agent57 specifies the log output interval in seconds. Try changing it to 1 minute and see if anything is printed.

  2. Use the keras-rl log output function. This uses the logging functionality supported by keras-rl itself.

class MyActor(ActorUser):
    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=1)  # verbose 0→1
        env.close()

However, since this produces a lot of logs, it is better to enable output for only one actor as follows. Also note that this only outputs Actor logs.

class MyActor(ActorUser):
    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        if index == 0:
            verbose = 1
        else:
            verbose = 0
        agent.fit(env, visualize=False, verbose=verbose)
        env.close()

pocokhc commented 4 years ago

This commit (175c836) changed the UVFA implementation significantly. I am letting you know because it also affects the parameters.

  1. Renamed the input model variables:

    image_model     -> input_model
    image_model_emb -> input_model_emb
    image_model_rnd -> input_model_rnd

  2. Changed the flag from enabling/disabling the intrinsic reward to enabling/disabling the intrinsic action-value model:

    enable_intrinsic_reward -> enable_intrinsic_actval_model

  3. Added the ability to configure the UVFA input items.

    No input:

    uvfa_ext=[]  # Extrinsic reward
    uvfa_int=[]  # Intrinsic reward

    All inputs:

    uvfa_ext=[
      UvfaType.ACTION,
      UvfaType.REWARD_EXT,
      UvfaType.REWARD_INT,
      UvfaType.POLICY,
    ]
    uvfa_int=[
      UvfaType.ACTION,
      UvfaType.REWARD_EXT,
      UvfaType.REWARD_INT,
      UvfaType.POLICY,
    ]

  4. Added an option to append a null frame at the end of the episode:

    enable_add_episode_end_frame=True

    When enabled, a reward-0 frame is appended to the end of the episode (terminal=True).

  5. Added an option to change the policy executed during the test:

    test_policy = 0
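As a rough sketch of how the renamed and added parameters could appear together in a kwargs dict (hypothetical; the exact way the repository consumes these parameters may differ):

# Hypothetical kwargs sketch using only the parameter names listed above.
kwargs = {}
kwargs["enable_intrinsic_actval_model"] = True  # was enable_intrinsic_reward
kwargs["uvfa_ext"] = []                         # UVFA inputs for the extrinsic network (empty = no input)
kwargs["uvfa_int"] = []                         # UVFA inputs for the intrinsic network (empty = no input)
kwargs["enable_add_episode_end_frame"] = True   # append a reward-0 terminal frame
kwargs["test_policy"] = 0                       # policy index used during the test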
Gaethje commented 4 years ago

Thanks!

My result looks like this for my custom environment:

Actor10 End!
actor10 Train 406, Time: 10.11m, Reward : 3.59 - 13.47 (ave: 11.00), nb_steps: 0
Actor7 End!
actor7 Train 406, Time: 10.11m, Reward : -24.91 - 24.36 (ave: 6.60), nb_steps: 0
Actor4 End!
actor4 Train 406, Time: 10.11m, Reward : -23.05 - 24.74 (ave: 8.54), nb_steps: 0
Actor0 End!
actor0 Train 406, Time: 10.11m, Reward : -25.61 - 36.51 (ave: 3.79), nb_steps: 0
Actor3 End!
actor3 Train 406, Time: 10.11m, Reward : 19.61 - 65.45 (ave: 32.98), nb_steps: 0
Actor6 End!
actor6 Train 406, Time: 10.11m, Reward : -27.59 - 18.14 (ave: 3.69), nb_steps: 0
Actor11 End!
actor11 Train 406, Time: 10.11m, Reward : 2.82 - 13.47 (ave: 10.81), nb_steps: 0
Actor8 End!
actor8 Train 406, Time: 10.11m, Reward : 13.47 - 22.87 (ave: 16.03), nb_steps: 0
Actor2 End!
actor2 Train 406, Time: 10.11m, Reward : -13.12 - 39.41 (ave: 14.92), nb_steps: 0
Actor1 End!
actor1 Train 406, Time: 10.11m, Reward : -11.05 - 28.77 (ave: 13.34), nb_steps: 0
Actor9 End!
actor9 Train 406, Time: 10.11m, Reward : 12.61 - 65.34 (ave: 26.15), nb_steps: 0
Actor5 End!
actor5 Train 406, Time: 10.11m, Reward : -18.39 - 22.14 (ave: 6.89), nb_steps: 0
done, took 607.004 seconds
done, took 10.117 minutes
2020-09-02 12:15:00.410371: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-09-02 12:15:00.466537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3591955000 Hz
2020-09-02 12:15:00.469562: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555f1e239f60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-02 12:15:00.469592: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-02 12:15:00.524977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 12:15:00.524997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
Testing for 5 episodes ...
Traceback (most recent call last):
  File "test1.py", line 297, in <module>
    run_agent57(enable_train=True)
  File "test1.py", line 256, in run_agent57
    movie_save=False,
  File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/main_runner.py", line 180, in run_gym_agent57
    agent.test(env, nb_episodes=5, visualize=False)
  File "/home/initial/miniconda3/envs/agent57/lib/python3.7/site-packages/rl/core.py", line 342, in test
    action = self.forward(observation)
  File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/agent57.py", line 709, in forward
    self.exp_q.put(exp)
AttributeError: 'NoneType' object has no attribute 'put'

................

I am confused; is training completed?

pocokhc commented 4 years ago

The meaning of the log is as follows.

Indicates that each Actor has finished. For example, below is Actor5.

Actor5 End!

The result when the Actor ends.

actor5 Train 406, Time: 10.11m, Reward : -18.39 - 22.14 (ave: 6.89), nb_steps: 0

This shows that the Learner has run 406 training steps (so the Learner is running).

The following is output after the Learner and all Actors have finished, so training is over.

done, took 607.004 seconds  # Output by Manager of agent57
done, took 10.117 minutes  # Output by callbacks.TrainLogger

Output from TensorFlow (Keras).

2020-09-02 12:15:00.410371: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-09-02 12:15:00.466537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3591955000 Hz
2020-09-02 12:15:00.469562: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555f1e239f60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-02 12:15:00.469592: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-02 12:15:00.524977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 12:15:00.524997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]

This is the output of keras-rl's agent.test, indicating that main_runner's agent.test ran.

Testing for 5 episodes ...

The traceback occurs during the test.

File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/main_runner.py", line 180, in run_gym_agent57
agent.test(env, nb_episodes=5, visualize=False)

Indicates that self.exp_q is None in agent57.

File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/agent57.py", line 709, in forward
self.exp_q.put(exp)
AttributeError: 'NoneType' object has no attribute 'put'

Did you change the code in agent57.py? I can't think of a situation where "self.exp_q" would be "None" in the repository code. (I'm sorry if there is one.)

The expected execution result is as follows. (This is the execution result of examples/pendulum.py)

Using TensorFlow backend.

(snip tensorflow logs)

action_space      : Box(1,)
observation_space : Box(3,)
reward_range      : (-inf, inf)

(snip tensorflow logs)

nb_time  : 60.00m
nb_trains: 20000
--- start ---
'Ctrl + C' is stop.
GPU: enable
Using TensorFlow backend.

(snip tensorflow logs)

Actor0 Start!
Actor1 Start!
demo replay loaded, on_memory: 201, total reward: 189.38389544746022
Learner Start!
episode add, reward:59.6274 length: 201 on_memory: 201
episode add, reward:72.4180 length: 201 on_memory: 402
weight save, ave reward:69.7143
learner  Train 1, Time: 0.26m, TestReward:   60.03 -   77.03 (ave:   69.71)
episode add, reward:74.3184 length: 201 on_memory: 603
episode add, reward:81.7504 length: 201 on_memory: 804
actor1   Train 1, Time: 0.57m, Reward    :   45.34 -   76.31 (ave:   64.13), nb_steps: 4000
actor0   Train 1, Time: 0.57m, Reward    :   41.60 -   81.75 (ave:   64.42), nb_steps: 4000
episode add, reward:88.9197 length: 201 on_memory: 1000
episode add, reward:89.2470 length: 201 on_memory: 1000
episode add, reward:90.1501 length: 201 on_memory: 1000
episode add, reward:99.9183 length: 201 on_memory: 1000
episode add, reward:103.9109 length: 201 on_memory: 1000
episode add, reward:128.3833 length: 201 on_memory: 1000
learner  Train 1000, Time: 1.12m, TestReward:   41.70 -   80.50 (ave:   64.70)
actor1   Train 1022, Time: 1.43m, Reward    :   24.32 -  128.38 (ave:   69.20), nb_steps: 11200
actor0   Train 1022, Time: 1.43m, Reward    :   28.32 -  109.99 (ave:   74.70), nb_steps: 11000
weight save, ave reward:100.1413
learner  Train 2000, Time: 1.97m, TestReward:   61.38 -  139.78 (ave:  100.14)
actor1   Train 2005, Time: 2.27m, Reward    :   40.12 -  108.20 (ave:   80.50), nb_steps: 18000
actor0   Train 2007, Time: 2.28m, Reward    :   63.87 -  110.69 (ave:   87.59), nb_steps: 17600

(snip train logs)

learner  Train 18000, Time: 16.24m, TestReward:  172.98 -  199.84 (ave:  186.45)
actor1   Train 18000, Time: 16.56m, Reward    :  138.00 -  199.65 (ave:  175.77), nb_steps: 130000
actor0   Train 18006, Time: 16.58m, Reward    :  139.36 -  199.95 (ave:  180.87), nb_steps: 127200
learner  Train 19000, Time: 17.19m, TestReward:  170.52 -  199.90 (ave:  188.21)
actor1   Train 19000, Time: 17.51m, Reward    :  109.67 -  199.31 (ave:  172.41), nb_steps: 137200
actor0   Train 19001, Time: 17.52m, Reward    :  137.07 -  199.94 (ave:  175.35), nb_steps: 134200
learner  Train 20000, Time: 18.13m, TestReward:  172.59 -  189.10 (ave:  184.87)
episode add, reward:199.9639 length: 201 on_memory: 1000
Learning End. Train Count:20001
actor1   Train 20001, Time: 18.47m, Reward    :   57.17 -  189.29 (ave:  160.57), nb_steps: 144400
actor0   Train 20001, Time: 18.47m, Reward    :   64.01 -  199.96 (ave:  172.06), nb_steps: 141200
learner  Train 20001, Time: 18.47m, TestReward:  170.15 -  199.95 (ave:  183.15)
Actor1 End!
actor1   Train 20001, Time: 18.77m, Reward    :  142.04 -  199.67 (ave:  172.28), nb_steps: 0
Actor0 End!
actor0   Train 20001, Time: 18.77m, Reward    :  136.57 -  199.97 (ave:  184.06), nb_steps: 0
done, took 18.778 minutes

(That is the end of training. The test runs after that.)

Testing for 5 episodes ...
2020-09-02 21:54:25.537992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
Episode 1: reward: 179.033, steps: 200
Episode 2: reward: 188.682, steps: 200
Episode 3: reward: 170.629, steps: 200
Episode 4: reward: 172.840, steps: 200
Episode 5: reward: 188.490, steps: 200

I hope you find it helpful in troubleshooting.

Gaethje commented 4 years ago

Thanks, it's very helpful. I want to ask you some questions about the output structure.

  1. Why is the output printed after a certain period of time? Does it depend on steps? Which parameter decides when the output is printed? What is the meaning of Time here?

  2. What is the difference between TestReward and Reward? Why does it change in every environment?

pocokhc commented 4 years ago

Thank you for your question.

  1. Why is the output printed after a certain period of time? Does it depend on steps?

This is implemented by callbacks.TrainLogger and callbacks.DisTrainLogger. (These are created by extending keras-rl's callbacks; "Dis" is short for Distributed and is meant to be used by agent57.)

DisTrainLogger has the following arguments, and run_gym_agent57 has equivalent arguments.

  1. logger_type

    1. LoggerType.TIME: the output interval is measured in time
    2. LoggerType.STEP: the output interval is measured in steps
  2. interval (log_interval): the meaning depends on logger_type. If it is TIME, it is the output interval in seconds; if it is STEP, it is the output interval in steps.

The output timing can be changed by changing these values.
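For example (a hypothetical sketch; the real LoggerType enum and the exact run_gym_agent57 signature live in the repository and may differ), the two modes amount to:

# Time-based logging: one log line roughly every 60 seconds.
log_config_time = {"logger_type": "TIME", "log_interval": 60}

# Step-based logging: one log line every 1000 steps.
log_config_step = {"logger_type": "STEP", "log_interval": 1000}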

  2. What is the difference between TestReward and Reward? Why does it change in every environment?

"Reward" is calculated from the rewards collected during training. "TestReward" is the reward obtained when the same environment is run as a test.

The differences in my repository are as follows.

|                                    | Reward                                          | TestReward                       |
| ---------------------------------- | ----------------------------------------------- | -------------------------------- |
| How the action is selected         | Follows the action policy (EpsilonGreedy, etc.) | Maximum Q value                  |
| Policy selection (for NGU/agent57) | Determined by UCB                               | Policy 0 (no exploration policy) |