Gaethje opened this issue 4 years ago (status: Open)
Thank you for sharing your learning results. It's wonderful that you can run 16 actors!

My question is, is this the right way to increase the number of actors? Another question is, how many actors can I run on my system?

Yes, it is. As for physical resource allocation, it is one actor per CPU, so following the paper exactly would require 256 CPUs.
If you want to do this with the current repository, the code below should work, but I have not tested it because I don't have a verification environment.
class MyActor(ActorUser):
    (snip)

class MyActor1(MyActor):
    allocate = "/device:CPU:0"

class MyActor2(MyActor):
    allocate = "/device:CPU:1"

class MyActor3(MyActor):
    allocate = "/device:CPU:2"

class MyActor4(MyActor):
    allocate = "/device:CPU:3"

...
Also, there is currently no mechanism to set "allocate" dynamically (e.g., in a for loop).

Also, what do you think about my selection of epsilon values for each actor?

This is also proposed in Ape-X ("Distributed Prioritized Experience Replay"): each of N actors determines its epsilon as eps_i = epsilon^(1 + alpha * i / (N - 1)), with epsilon = 0.4 and alpha = 7 in the paper. In code it can be done like this:
(snip)
from agent.policy import EpsilonGreedyActor
(snip)

class MyActor(ActorUser):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)

(snip)
kwargs["actors"] = []
for i in range(256):
    kwargs["actors"].append(MyActor)
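As a sanity check, the Ape-X epsilon schedule that the `epsilon=0.4, alpha=7` arguments correspond to can be computed directly. This is a standalone sketch; the function name is illustrative and not part of the repository:

```python
def apex_epsilon(actor_index, actor_num, epsilon=0.4, alpha=7):
    """Ape-X per-actor epsilon: eps_i = epsilon ** (1 + alpha * i / (N - 1)).

    Actor 0 gets the largest epsilon (most exploration); the last actor
    gets epsilon ** (1 + alpha), which is close to greedy.
    """
    if actor_num <= 1:
        return epsilon
    return epsilon ** (1 + alpha * actor_index / (actor_num - 1))

# With 256 actors: actor 0 -> 0.4, actor 255 -> 0.4 ** 8 (about 0.00066)
```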
I hope it helps you.
Thank you for the reply. One of my PCs has 12 CPUs, and I want to run 12 actors on those 12 CPUs. As you mentioned, the mechanism to set "allocate" dynamically is not implemented. Can you give me some idea of how I can do that, or can you implement it? Thanks.
I implemented it. See the commit (1d9f934) for the changes.
Usage is as follows.
Example 1: 12 CPUs, 12 actors

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)

(snip)
kwargs["actors"] = []
for i in range(12):
    kwargs["actors"].append(MyActor)
Example 2: 12 CPUs, 256 actors

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index % 12)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)

(snip)
kwargs["actors"] = []
for i in range(256):
    kwargs["actors"].append(MyActor)
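To see what the round-robin mapping in Example 2 produces, the allocate expression can be exercised on its own (this is just the format string from the example extracted into a standalone function; `n_cpus` is an illustrative parameter):

```python
def allocate(actor_index, actor_num, n_cpus=12):
    # Round-robin over the available CPUs: actors 0..11 map to CPU 0..11,
    # actor 12 wraps back to CPU 0, actor 13 to CPU 1, and so on.
    return "/device:CPU:{}".format(actor_index % n_cpus)
```

So with 256 actors on 12 CPUs, each CPU hosts 21 or 22 actors.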
Example 3: 12 CPUs, 12 actors + OriginalActor

class MyActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:{}".format(actor_index)

    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)

    def fit(self, index, agent):
        (snip)

(snip)

class OriginalActor(ActorUser):
    @staticmethod
    def allocate(actor_index, actor_num):
        return "/device:CPU:0"

    def getPolicy(self, actor_index, actor_num):
        return AnnealingEpsilonGreedy(
            initial_epsilon=1.0,
            final_epsilon=0.01,
            exploration_steps=1_750_000
        )

    def fit(self, index, agent):
        (snip)

(snip)
kwargs["actors"] = []
for i in range(12):
    kwargs["actors"].append(MyActor)
kwargs["actors"].append(OriginalActor)
Thanks.
Thanks! I have two more questions.

if True:
    run_dqn(enable_train=True)

What is the reason behind that? Is DQN somehow linked with Agent57?

if True:
    run_agent57(enable_train=True)

I tried with 12 actors dedicated to 12 CPUs, but it gets stuck. Why does it get stuck? Does it need more time to produce a result?
1. What is the reason behind that? Is DQN somehow linked with Agent57?
The difference between DQN and agent57 in my repository is whether multiprocessing is used. (DQN is a single actor because it does not use multiprocessing.)
I call it DQN, but it is not the naive DQN (https://arxiv.org/pdf/1312.5602.pdf); it implements all the techniques up to Agent57 except distributed training.
The reason for implementing this DQN is that debugging was very difficult with multiprocessing. So DQN mainly serves as an operation check.
2. But it gets stuck. Why does it get stuck?
First of all, please check whether any log is output, using the following procedures.
Change the log output interval: the log_interval argument of run_gym_agent57 specifies the log output interval in seconds. Try changing it to one minute and see if there is any output.
Use the keras-rl log output function: this uses the log output feature officially supported by keras-rl.
class MyActor(ActorUser):
    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=1)  # verbose 0 -> 1
        env.close()
However, since this produces a lot of log output, it is better to enable it for only one actor, as below. Also note that this only outputs actor logs.
class MyActor(ActorUser):
    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        if index == 0:
            verbose = 1
        else:
            verbose = 0
        agent.fit(env, visualize=False, verbose=verbose)
        env.close()
This commit (175c836) changed the UVFA implementation significantly. I am letting you know because it also affects the parameters.
Changed the input model variable names:
image_model
image_model_emb
image_model_rnd
↓
input_model
input_model_emb
input_model_rnd
Changed the flag from enabling/disabling the intrinsic reward to enabling/disabling the intrinsic action-value model:
enable_intrinsic_reward
↓
enable_intrinsic_actval_model
Added the ability to configure which items are input to the UVFA:
No inputs:

uvfa_ext=[]  # Extrinsic reward
uvfa_int=[]  # Intrinsic reward

All inputs:

uvfa_ext=[
    UvfaType.ACTION,
    UvfaType.REWARD_EXT,
    UvfaType.REWARD_INT,
    UvfaType.POLICY,
]
uvfa_int=[
    UvfaType.ACTION,
    UvfaType.REWARD_EXT,
    UvfaType.REWARD_INT,
    UvfaType.POLICY,
]
enable_add_episode_end_frame=True
When enabled, it appends a reward-0 frame (terminal=True) to the end of the episode.
test_policy = 0
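A rough sketch of what enable_add_episode_end_frame appears to do, based on the description above. The function and dictionary keys are illustrative, not the repository's actual implementation; see the commit for the real code:

```python
def add_episode_end_frame(episode):
    """Append a zero-reward terminal frame that repeats the last observation."""
    last = episode[-1]
    episode.append({
        "observation": last["observation"],  # repeat the final observation
        "reward": 0.0,                       # the added frame carries no reward
        "terminal": True,                    # and marks the episode end
    })
    return episode
```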
Thanks!
My result looks like this for my custom environment:
Actor10 End!
actor10 Train 406, Time: 10.11m, Reward : 3.59 - 13.47 (ave: 11.00), nb_steps: 0
Actor7 End!
actor7 Train 406, Time: 10.11m, Reward : -24.91 - 24.36 (ave: 6.60), nb_steps: 0
Actor4 End!
actor4 Train 406, Time: 10.11m, Reward : -23.05 - 24.74 (ave: 8.54), nb_steps: 0
Actor0 End!
actor0 Train 406, Time: 10.11m, Reward : -25.61 - 36.51 (ave: 3.79), nb_steps: 0
Actor3 End!
actor3 Train 406, Time: 10.11m, Reward : 19.61 - 65.45 (ave: 32.98), nb_steps: 0
Actor6 End!
actor6 Train 406, Time: 10.11m, Reward : -27.59 - 18.14 (ave: 3.69), nb_steps: 0
Actor11 End!
actor11 Train 406, Time: 10.11m, Reward : 2.82 - 13.47 (ave: 10.81), nb_steps: 0
Actor8 End!
actor8 Train 406, Time: 10.11m, Reward : 13.47 - 22.87 (ave: 16.03), nb_steps: 0
Actor2 End!
actor2 Train 406, Time: 10.11m, Reward : -13.12 - 39.41 (ave: 14.92), nb_steps: 0
Actor1 End!
actor1 Train 406, Time: 10.11m, Reward : -11.05 - 28.77 (ave: 13.34), nb_steps: 0
Actor9 End!
actor9 Train 406, Time: 10.11m, Reward : 12.61 - 65.34 (ave: 26.15), nb_steps: 0
Actor5 End!
actor5 Train 406, Time: 10.11m, Reward : -18.39 - 22.14 (ave: 6.89), nb_steps: 0
done, took 607.004 seconds
done, took 10.117 minutes
2020-09-02 12:15:00.410371: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-09-02 12:15:00.466537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3591955000 Hz
2020-09-02 12:15:00.469562: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555f1e239f60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-02 12:15:00.469592: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-02 12:15:00.524977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 12:15:00.524997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
Testing for 5 episodes ...
Traceback (most recent call last):
File "test1.py", line 297, in
................
I am confused; is training complete?
The meaning of the log is as follows.
This indicates that each Actor has finished. For example, the line below is for Actor 5:
Actor5 End!
This is the result printed when the Actor ends:
actor5 Train 406, Time: 10.11m, Reward : -18.39 - 22.14 (ave: 6.89), nb_steps: 0
Here, "Train 406" means the Learner has run 406 training steps (so the Learner is running).
The following is output after the Learner and all Actors have finished, so training is over:
done, took 607.004 seconds # Output by Manager of agent57
done, took 10.117 minutes # Output by callbacks.TrainLogger
This part is output by TensorFlow (Keras):
2020-09-02 12:15:00.410371: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-09-02 12:15:00.466537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3591955000 Hz
2020-09-02 12:15:00.469562: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555f1e239f60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-02 12:15:00.469592: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-02 12:15:00.524977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 12:15:00.524997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
This is the output of keras-rl's agent.test; it indicates that agent.test in main_runner ran:
Testing for 5 episodes ...
The traceback occurs during the test:
File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/main_runner.py", line 180, in run_gym_agent57
agent.test(env, nb_episodes=5, visualize=False)
This indicates that self.exp_q is None in agent57:
File "/home/initial/ws_faisal/ag57cp/test/gym-motionplan/agent/agent57.py", line 709, in forward
self.exp_q.put(exp)
AttributeError: 'NoneType' object has no attribute 'put'
Did you change the code in agent57.py? I can't think of a situation where "self.exp_q" would be "None" with the repository code as-is. (I'm sorry if there is one.)
The expected execution result is as follows (this is the output of examples/pendulum.py):
Using TensorFlow backend.
(snip tensorflow logs)
action_space : Box(1,)
observation_space : Box(3,)
reward_range : (-inf, inf)
(snip tensorflow logs)
nb_time : 60.00m
nb_trains: 20000
--- start ---
'Ctrl + C' is stop.
GPU: enable
Using TensorFlow backend.
(snip tensorflow logs)
Actor0 Start!
Actor1 Start!
demo replay loaded, on_memory: 201, total reward: 189.38389544746022
Learner Start!
episode add, reward:59.6274 length: 201 on_memory: 201
episode add, reward:72.4180 length: 201 on_memory: 402
weight save, ave reward:69.7143
learner Train 1, Time: 0.26m, TestReward: 60.03 - 77.03 (ave: 69.71)
episode add, reward:74.3184 length: 201 on_memory: 603
episode add, reward:81.7504 length: 201 on_memory: 804
actor1 Train 1, Time: 0.57m, Reward : 45.34 - 76.31 (ave: 64.13), nb_steps: 4000
actor0 Train 1, Time: 0.57m, Reward : 41.60 - 81.75 (ave: 64.42), nb_steps: 4000
episode add, reward:88.9197 length: 201 on_memory: 1000
episode add, reward:89.2470 length: 201 on_memory: 1000
episode add, reward:90.1501 length: 201 on_memory: 1000
episode add, reward:99.9183 length: 201 on_memory: 1000
episode add, reward:103.9109 length: 201 on_memory: 1000
episode add, reward:128.3833 length: 201 on_memory: 1000
learner Train 1000, Time: 1.12m, TestReward: 41.70 - 80.50 (ave: 64.70)
actor1 Train 1022, Time: 1.43m, Reward : 24.32 - 128.38 (ave: 69.20), nb_steps: 11200
actor0 Train 1022, Time: 1.43m, Reward : 28.32 - 109.99 (ave: 74.70), nb_steps: 11000
weight save, ave reward:100.1413
learner Train 2000, Time: 1.97m, TestReward: 61.38 - 139.78 (ave: 100.14)
actor1 Train 2005, Time: 2.27m, Reward : 40.12 - 108.20 (ave: 80.50), nb_steps: 18000
actor0 Train 2007, Time: 2.28m, Reward : 63.87 - 110.69 (ave: 87.59), nb_steps: 17600
(snip train logs)
learner Train 18000, Time: 16.24m, TestReward: 172.98 - 199.84 (ave: 186.45)
actor1 Train 18000, Time: 16.56m, Reward : 138.00 - 199.65 (ave: 175.77), nb_steps: 130000
actor0 Train 18006, Time: 16.58m, Reward : 139.36 - 199.95 (ave: 180.87), nb_steps: 127200
learner Train 19000, Time: 17.19m, TestReward: 170.52 - 199.90 (ave: 188.21)
actor1 Train 19000, Time: 17.51m, Reward : 109.67 - 199.31 (ave: 172.41), nb_steps: 137200
actor0 Train 19001, Time: 17.52m, Reward : 137.07 - 199.94 (ave: 175.35), nb_steps: 134200
learner Train 20000, Time: 18.13m, TestReward: 172.59 - 189.10 (ave: 184.87)
episode add, reward:199.9639 length: 201 on_memory: 1000
Learning End. Train Count:20001
actor1 Train 20001, Time: 18.47m, Reward : 57.17 - 189.29 (ave: 160.57), nb_steps: 144400
actor0 Train 20001, Time: 18.47m, Reward : 64.01 - 199.96 (ave: 172.06), nb_steps: 141200
learner Train 20001, Time: 18.47m, TestReward: 170.15 - 199.95 (ave: 183.15)
Actor1 End!
actor1 Train 20001, Time: 18.77m, Reward : 142.04 - 199.67 (ave: 172.28), nb_steps: 0
Actor0 End!
actor0 Train 20001, Time: 18.77m, Reward : 136.57 - 199.97 (ave: 184.06), nb_steps: 0
done, took 18.778 minutes
(Learning ends here; the test follows.)
Testing for 5 episodes ...
2020-09-02 21:54:25.537992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
Episode 1: reward: 179.033, steps: 200
Episode 2: reward: 188.682, steps: 200
Episode 3: reward: 170.629, steps: 200
Episode 4: reward: 172.840, steps: 200
Episode 5: reward: 188.490, steps: 200
I hope you find it helpful in troubleshooting.
Thanks, it's very helpful. I want to ask some questions about the output structure.
Why is the output printed after a certain period of time? Does it depend on steps? Which parameter decides when to print the output? What is the meaning of "Time" here?
What is the difference between TestReward and Reward? Why does it change in every environment?
Thank you for your question.
- Why is the output printed after a certain period of time? Does it depend on steps?

This is implemented by callbacks.TrainLogger and callbacks.DisTrainLogger. (Both are created by extending keras-rl's callbacks; "Dis" is short for "Distributed" and is the one agent57 is supposed to use.)
DisTrainLogger has the following arguments, and run_gym_agent57 has equivalent ones:

logger_type
interval (log_interval): its meaning depends on logger_type. If it is TIME, it is the output interval in seconds; if it is STEP, it is the output interval in steps.

The output timing can be changed via these values.
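The TIME/STEP distinction can be sketched roughly as below. This is an illustrative standalone function, not the repository's DisTrainLogger code:

```python
import time

def should_log(logger_type, interval, last_log_time, step_count):
    """Decide whether to emit a log line.

    TIME: log once `interval` seconds have elapsed since the last log.
    STEP: log every `interval` steps.
    """
    if logger_type == "TIME":
        return time.time() - last_log_time >= interval
    if logger_type == "STEP":
        return step_count % interval == 0
    return False
```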
- What is the difference between TestReward and Reward? Why does it change in every environment?

"Reward" is computed from the values obtained while learning, whereas "TestReward" is the value obtained by running the same environment in test mode. In my repository the differences are as follows:
| | Reward | TestReward |
|---|---|---|
| How the action is selected | Follows the action policy (epsilon-greedy, etc.) | Maximum Q value |
| Policy selection (for NGU/Agent57) | Determined by UCB value | 0 (no exploration policy) |
I am trying to run Alien-v0 with 16 actors. My code is like this-
I ran this on my machine, and after about 4 hours the maximum reward is 970. My question is, is this the right way to increase the number of actors? Another question: how many actors can I run on my system, and how do I calculate that number? For example, the paper used 256 actors; how much physical resource would I need to work with 256 actors? Also, what do you think about my selection of epsilon values for each actor? The epsilon for each actor is chosen according to the formula from the paper "Distributed Prioritized Experience Replay" (Ape-X), mentioned in Section 4.1 (Atari).
Great repository! Thanks!