Closed JannesKlaas closed 7 years ago
Hey,
First, thanks for reporting the issue. The first problem you encounter is most likely due to a bug in our current implementation of TRPO with multiple actions, which should hopefully be fixed in the next 1-2 days. I'll let you know.
I'm not sure about the second exception you get -- what are you redefining afterwards such that it works? Anyway, the problem with min_value and max_value you mention afterwards is a general limitation at the moment. Although the feature, which I think makes sense in general, is already supported by the action interface, it is not yet supported by the action distributions etc., so it is essentially just ignored. This is because we currently only provide a Gaussian distribution for continuous actions, which does not naturally define min/max values. Does it nevertheless work if you simply ignore the out-of-bound values?
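In the meantime, if you need bounded actions, one option is to clip the sampled values yourself after calling act(). A minimal, untested sketch -- the bounds and the example action dict are just placeholders based on the output you posted:

import numpy as np

# Hypothetical interim workaround: since min_value/max_value are currently
# ignored for Gaussian actions, clip the sampled actions to the intended
# bounds on the caller side.
MIN_VALUE, MAX_VALUE = 0.0, 1.0  # placeholder bounds

actions = {'opt_a': 0.28892395, 'opt_b': -0.10657883}  # e.g. what agent.act(state) returned
clipped = {name: float(np.clip(value, MIN_VALUE, MAX_VALUE))
           for name, value in actions.items()}
print(clipped)  # {'opt_a': 0.28892395, 'opt_b': 0.0}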
Regarding the second issue, I looked at it further: Configuration objects are changed when they are used to create an agent. This makes them unusable when creating the next agent. Here is how the issue shows itself:
# Imports (module paths as in the tensorforce version used here; they may differ in other releases)
import numpy as np
from tensorforce import Configuration
from tensorforce.agents import TRPOAgent
from tensorforce.core.networks import layered_network_builder

# Define config
config = Configuration(
    batch_size=100,
    states=dict(shape=(10,), type='float'),
    actions=dict(continuous=False, num_actions=2),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)

# Define first agent (works)
agent = TRPOAgent(config=config)

# Define second agent (also works)
agent1 = TRPOAgent(config=config)

# Define state
state = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# First agent acts (works)
agent.act(state)

# Second agent acts (crashes)
agent1.act(state)
I looked into the agent code and I think I found the issue: the code that creates the agent modifies the configuration object passed in. Before the agent is created, print(config) outputs:
{actions={continuous=False, num_actions=2}, states={type=float, shape=(10,)}, batch_size=100, network=<function layered_network_builder.<locals>.network_builder at 0x111b8d598>}
after
agent = TRPOAgent(config=config)
print(config) outputs:
{device=None, cg_iterations=20, optimizer=None, cg_damping=0.001, log_level=info, network=<function layered_network_builder.<locals>.network_builder at 0x111b8d598>, global_model=False, exploration=None, normalize_advantage=False, max_kl_divergence=0.001, preprocessing=None, discount=0.97, states={state={type=float, shape=(10,)}}, session=None, distributed=False, line_search_steps=20, batch_size=100, actions={action={continuous=False, num_actions=2}}, tf_summary=None, learning_rate=0.0001, generalized_advantage_estimation=False, tf_saver=False, baseline=None, gae_lambda=0.97, override_line_search=False}
At several points, the agent class directly modifies the config object passed in. This causes problems when the config is reused later. A better approach would probably be to create a copy of the config before modifying it.
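As an interim workaround on the caller side, passing a copy keeps the original config reusable. A rough, untested sketch (make_agent is just a hypothetical helper, and it assumes Configuration objects can be deep-copied):

import copy

def make_agent(agent_cls, config):
    # Pass a deep copy so agent construction never mutates the caller's
    # Configuration, which then stays reusable for further agents.
    return agent_cls(config=copy.deepcopy(config))

agent = make_agent(TRPOAgent, config)
agent1 = make_agent(TRPOAgent, config)  # same config object, still unmodified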
First, the problem with TRPO should be fixed now. Second, you're absolutely right, this is unexpected behavior, and what you suggest seems like a good solution. I will open an issue to track the config and the min/max value problem, and will close this one.
Hi, first of all, thanks for the hard work that is going into this project. You are saving me a ton of work. Second, I encountered some strange behavior when trying to define an agent with multiple continuous actions. All code below was run in a Jupyter notebook with Anaconda and Python 3.5:
This code crashes with the trace:
I tried different agents and encountered another strange behavior:
Crashes with:
But when I redefine config, that is, I run
again, it does not crash, but it occasionally outputs negative values for actions even though min_value = 0, for example:
{'opt_a': 0.28892395, 'opt_b': -0.10657883}
The PPO agent displays the same behavior as the VPG agent. I have tried this with many slightly different configurations; it seems to be a consistent issue. Please let me know if you need any more code, info, or data to reproduce it. Kindly, Jannes