tensorforce / tensorforce

Tensorforce: a TensorFlow library for applied reinforcement learning
Apache License 2.0

Unable to train on OpenAI Gym Envs other than CartPole #851

Closed: Capsar closed this issue 2 years ago

Capsar commented 2 years ago

Hi,

I am currently working with Tensorforce 0.6.5 from GitHub (compatible with TensorFlow 2.7), and I am running into a problem where the standard Tensorforce agent setup is unable to learn on anything other than CartPole.

Current setup:

    self.environment = Environment.create(environment='gym', level=level_id)

    self.agent = Agent.create(
        agent='tensorforce',
        environment=self.environment,  # alternatively: states, actions, (max_episode_timesteps)
        memory=10000,
        update=dict(unit='timesteps', batch_size=64),
        optimizer=dict(type='adam', learning_rate=0.001),
        policy=dict(network='auto'),
        objective='policy_gradient',
        reward_estimation=dict(horizon=20)
    )
    self.runner = Runner(agent=self.agent, environment=self.environment)

With training as follows (is this better or worse than not dividing by `number`?):

    def train(self, epochs, number=10):
        for i in range(number):
            self.runner.run(num_episodes=epochs // number)  # integer division: num_episodes must be an int
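For comparison, the undivided version would presumably be a single call (a sketch; whether splitting training into `number` shorter runs behaves identically depends on what `Runner.run` resets between calls):

    def train_single_run(self, epochs):
        # One uninterrupted run instead of `number` shorter ones; the runner and
        # agent objects are the same either way, only the loop is chunked differently.
        self.runner.run(num_episodes=epochs)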

So the problem is that when level_id is "CartPole-v1" it trains, and after a couple of training runs it achieves a good policy with good returns (reward). But when I try "Acrobot-v1" or "MountainCar-v0", the reward stays at -500 and -200 respectively, which is the worst possible score for both environments, indicating no learning at all.

Could someone help me out so that the other environments also train, or maybe spot the bug? No errors whatsoever occur during training or initialization.

Kind regards, Caspar

Capsar commented 2 years ago

Update:

I have also tried it with DQN:

    self.agent = Agent.create(
        agent='dqn',
        environment=self.environment,
        memory=memory_size,
        batch_size=64,
        network='auto',
        learning_rate=0.001,
        horizon=1,
        discount=0.95
    )

Same environment setup, but the training loop is different:

    total_reward = 0
    for i in range(1, epochs):       
        states = self.environment.reset()
        terminal = False
        while not terminal:
            actions = self.agent.act(states=states)
            states, terminal, reward = self.environment.execute(actions=actions)
            self.agent.observe(terminal=terminal, reward=reward)
            total_reward += reward
        if i % number == 0:  # `number` = logging interval in episodes
            print('episode:', i, 'average reward over last', number, 'episodes:', total_reward / number)
            total_reward = 0

Still no results for MountainCar or Acrobot, but CartPole trains to a reward of 500 within 400 episodes. So I am still looking for a fix, or for someone to help me out here.

Capsar commented 2 years ago

Hi,

I have increased the memory size, and the reward now shows progress, indicating the agent is learning. I will close the issue.
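For context, the change is only the replay memory in the DQN config above, roughly like this (a sketch; the memory value here is a placeholder, the values that actually worked are in the next comment):

    self.agent = Agent.create(
        agent='dqn',
        environment=self.environment,
        memory=100000,  # increased replay memory; placeholder value, see next comment
        batch_size=64,
        network='auto',
        learning_rate=0.001,
        horizon=1,
        discount=0.95
    )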

Kind regards, Caspar

AlexKuhnle commented 2 years ago

Feel free to post your agent config if you solve these environments. Despite appearing very simple, they're not actually straightforward to solve (as compared to CartPole, which is "relatively easy").

Capsar commented 2 years ago

Hi,

OK, for Acrobot-v1, MountainCar-v0 and CartPole-v1, the following setup works:

    self.environment = Environment.create(environment='gym', level=self.env.spec.id)
    network_spec = [
        dict(type='dense', size=64),
        dict(type='dense', size=64),
        dict(type='dense', size=64)
    ]
    # print('states', self.environment.states())
    # print('actions', self.environment.actions())

    self.agent = Agent.create(
        agent='dqn',
        states=self.environment.states(),
        actions=self.environment.actions(),
        max_episode_timesteps=self.env._max_episode_steps,
        memory=memory_size,
        batch_size=batch_size,
        network=network_spec
    )

Sometimes two dense layers of size 64, sometimes three. memory_size was 10000 or 100000 for CartPole and MountainCar, and 200000 for Acrobot; batch_size was 32.
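For completeness, here is a self-contained sketch of a full training run with this setup (level and hyperparameter values are taken from the description above, e.g. Acrobot-v1 with a memory of 200000; `num_episodes` is just an illustrative number):

    from tensorforce import Agent, Environment, Runner

    # Sketch: end-to-end run with the DQN setup described above.
    level_id = 'Acrobot-v1'
    memory_size = 200000  # 10000-100000 worked for CartPole / MountainCar
    batch_size = 32

    environment = Environment.create(environment='gym', level=level_id)

    network_spec = [
        dict(type='dense', size=64),
        dict(type='dense', size=64),
        dict(type='dense', size=64)
    ]

    agent = Agent.create(
        agent='dqn',
        states=environment.states(),
        actions=environment.actions(),
        max_episode_timesteps=environment.max_episode_timesteps(),
        memory=memory_size,
        batch_size=batch_size,
        network=network_spec
    )

    runner = Runner(agent=agent, environment=environment)
    runner.run(num_episodes=1000)  # illustrative; adjust per environment
    runner.close()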