sezan92 / sezan92.github.io


Blog reinforce Discrete method #14

Closed sezan92 closed 1 year ago

sezan92 commented 2 years ago

Objective

The discrete Reinforce method of reinforcement learning has been implemented. The next task is to write a blog post about the Reinforce method. This issue is for working on that.

Tasks

sezan92 commented 2 years ago

Plan

sezan92 commented 2 years ago

Plan the blog

sezan92 commented 2 years ago

How to run

Copy-paste from https://github.com/sezan92/RL_study#reinforce

Previous RL blogs

TODO

sezan92 commented 2 years ago

Why policy gradient methods?

sezan92 commented 2 years ago

Cross-check the advantages and disadvantages of policy gradient methods

TODO

sezan92 commented 2 years ago

WIP (What is the Reinforce method)

sezan92 commented 2 years ago

WIP (What is the Reinforce method)

Reinforce method intuition

Still thinking about a good intuition.

sezan92 commented 2 years ago

What is the Reinforce method

Reinforce method intuition

sezan92 commented 2 years ago

Reinforce method intuition

Let's think of a game: if the action is 1, you know you will get 10 points, and if it is 0, you will get -1 points. So our target is to make the agent assign more probability to action 1.
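To make this concrete, here is a minimal sketch of the idea on that toy game (everything below is illustrative, not from the repo): sample an action from the policy, and use `-log_prob(action) * reward` as the loss, so that gradient descent pushes up the probability of high-reward actions and pushes down the probability of negative-reward ones.

```python
import torch
from torch.distributions import Categorical

# one trainable logit per action; softmax turns them into probabilities
logits = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)
rewards = {0: -1.0, 1: 10.0}  # toy game: action 1 pays 10 points, action 0 pays -1

for step in range(200):
    dist = Categorical(logits=logits)
    action = dist.sample()
    reward = rewards[action.item()]
    # Reinforce-style update: scale the log probability by the reward
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # the probability of action 1 should approach 1
```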

Issue

Solution

sezan92 commented 2 years ago

Code Walkthrough

How to run

Explaining the code.

sezan92 commented 2 years ago

Code walkthrough

sezan92 commented 2 years ago

Code Walkthrough WIP

sezan92 commented 2 years ago

Conclusion

The problem is that the performance here didn't improve; it actually became worse compared to the DQN family. What are the reasons? Both have the same policy and the same epsilon-greedy technique. I suspect the reason is that in DQN we were correctly evaluating the states, but in the case of the Reinforce method we are not getting the values correctly. Also, the Reinforce method is known for high variance. So I think the next best thing is an upgrade to Reinforce: actor-critic methods, DDPG (Deep Deterministic Policy Gradient). Let us check it out.

TODO

sezan92 commented 2 years ago

Reinforce method intuition

sezan92 commented 1 year ago

Reinforce method intuition

sezan92 commented 1 year ago

Reinforce method

sezan92 commented 1 year ago

Update 2022/10/24

sezan92 commented 1 year ago

use this method https://user-images.githubusercontent.com/11025093/181000526-13295938-95ae-4fa4-b372-3bbaf9c90a56.jpeg

sezan92 commented 1 year ago

How to use the code

Policy

```python
class Policy(nn.Module):
    def __init__(self, s_size=4, fc1_size=150, fc2_size=120, a_size=2):
        super(Policy, self).__init__()
        # two hidden layers, then a softmax over the actions
        self.fc1 = nn.Linear(s_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, fc2_size)
        self.final = nn.Linear(fc2_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.final(x)
        return F.softmax(x, dim=1)
```


- The `act` method is a bit different:

```python
    def act(self, state):
        # convert the state to a batched float tensor on the right device
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        # sample an action from the categorical distribution over the action probabilities
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)
```

It returns not only the action, but also the log probability of that action. Why? That comes up later.
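For reference, this is roughly how the policy is used for a single step (a minimal sketch; the `CartPole-v0` environment and the classic Gym API are my assumptions based on the training loop below):

```python
import gym
import torch

# assumes the Policy class above is already defined;
# device mirrors the definition used elsewhere in the notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env = gym.make("CartPole-v0")
policy = Policy(s_size=4, a_size=2).to(device)

state = env.reset()
action, log_prob = policy.act(state)  # sampled action (int) and its log probability
next_state, reward, done, _ = env.step(action)
```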

sezan92 commented 1 year ago

Reinforce

sezan92 commented 1 year ago

Update 2022/10/28

Reinforce code

```python
def reinforce_discrete(
    env,
    policy,
    model_weights_path,
    n_episodes=1000,
    max_t=1000,
    gamma=1.0,
    print_every=100,
    learning_rate=1e-2
):
    scores = []
    # make an optimizer
    optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
    for i_episode in range(1, n_episodes + 1):
        saved_log_probs = []
        rewards = []
        state = env.reset()
        states = [state]
        for t in range(max_t):
            # get the action and its log probability, take a step to get the reward,
            # and save the reward together with the log probability
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            states.append(state)
            if done:
                break
        scores.append(sum(rewards))
        expected_rewards = get_expected_reward(rewards, gamma)
        state_values = get_state_values(rewards)  # state values (not used in the loss below)

        policy_loss = []
        for i, log_prob in enumerate(saved_log_probs):
            # advantage according to the equation; collect the policy loss and backpropagate
            A = expected_rewards[i] - np.mean(expected_rewards)
            policy_loss.append((-log_prob * A).float())
        policy_loss = torch.cat(policy_loss).sum()

        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        if i_episode % print_every == 0:
            print(
                "INFO: Episode {}\tAverage Score: {:.2f}".format(
                    i_episode, np.mean(scores[-print_every:])
                )
            )
        if np.mean(scores[-print_every:]) >= 195.0:
            print(
                "INFO: Environment solved in {:d} episodes!\tAverage Score: {:.2f}".format(
                    i_episode - 100, np.mean(scores[-print_every:])
                )
            )
            break
    print(f"INFO: Saving the weights in {model_weights_path}")
    torch.save(policy.state_dict(), model_weights_path)
    return scores
```
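The helpers `get_expected_reward` and `get_state_values` live in the RL_study repo and are not shown in this walkthrough. For context, here is a minimal sketch of what `get_expected_reward` might look like, assuming it returns the discounted return-to-go for each timestep; this is my reading of how it is used above, not the repo's actual implementation:

```python
import numpy as np


def get_expected_reward(rewards, gamma):
    # assumed behavior: discounted return-to-go G_t = sum_{k>=t} gamma^(k-t) * r_k
    # computed for every timestep t of the episode
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```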
sezan92 commented 1 year ago

Update 2022/11/01

sezan92 commented 1 year ago

Update 2022/11/03

Show the score plot with the same number of epochs

reinforce_with_plot
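For reference, a minimal sketch of how such a score plot could be produced from the `scores` list that `reinforce_discrete` returns (matplotlib and the `plot_scores` name are my additions, not from the repo):

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_scores(scores, out_path="reinforce_scores.png"):
    # scores: list of per-episode total rewards returned by reinforce_discrete
    plt.plot(np.arange(1, len(scores) + 1), scores)
    plt.xlabel("Episode")
    plt.ylabel("Score")
    plt.title("Reinforce (discrete) scores per episode")
    plt.savefig(out_path)
```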

sezan92 commented 1 year ago

Update 2022/11/03

sezan92 commented 1 year ago

https://github.com/sezan92/RL_study/issues/42

sezan92 commented 1 year ago

Update 2022/11/04

Comparison

The previous models were more stable and had better scores. Reinforce didn't improve the score. Why? [I don't know myself yet. Need to learn.]

How can we improve?

sezan92 commented 1 year ago

Update 2022/11/10

sezan92 commented 1 year ago

Update 2022/11/14

sezan92 commented 1 year ago

Update 2022/11/25

Also, there are some limitations associated with the REINFORCE algorithm:

- The update process is very inefficient. We run the policy once, update once, and then throw away the trajectory.
- The gradient estimate is very noisy. There is a possibility that the collected trajectory may not be representative of the policy.
- There is no clear credit assignment. A trajectory may contain many good/bad actions, and whether or not these actions are reinforced depends only on the final total output.

based on https://towardsdatascience.com/policy-gradient-methods-104c783251e0

sezan92 commented 1 year ago

Update 2022/11/27

sezan92 commented 1 year ago

Update 2022/11/29

TODO

sezan92 commented 1 year ago

Update 2022/12/02

sezan92 commented 1 year ago

Update 2022/12/06

sezan92 commented 1 year ago

Update 2022/12/08

sezan92 commented 1 year ago

Update 2022/12/15

TODO

sezan92 commented 1 year ago

Update 2022/12/16

TODO

sezan92 commented 1 year ago

Update 2022/12/20

TODO

sezan92 commented 1 year ago

Update 2022/12/23

TODO

sezan92 commented 1 year ago

Update 2023/01/17

sezan92 commented 1 year ago

Update 2023/01/18

TODO

sezan92 commented 1 year ago

Update 2023/02/01

sezan92 commented 1 year ago

Update 2023/02/02

sezan92 commented 1 year ago

Update 2023/02/07