Copy-pasted from https://github.com/sezan92/RL_study#reinforce

Check if the name is correct.
In the DQN family (value-based methods) we do not learn the policy directly. Still thinking about a good intuition for learning it directly.

Let's think of a game: if the action is 1, you know you will get 10 points, and if it is 0, you will get -1 points. So our target is to make the agent assign more probability to action 1.
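A minimal sketch of that intuition (my own toy example, not from the repo): start from a uniform softmax policy over the two actions and do a few REINFORCE-style updates; the probability mass shifts toward action 1.

```python
import torch
import torch.nn.functional as F

# Toy REINFORCE update on a single-state, two-action "game":
# action 1 pays +10, action 0 pays -1.
logits = torch.zeros(2, requires_grad=True)  # uniform policy to start
optimizer = torch.optim.SGD([logits], lr=0.1)
rewards = {0: -1.0, 1: 10.0}

for _ in range(200):
    probs = F.softmax(logits, dim=0)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    # REINFORCE loss: scale the negative log-probability by the reward,
    # so high-reward actions get their probability pushed up
    loss = -dist.log_prob(action) * rewards[action.item()]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=0))  # most of the probability mass is now on action 1
```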
The problem is that the performance here didn't improve; it actually became worse compared to the DQN family. What are the reasons? Both use the same policy network and the same epsilon-greedy technique. I suspect that in DQN we were correctly evaluating states, but with the REINFORCE method we are not getting good value estimates. The REINFORCE method is also known for high variance. So I think the next best thing is an upgrade of REINFORCE: actor-critic methods, such as DDPG (Deep Deterministic Policy Gradient). Let us check it out.
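For reference, the standard textbook form of the REINFORCE gradient estimate, with $G_t$ the return from step $t$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

Because it is estimated from single sampled trajectories, the estimate has high variance; subtracting a baseline $b(s_t)$ from $G_t$ (the code below uses the mean return) reduces variance without adding bias.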
The model is the same as the previous one, to make a fair comparison. It is in RL_study/rl/rl/policy.py. Simple architecture:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class Policy(nn.Module):
    def __init__(self, s_size=4, fc1_size=150, fc2_size=120, a_size=2):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, fc2_size)
        self.final = nn.Linear(fc2_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.final(x)
        # Softmax over the action dimension gives action probabilities
        return F.softmax(x, dim=1)
```
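A quick sanity check (my own snippet, not from the repo; assumes the CartPole state size of 4 and uses the imports above):

```python
policy = Policy()
dummy_state = torch.rand(1, 4)   # batch of one CartPole-like state
probs = policy(dummy_state)
print(probs, probs.sum(dim=1))   # two action probabilities that sum to 1
```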
- The `act` method is a bit different:
```python
    # act is another method of the Policy class above
    def act(self, state):
        # Convert the numpy state to a batched float tensor on the right device
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        # Build a categorical distribution over the actions and sample from it
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)
```

It returns not only the action, but also the log probability. Why? That will become clear below.
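Usage looks like this (a sketch; it assumes the classic gym API that the training loop below also uses, and that `device` is defined, e.g. `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")`):

```python
import gym

env = gym.make("CartPole-v0")
policy = Policy().to(device)

state = env.reset()
action, log_prob = policy.act(state)
print(action, log_prob)  # an int action (0 or 1) and a one-element tensor of log pi(a|s)
```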
The training loop:

```python
import numpy as np
import torch
import torch.optim as optim


def reinforce_discrete(
    env,
    policy,
    model_weights_path,
    n_episodes=1000,
    max_t=1000,
    gamma=1.0,
    print_every=100,
    learning_rate=1e-2,
):
    scores = []
    # Make an optimizer
    optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
    for i_episode in range(1, n_episodes + 1):
        saved_log_probs = []
        rewards = []
        state = env.reset()
        states = [state]
        for t in range(max_t):
            # Get the action and log probability, take a step,
            # and save the reward together with the log probability
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            states.append(state)
            if done:
                break
        scores.append(sum(rewards))
        expected_rewards = get_expected_reward(rewards, gamma)
        state_values = get_state_values(rewards)  # computed but not used in the loss below
        policy_loss = []
        for i, log_prob in enumerate(saved_log_probs):
            # Advantage according to the equation; get the policy loss and backpropagate
            A = expected_rewards[i] - np.mean(expected_rewards)
            policy_loss.append((-log_prob * A).float())
        policy_loss = torch.cat(policy_loss).sum()
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        if i_episode % print_every == 0:
            print(
                "INFO: Episode {}\tAverage Score: {:.2f}".format(
                    i_episode, np.mean(scores[-print_every:])
                )
            )
        if np.mean(scores[-print_every:]) >= 195.0:
            print(
                "INFO: Environment solved in {:d} episodes!\tAverage Score: {:.2f}".format(
                    i_episode - 100, np.mean(scores[-print_every:])
                )
            )
            break
    print(f"INFO: Saving the weights in {model_weights_path}")
    torch.save(policy.state_dict(), model_weights_path)
    return scores
```
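`get_expected_reward` and `get_state_values` are helpers from the repo that are not shown here. A minimal sketch of what `get_expected_reward` could look like, assuming it returns the discounted return from each time step (the actual implementation in RL_study may differ):

```python
def get_expected_reward(rewards, gamma):
    # Discounted return G_t from each step t onwards (reward-to-go),
    # computed in a single backward pass over the episode's rewards
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))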
Score plot with the same number of episodes as the previous models:
The previous models were more stable and had better scores. REINFORCE didn't improve the score. Why? [I don't know myself yet. Need to learn more.]
Also, there are some limitations associated with the REINFORCE algorithm:

- The update process is very inefficient. We run the policy once, update once, and then throw away the trajectory.
- The gradient estimate is very noisy. There is a possibility that the collected trajectory may not be representative of the policy.
- There is no clear credit assignment. A trajectory may contain many good/bad actions, and whether these actions are reinforced depends only on the final total return (see the sketch below).
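The credit-assignment point is easy to see in a toy example (my own illustration, not from the repo): weighting every log-probability by the same total return ignores which actions caused which rewards, whereas reward-to-go credits each action only with the rewards that follow it.

```python
rewards = [1.0, 1.0, -5.0]

# Every step gets the same weight, including the two good early actions:
total_return = sum(rewards)                                       # -3.0
# Reward-to-go: each action is weighted only by what comes after it:
rewards_to_go = [sum(rewards[t:]) for t in range(len(rewards))]   # [-3.0, -4.0, -5.0]
```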
based on https://towardsdatascience.com/policy-gradient-methods-104c783251e0
Objective
The discrete REINFORCE method of the reinforcement learning algorithms has been implemented. The next task is to write a blog post about the REINFORCE method. This issue is to track that work.
Tasks