ocram444 / EldenRL

Reinforcement Learning for Elden Ring on Windows 11
GNU General Public License v3.0

Order of env.step() #2

Open svarner9 opened 6 days ago

svarner9 commented 6 days ago

Hello,

I am running the code and trying to make some improvements. One thing I came across and am questioning is the order of operations in the env.step() function. I am curious why the observation and reward are obtained before the action is taken.

From my understanding, the agent makes a decision which is the 'action' argument that is passed to env.step(action). Then based on this action, it should expect to get back a reward. It seems right now that the action->reward->action->reward cycle is out of sync, since the reward is calculated before the current action is taken.

Is there a reason that it was done this way?

Thank you so much for starting development on this! I look forward to your response :)

Best, Sam

ocram444 commented 6 days ago

Hi Sam,

Yes, that's true. The reason we collect the observation and calculate the reward before taking the action is that the reward always counts for the previous step, so we need to calculate it before taking the current step's action.

In reality, the action->reward->action->reward cycle is action->next observation->previous reward->action->... We wanted to give the game some time to process the step instead of calculating its reward immediately. While it is possible to take the action first and calculate its reward right away, that approach would either delay the observation handed to the model for its next decision by the time it takes to calculate the reward, or the reward would have to be calculated for the previous step, as it is now.
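
For illustration, a minimal sketch of a step() with this delayed-reward ordering (not the repo's actual code; the class and helper names are hypothetical):

```python
import gymnasium as gym

class EldenEnv(gym.Env):
    def step(self, action):
        # 1. Observe the game state produced by the *previous* action.
        observation = self._grab_screen()            # hypothetical helper
        # 2. Score that previous step from the fresh observation.
        reward = self._compute_reward(observation)   # hypothetical helper
        done = self._is_done(observation)            # hypothetical helper
        # 3. Only now send the current action to the game; its effect is
        #    observed and rewarded on the next call to step().
        self._send_inputs(action)                    # hypothetical helper
        return observation, reward, done, False, {}
```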

Thank you so much for your interest in the development! I look forward to any further questions or feedback you may have.

Best regards, Marco


svarner9 commented 5 days ago

Hi Marco,

Thank you for your quick reply!

I definitely understand the need to have a delay between taking the action, and making the observation (and calculating the reward). However, I am just concerned that the model does not know that there is a step delay between the action it takes and the observation/reward it receives.

It looks like the env.step(action) is called internally during the model.learn() process, which means that model.learn() is assuming by default that the action it gives to .step() corresponds to the reward and state returned (or at least I think so).

I looked at some of the default gym environments to see if this was the case, and I think it is indeed. Here is the CartPole example: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py. In their .step() function you can see that the action is taken at the beginning, then the state is updated, then the reward is calculated. They don't have to wait at all between the action and the observation because they don't have any real-time dynamics; they are just integrating forward with simple kinematics.

I am thinking maybe the order in our step function should be something like this instead:

  1. Take action
  2. Wait for ~0.5 seconds
  3. Record observation
  4. Calculate reward
  5. Return current state and reward
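
As a rough sketch of that reordering (again with hypothetical helper names, not the repo's actual code):

```python
import time

def step(self, action):
    self._send_inputs(action)                     # 1. take the action
    time.sleep(0.5)                               # 2. give the game time to react
    observation = self._grab_screen()             # 3. record the observation
    reward = self._compute_reward(observation)    # 4. reward now matches this action
    done = self._is_done(observation)
    return observation, reward, done, False, {}   # 5. return current state and reward
```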

What are your thoughts on this? Do you know if the model is correctly interpreting the current setup with the observation and reward applying to the action from the previous step?

Finally, is there a good place to chat more about ideas? I joined the EldenBot discord server, but I was wondering if this repo has a discord server as well?

Thanks so much!

Best, Sam

ocram444 commented 5 days ago

Hi Sam,

The approach we're using is quite standard for OpenAI Gym and reinforcement learning in general. The reward typically corresponds to the action taken in the previous step, which is why it's calculated first before taking the current action. This method ensures the environment processes the step and updates its state correctly before moving on.

I think your suggested approach can work well for real-time environments, and it’s a valid alternative. However, calculating the reward for the previous step is a common practice in reinforcement learning. This ensures a smoother transition and accurate processing within the environment (At least ChatGPT says so).

For further discussion and more ideas, the EldenBot Discord server is the right place. The community there is very helpful and knowledgeable about this and the other projects.

Best regards, Marco


svarner9 commented 5 days ago

Okay, I see. I'm fairly new to this, so I'm still learning some things. I found this post, which further supports the validity of delayed rewards: https://ai.stackexchange.com/questions/12551/openai-gym-interface-when-reward-calculation-is-delayed-continuous-control-wit

I have been training a model for some time now and improvement seems to either be very slow or altogether nonexistent. I have been thinking of some ideas to improve the learning, and the most reasonable idea so far is to seed a model with some successful trajectories to begin with.

Namely, instead of allowing the model to generate random moves that create reinforcement, I would just like to feed a prescribed set of moves (say from a boss run that I have done, where I recorded my inputs and captured the environment every half-second) into the model as well as the corresponding rewards. This way the model starts already with some good examples.

Do you know if there is a simple way to train the PPO model (or maybe some other model that can be later restarted as PPO) with predetermined training data? The main idea is that the search for an optimal strategy will be faster since the model will be trained on some attempts that were optimal, before becoming fully self sufficient and progressing with the current PPO training.

Best, Sam

ocram444 commented 4 days ago

Hi Sam,

Yes, training in reinforcement learning is known to take a very long time. That's why it's always parallelized in projects where that's possible, like running 1000 instances of the game to train simultaneously. That's obviously not possible with Elden Ring, though.

The training does work, though, and when training on simple bosses like the Beastman of Farum Azula, Patches, or Mad Pumpkin Head, the agent did eventually beat them in my test runs. Somewhere on the Discord there are training result screenshots, and you can look into TensorBoard logging to visualize the rewards and episode lengths if you want the hard data.
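
For example (not the repo's exact setup; the policy name and log directory are illustrative), stable-baselines3 can write those metrics for TensorBoard like this:

```python
from stable_baselines3 import PPO

# `env` is the custom Elden Ring environment instance.
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="./eldenrl_tb/")
model.learn(total_timesteps=50_000)
# Then inspect episode reward and length curves with:
#   tensorboard --logdir ./eldenrl_tb/
```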

Regarding the semi-supervised learning approach: another project was shared on the EldenBot Discord server some time ago. I forget the name, but it pre-programmed optimal moves, used sound cues and visuals to determine which move the boss was performing, and then had an RL agent choose one of the pre-programmed moves in response. That seemed to work well for specific bosses, but it takes a lot of work and will only ever work for bosses you specifically adapt the codebase to.

Our reinforcement learning approach is supposed to be more general, with the agent being able to choose from basic inputs as actions and learn strategies it can apply to multiple bosses and situations.

Best regards, Marco


svarner9 commented 4 days ago

Okay, that makes sense. I believe I did see the project you are referring to.

I don't really want to pre-program the optimal moves; rather, I want the framework to remain exactly the same, but feed it successful runs (at least in the beginning) before switching to fully unsupervised training. The current training data is generated by the agent taking random moves and then collecting observations and rewards. I can produce the same type of training data from a successful run of my own and feed it in as if the agent had generated it itself. I am having trouble figuring out how to do this with the PPO model, though; it doesn't seem to have an option for user-supplied training data. Maybe there is a way to modify model.learn() to take actions from a file instead of generating them from the current policy. That is my next thing to try out.
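
A rough sketch of that idea, assuming recorded (observation, action) pairs and using only standard stable-baselines3/PyTorch calls (load_demos() and the file name are hypothetical), would be a behavioral-cloning warm start before the usual model.learn():

```python
import torch
from stable_baselines3 import PPO

model = PPO("CnnPolicy", env, verbose=1)

# Hypothetical loader for observations and actions recorded during my own boss run.
obs_demo, act_demo = load_demos("recorded_boss_run.npz")
obs_t = torch.as_tensor(obs_demo, dtype=torch.float32).to(model.device)
act_t = torch.as_tensor(act_demo).to(model.device)

optimizer = torch.optim.Adam(model.policy.parameters(), lr=3e-4)
for epoch in range(10):
    # Behavioral cloning: maximize the policy's log-probability of the demonstrated actions.
    _, log_prob, _ = model.policy.evaluate_actions(obs_t, act_t)
    loss = -log_prob.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After the warm start, continue with ordinary PPO training.
model.learn(total_timesteps=100_000)
```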

I have been trying to train on the first mini boss in the DLC (Blackgaol Knight), and the agent ends up dying within the first 2-3 seconds most of the time, which is probably a big reason why the training is so stagnant.

I am on the discord server but I don't see anything there, all of the channels are empty for me. Maybe I can't see the history because I just recently joined?

Best, Sam

ocram444 commented 3 days ago

Hey,

Yes, the agent's character should be appropriately leveled so it doesn't get one-shot by the boss. The Discord server should work: https://discord.gg/TKaHrukq

Best, Marco
