pagand / ORL_optimizer

offline RL optimizer

Project2 #5

Closed jnqian99 closed 2 weeks ago

jnqian99 commented 1 month ago

Initial submission to GitHub:

https://github.com/jnqian99/CORL/tree/main

The two files added are:

https://github.com/jnqian99/CORL/blob/main/algorithms/offline/gru_brac.py

https://github.com/jnqian99/CORL/blob/main/configs/offline/gru_brac/halfcheetah/medium_v2.yaml

I have no GPU right now, so I ran a test with:

num_epochs: 100, num_updates_on_epoch: 100

The results for ReBRAC:

https://wandb.ai/sfu-jnqian/ReBRAC/runs/427c8d31-befa-4fbc-88bb-4cbbd94081d6?nw=nwuserjnqian

For gru_brac: https://wandb.ai/sfu-jnqian/gru_brac/runs/2230ef07-5833-4608-b838-c0b92364727e?nw=nwuserjnqian

It seems gru_brac gains an advantage in normalized_score_mean toward the end.

Con: training speed is slower than ReBRAC.

Next steps: bug fixing, adapting to different time-series steps, cleaning up the code, and adding more documentation.

I also need to test on more Gym configs with more epochs and updates: num_epochs: 1000, num_updates_on_epoch: 1000.

jnqian99 commented 1 month ago

Submitted a Gym-like model:

https://github.com/jnqian99/CORL/tree/main/model

The dynamics model uses a GRU.

@pagand please take a quick look and provide some feedback on if I am on the right track.

pagand commented 1 month ago

@jnqian99 Create a branch on the project GitHub page and push there for future collaboration.

pagand commented 1 month ago

@jnqian99 It is not clear why you have create_from_d4rl or td3_loop_update_step in the code. Try to have a standalone file for the env and do the training in a separate file. What is the reason for using JAX? Push the weights of your trained model to GitHub as well. Almost 1000 lines of code seems oversaturated for a simple task of dynamics modeling.

Look at these references to get some ideas:

  1. https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
  2. https://machinelearningmastery.com/lstm-for-time-series-prediction-in-pytorch/
  3. https://homes.cs.washington.edu/~bboots/files/PSDs.pdf

jnqian99 commented 1 month ago

@pagand I tried but cannot create a branch in model_optimize_vessel. Can you please check? Thanks!

(screenshot attached)

jnqian99 commented 1 month ago

Monday in-person meeting summary.

A few things to consider:

@pagand

pagand commented 1 month ago

Good job @jnqian99. Here is some more of the summary:

Consider a different horizon for the output. For example, iteration 0 with horizon 3: input [s(0), ..., s(i-1), s(i), a(0), ..., a(i), a(i+1), a(i+2), a(i+3)], output [s_hat(i+1), s_hat(i+2), s_hat(i+3)]. Then proceed auto-regressively in the next loop iteration, e.g. iteration 1: input [s(1), ..., s(i), s_hat(i+1), a(1), ..., a(i+1), a(i+2), a(i+3), a(i+4)], output [s_hat(i+2), s_hat(i+3), s_hat(i+4)].
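A minimal sketch of that sliding window, assuming a model that returns all horizon predictions per call (names and shapes are illustrative, not the actual code):

```python
import torch

def autoregressive_rollout(model, states, actions, context_len, horizon=3, n_steps=10):
    """Sliding-window rollout: each window sees the last `context_len` states
    (real or predicted) plus the matching actions, including `horizon` future
    actions, and only the FIRST prediction is fed back auto-regressively.
    states:  (context_len, state_dim)                          observed s(0..i)
    actions: (context_len + horizon + n_steps - 1, action_dim)
    model(s_in, a_in) -> (horizon, state_dim)                  s_hat(i+1..i+horizon)
    """
    history = states.clone()
    preds = []
    for t in range(n_steps):
        s_in = history[-context_len:]                  # iteration-t context
        a_in = actions[t : t + context_len + horizon]  # matching actions incl. future ones
        s_hat = model(s_in, a_in)
        preds.append(s_hat[0])                         # keep only the first prediction
        history = torch.cat([history, s_hat[:1]], dim=0)
    return torch.stack(preds)                          # (n_steps, state_dim)
```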

During testing, since we do not have access to the future actions, we repeat the last action. This is not critical because, as you can see, we only use the first predicted output auto-regressively.

For the reward and done, find the function and just plug in the estimated values. Pro tip: you can use the reward labels to backpropagate and update the state prediction as another task, meaning one of your losses is over the list of predicted states ([s_hat(i+1), s_hat(i+2), s_hat(i+3)]) and the other is over the rewards ([r_hat(i+1), r_hat(i+2), r_hat(i+3)]).
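A hedged sketch of that two-headed loss, assuming the model also emits reward predictions (the interface and weighting below are illustrative):

```python
import torch.nn.functional as F

def dynamics_loss(model, s_in, a_in, s_target, r_target, reward_weight=1.0):
    """Multi-task loss: the predicted states and predicted rewards share the
    same backbone, so the reward labels also shape the state features."""
    s_hat, r_hat = model(s_in, a_in)           # (horizon, state_dim), (horizon,)
    state_loss = F.mse_loss(s_hat, s_target)   # [s_hat(i+1..i+3)] vs ground truth
    reward_loss = F.mse_loss(r_hat, r_target)  # [r_hat(i+1..i+3)] vs dataset rewards
    return state_loss + reward_weight * reward_loss
```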

Try curriculum learning for your training: start from 10-20 auto-regressive loops in the early epochs and then gradually increase to more steps (capped at 1000).
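One possible schedule for that curriculum (the breakpoints are illustrative):

```python
def rollout_length(epoch, total_epochs, start=20, cap=1000):
    """Curriculum: short auto-regressive rollouts in the early epochs,
    growing roughly linearly up to the cap of 1000 steps."""
    frac = epoch / max(total_epochs - 1, 1)
    return min(cap, int(start + frac * (cap - start)))
```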

If you can train your model to have bounded error for 1000-step auto-regressive prediction with the above considerations, you do not need to apply the next method (paper 2).

Since we do not have access to the environment or any measurements (the simulator is completely detached), we can't use filtering methods like the Kalman filter. Instead, we can use adjusted-error methods like the one in paper 2 to avoid error accumulation.

jnqian99 commented 1 month ago

Brief summary of the work done during the last couple of days. @pagand

I find that using s_0 to s_i with a_0 to a_i to predict s_i+1 to s_i+3 works: training reaches an MSE loss of <3, and the predictions also stay bounded during evaluation (MSE <3).

However, adding a_i+1 to a_i+3 also works during training, reaching an even lower MSE loss, but does not work during evaluation, where the MSE loss goes above 10.

I tried adding a_i+1 to a_i+3 as the last few sequence elements fed to the LSTM network. I also tried concatenating a_i+1 to a_i+3 to the LSTM output as input to the final dense layer. Both give similar results.
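A minimal sketch of the two variants, assuming an LSTM backbone (all names and dimensions are illustrative, not the actual code):

```python
import torch
import torch.nn as nn

class DynamicsLSTM(nn.Module):
    """Illustrative dynamics model with two ways to inject a(i+1..i+horizon)."""

    def __init__(self, state_dim, action_dim, hidden=256, horizon=3,
                 future_mode="dense"):  # "sequence" or "dense"
        super().__init__()
        self.horizon = horizon
        self.future_mode = future_mode
        self.lstm = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)
        extra = horizon * action_dim if future_mode == "dense" else 0
        self.head = nn.Linear(hidden + extra, horizon * state_dim)

    def forward(self, states, actions, future_actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        # future_actions: (B, horizon, action_dim)
        x = torch.cat([states, actions], dim=-1)
        if self.future_mode == "sequence":
            # Variant 1: append the future actions as extra timesteps,
            # with zero-padded state slots since those states are unknown.
            pad = states.new_zeros(states.size(0), self.horizon, states.size(-1))
            x = torch.cat([x, torch.cat([pad, future_actions], dim=-1)], dim=1)
        h, _ = self.lstm(x)
        h = h[:, -1]  # last LSTM output
        if self.future_mode == "dense":
            # Variant 2: concatenate the flattened future actions to the LSTM
            # output before the final dense layer.
            h = torch.cat([h, future_actions.flatten(1)], dim=-1)
        return self.head(h).view(-1, self.horizon, states.size(-1))
```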

jnqian99 commented 1 month ago

For the reward, I think it is theoretically possible to have the function. In reality, though, the reward value seems to be determined by the MuJoCo physics model and is hard to compute from just the state and action I have. @pagand

https://www.gymlibrary.dev/environments/mujoco/half_cheetah/
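For reference, the page above documents the reward as a forward-velocity term minus a control cost. A hedged approximation, assuming the predicted observation keeps the default layout (index 8 is the x-velocity of the front tip) and the default weights:

```python
import numpy as np

def halfcheetah_reward(obs_next, action, ctrl_cost_weight=0.1, forward_weight=1.0):
    """Approximate HalfCheetah reward from a predicted next observation,
    avoiding any call into MuJoCo. Assumes obs_next[8] is the x-velocity."""
    forward_reward = forward_weight * obs_next[8]
    ctrl_cost = ctrl_cost_weight * np.sum(np.square(action))
    return forward_reward - ctrl_cost
```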

jnqian99 commented 1 month ago

Because there is still no effective way to incorporate a_i+1 to a_i+3, I tried out_state_number=1, which means only one output state is predicted each time, so the output structure is (batch_num, future_number, state_dim).
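A minimal sketch of this single-step rollout, assuming a model that predicts one next state per call (names are illustrative):

```python
import torch

@torch.no_grad()
def rollout_single_step(model, states, actions, future_num):
    """Auto-regressive rollout with out_state_number=1: each call predicts a
    single next state, which is appended to the context for the next call.
    states:  (B, T, state_dim)                 seed context
    actions: (B, T + future_num, action_dim)   dataset actions
    returns: (B, future_num, state_dim)
    """
    context = states
    preds = []
    for _ in range(future_num):
        a_ctx = actions[:, : context.size(1)]            # actions aligned with context
        s_next = model(context, a_ctx)                   # (B, state_dim)
        preds.append(s_next)
        context = torch.cat([context, s_next.unsqueeze(1)], dim=1)
    return torch.stack(preds, dim=1)
```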

The results for future_num=950 can be seen here

https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/4a564787-e08c-42dc-924c-0262f1392bd2

I used curriculum training, starting from future_num=20, then going to 50, 200, 500, and 950.

The results actually seem bounded. I also tried printing out the state predictions, and they seem reasonably close to the actual values.

@pagand

jnqian99 commented 1 month ago

I separated the dataset into 90% training and 10% evaluation, predicted 980 future steps, and the results shown below are still consistent with the previous ones. @pagand

https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/402213d1-b027-4c63-bf64-07584986695a
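A short sketch of a trajectory-level 90/10 split of this kind (assuming trajectories are already grouped; names are illustrative):

```python
import numpy as np

def split_trajectories(trajectories, train_frac=0.9, seed=0):
    """Split whole trajectories (not individual transitions) so that the
    evaluation rollouts never overlap with the training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    train = [trajectories[i] for i in idx[:n_train]]
    eval_ = [trajectories[i] for i in idx[n_train:]]
    return train, eval_
```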

jnqian99 commented 1 month ago

When I updated sequence_num from 5 to 3, the results are still bounded:

https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/5277bb83-1244-4903-8212-b4ecf9a057c6

@pagand

pagand commented 4 weeks ago

@jnqian99 That sounds good. Some clarification questions:

  1. What do you mean by adding a_i+1 to a_i+3? Each iteration you need to add one new action and the previous first predicted value.
  2. What do you mean by "adding a_i+1 to a_i+3 as concatenation to the output of the LSTM"?
  3. Is the physics model in MuJoCo fully observed in the state? What is the issue? Did you try having the reward as another label?
  4. I can't see any of your wandb runs; either make them public or add me to the project group (pagand@sfu.ca).
  5. What do you mean by "there is still no effective way to incorporate a_i+1 to a_i+3 into the scene"?
  6. There should be some way to predict more states each time by having a list of predictions instead of a single value, and the structure (batch_num, future_number, state_dim) sounds reasonable.
  7. That's really good, so it means only 3 previous states were enough to predict the state and reward? And can you find the R2 score for the test dataset?
  8. Are you using the offline dataset?

See you on Monday.

jnqian99 commented 4 weeks ago

@pagand

What do you mean by adding a_i+1 to a_i+3? Each iteration you need to add one new action and the previous first predicted value.

s_0 to s_i and a_0 to a_i are concatenated and fed into the LSTM. However, I don't know s_i+1 to s_i+3, so I just use zeros for s_i+1 to s_i+3 and concatenate them with a_i+1 to a_i+3. I tried that and it works OK during training (even better than not using a_i+1 to a_i+3). However, during evaluation, when I use a_i to replace a_i+1 to a_i+3, the error blows up.
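A small sketch of this input construction, with zero-padded future state slots and the last action repeated at evaluation time (names are illustrative):

```python
import torch

def build_window(states, actions, horizon=3, eval_mode=False):
    """states:  (T, state_dim) = s(0..i)
    actions: (T + horizon, action_dim) during training, (T, action_dim) at eval.
    Unknown future states are zero-padded; unknown future actions at eval time
    are filled by repeating a(i)."""
    T = states.size(0)
    if eval_mode:
        future_a = actions[-1:].repeat(horizon, 1)       # repeat a(i)
    else:
        future_a = actions[T : T + horizon]              # true future actions
    pad_s = states.new_zeros(horizon, states.size(-1))   # placeholder future states
    seq_s = torch.cat([states, pad_s], dim=0)
    seq_a = torch.cat([actions[:T], future_a], dim=0)
    return torch.cat([seq_s, seq_a], dim=-1)             # (T + horizon, state+action dim)
```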

what do you mean "adding a_i+1 to a_i+3 as concatenation to the output to LSTM "?

see above

Is the physics model in MuJoCo fully observed in the state? What is the issue? Did you try having the reward as another label?

For MuJoCo, I think the reward is a defined function of s, a, s'. However, s' is also a function of s and a in MuJoCo, so I don't know how to plug in my own s' to get the reward, and therefore I didn't add the reward as an additional label.

I can't see any of your wandb runs; either make them public or add me to the project group (pagand@sfu.ca).

Sorry about wandb; I uploaded to the wrong account, and it should be fixed now. The sequence_num=3 run can be seen here: https://wandb.ai/jnqian/ORL_optimizer/runs/6c79618b-9070-4171-b0e6-89b676b2b4a6

I also did a sequence_num=1 run, which is still bounded but not as tightly bounded as sequence_num=3: https://wandb.ai/jnqian/ORL_optimizer/runs/bf304b2f-8b50-43a5-8632-78602b48cd2d

what do you mean "there is still no effective way to incorporate a_i+1 to a_i+3 into the scene"

see above

There should be some way to predict more states each time by having a list of predictions instead of a single value, and the structure (batch_num, future_number, state_dim) sounds reasonable.

I don't know how to do that right now.

That's really good, so it means only 3 previous states were enough to predict the state and reward? And can you find the R2 score for the test dataset?

see the wandb run above; the mean MSE is around 0.72 for sequence_num=3. (A sketch for computing R2 is at the end of this comment.)

Are you using the offline dataset?

Yes, I am using the halfcheetah-medium-v2 offline dataset.
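On the R2 question, a minimal sketch of computing it on the held-out rollouts, assuming stacked arrays of predicted and true states (names are illustrative):

```python
import numpy as np

def r2_score(pred, target):
    """R^2 over the evaluation set, averaged across state dimensions.
    pred, target: (N, state_dim) arrays of predicted / true states."""
    ss_res = np.sum((target - pred) ** 2, axis=0)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))
```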

pagand commented 2 weeks ago

@jnqian99 Please update the issue title. Isn't this one finished? If yes, change it to Done.