Closed jnqian99 closed 2 weeks ago
Submitted a gym-like model:
https://github.com/jnqian99/CORL/tree/main/model
The dynamics model uses a GRU.
@pagand Please take a quick look and give some feedback on whether I am on the right track.
@jnqian99 Create a branch in the project GitHub page and push there for future collaboration.
@jnqian99 It is not clear why you have create_from_d4rl or td3_loop_update_step in the code.
Try to have a standalone file for the env and do the training in a separate file.
What is the reason for using JAX?
Push the weights of your trained model to GitHub as well.
Almost 1000 lines of code seems excessive for the simple task of dynamics modeling.
Look at these references to get some ideas:
1. https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
2. https://machinelearningmastery.com/lstm-for-time-series-prediction-in-pytorch/
3. https://homes.cs.washington.edu/~bboots/files/PSDs.pdf
@pagand I tried but cannot create a branch in model_optimize_vessel. Can you please check? Thanks!
Monday in-person meeting summary.
A few things to consider:
@pagand
Good job @jnqian99! Here is some more summary:
Consider different horizons for the output. Example, iteration 0 with horizon 3: input [s(0), ..., s(i-1), s(i), a(0), ..., a(i), a(i+1), a(i+2), a(i+3)], output [s_hat(i+1), s_hat(i+2), s_hat(i+3)]. Then proceed auto-regressively in the next loop; example, iteration 1: input [s(1), ..., s(i), s_hat(i+1), a(1), ..., a(i+1), a(i+2), a(i+3), a(i+4)], output [s_hat(i+2), s_hat(i+3), s_hat(i+4)].
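A minimal sketch of this rollout loop. All names (`history_len`, `horizon`, the stand-in `model`) are illustrative, not from the repo; the point is only how the windows shift and how the first prediction is fed back.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 2, 1
history_len, horizon, total_steps = 4, 3, 8

states = [rng.standard_normal(state_dim) for _ in range(history_len)]
actions = [rng.standard_normal(action_dim) for _ in range(total_steps)]

def model(state_window, action_window):
    # stand-in for the GRU/LSTM: returns `horizon` predicted states
    return [state_window[-1] + 0.1 * k for k in range(1, horizon + 1)]

for it in range(2):  # iteration 0 and 1 from the example above
    s_win = states[it : it + history_len]              # s(it) ... s(i)
    a_win = actions[it : it + history_len + horizon]   # a(it) ... a(i+horizon)
    preds = model(s_win, a_win)
    # only the FIRST predicted state is fed back auto-regressively,
    # so iteration 1's state window contains s_hat(i+1)
    states.append(preds[0])
```

After two iterations the state list has grown by two predicted states, each consumed by the next window.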
During testing, since we do not have access to future actions, we repeat the last action; but this is not important because, as you can see, we only use the first predicted output auto-regressively.
For the reward and done, find the function and just plug in the estimated values. Pro tip: you can use the reward labels to backpropagate and update the state as another task, meaning one of your losses is on the list of predicted states ([s_hat(i+1), s_hat(i+2), s_hat(i+3)]) and the other is on your rewards ([r_hat(i+1), r_hat(i+2), r_hat(i+3)]).
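A minimal sketch of such a two-headed loss; the `reward_weight` and shapes are illustrative assumptions, not values from the repo.

```python
import numpy as np

def multi_task_loss(s_hat, s_true, r_hat, r_true, reward_weight=0.5):
    # one MSE term on the predicted state rollout, one on the predicted rewards
    state_loss = np.mean((s_hat - s_true) ** 2)
    reward_loss = np.mean((r_hat - r_true) ** 2)
    return state_loss + reward_weight * reward_loss

s_hat = np.zeros((3, 2)); s_true = np.ones((3, 2))  # horizon 3, state_dim 2
r_hat = np.zeros(3);      r_true = np.ones(3)
loss = multi_task_loss(s_hat, s_true, r_hat, r_true)
```

In a real training loop both heads would share the recurrent backbone so the reward labels also shape the state representation.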
Try curriculum learning for your training: start with 10-20 auto-regressive loops in the early epochs and then gradually increase to more steps (capped at 1000).
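One possible schedule, purely illustrative (the linear ramp and the function name are assumptions):

```python
def rollout_horizon(epoch, total_epochs, start=20, cap=1000):
    # linearly grow the auto-regressive horizon from `start` to `cap`
    frac = epoch / max(total_epochs - 1, 1)
    return min(cap, int(start + frac * (cap - start)))

horizons = [rollout_horizon(e, 10) for e in range(10)]
```

A staircase schedule (e.g. 20 → 50 → 200 → 500 → 950, as used later in this thread) works the same way; only the interpolation differs.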
If you can train your model to have bounded error over 1000 steps of auto-regressive prediction with the above considerations, you do not need to apply the next method (paper 2).
Since we do not have access to the environment or any measurements (the simulator is completely detached), we cannot use filtering methods like the Kalman filter. So we can use adjusted-error methods like the one we have in paper 2 to avoid accumulating error.
Brief summary of the work done during the last couple of days @pagand
I find that using s_0 to s_i with a_0 to a_i to predict s_i+1 to s_i+3 works during training, reaching an MSE loss < 3, and the error stays bounded during evaluation (MSE < 3).
However, also adding a_i+1 to a_i+3 reaches a lower MSE loss during training, but does not work during evaluation, where the MSE loss is > 10.
I tried adding a_i+1 to a_i+3 as the last few steps of the LSTM input sequence. I also tried concatenating a_i+1 to a_i+3 to the LSTM output, as input to the last Dense layer. Both give similar results.
For the reward, I think it is theoretically possible to have the function. But in practice, the reward value seems to be determined by the physics model in MuJoCo and is hard to determine from the states and actions I have. @pagand
https://www.gymlibrary.dev/environments/mujoco/half_cheetah/
Because there is still no effective way to incorporate a_i+1 to a_i+3 into the scene, I tried out_state_number=1, which means only 1 output state is predicted each time, so the output structure is (batch_num, future_num, state_dim).
The results for future_num=950 can be seen here
https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/4a564787-e08c-42dc-924c-0262f1392bd2
I used curriculum training, starting from future_num=20 and then going to 50, 200, 500, and 950.
The results seem actually bounded. I also tried printing out the state predictions, and they seem reasonably close to the actual values.
@pagand
I separated the dataset into 90% training and 10% evaluation to predict 980 future steps, and the results shown below are still consistent with the previous ones. @pagand
https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/402213d1-b027-4c63-bf64-07584986695a
When I updated sequence_num from 5 to 3, the results are still bounded:
https://wandb.ai/sfu-jnqian/ORL_optimizer/runs/5277bb83-1244-4903-8212-b4ecf9a057c6
@pagand
@jnqian99 That sounds good. Some clarification questions:
See you on Monday.
@pagand
What do you mean by adding a_i+1 to a_i+3? Each iteration you need to add one new action and the previous first predicted value.
So s_0 to s_i and a_0 to a_i are concatenated and fed into the LSTM. However, I don't know s_i+1 to s_i+3, so I just use zeros for s_i+1 to s_i+3, which are concatenated with a_i+1 to a_i+3. I tried that and it works OK during training (even better than not using a_i+1 to a_i+3). However, during evaluation, when I replace a_i+1 to a_i+3 with a_i, the error blows up.
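A small sketch of this input construction, with zero placeholders standing in for the unknown future states; all shapes and names here are illustrative, not from the actual code.

```python
import numpy as np

state_dim, action_dim, i, horizon = 2, 1, 4, 3
states = np.ones((i + 1, state_dim))              # known s(0) ... s(i)
actions = np.ones((i + 1 + horizon, action_dim))  # a(0) ... a(i+3)

# known part of the sequence: each row is [s(t), a(t)]
known = np.concatenate([states, actions[: i + 1]], axis=1)
# future part: zero placeholders for s(i+1..i+3) paired with a(i+1..i+3)
future = np.concatenate(
    [np.zeros((horizon, state_dim)), actions[i + 1 :]], axis=1
)
seq = np.concatenate([known, future], axis=0)     # full LSTM input sequence
```

The resulting sequence has `i + 1 + horizon` time steps of width `state_dim + action_dim`, with the last `horizon` rows zeroed in the state columns.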
What do you mean by "adding a_i+1 to a_i+3 as concatenation to the output of the LSTM"?
see above
The physics model in MuJoCo is fully observed in the state? What is the issue? Did you try having reward as another label?
For MuJoCo, I think the reward is a defined function of (s, a, s'). However, s' is also a function of s and a in MuJoCo, so I don't know how to plug in my own s' to get the reward, so I didn't add reward as an additional label.
I can't see any of your wandb run, either try to make it public or add me in the project group (pagand@sfu.ca)
Sorry about wandb, I uploaded to the wrong account; it should be fixed now. The sequence_num=3 run can be seen here: https://wandb.ai/jnqian/ORL_optimizer/runs/6c79618b-9070-4171-b0e6-89b676b2b4a6 I also did a sequence_num=1 run: https://wandb.ai/jnqian/ORL_optimizer/runs/bf304b2f-8b50-43a5-8632-78602b48cd2d which is still bounded, but not as tightly bounded as sequence_num=3.
what do you mean "there is still no effective way to incorporate a_i+1 to a_i+3 into the scene"
see above
There should be some way to predict more states each time, by having a list of predictions instead of a single value; the structure (batch_num, future_number, state_dim) sounds reasonable.
I don't know how to do that right now.
That's really good, so it means only 3 previous states were enough to predict state and reward? And can you find the R2 score for the test dataset?
See the wandb run above; the mean MSE is around 0.72 for sequence_num=3.
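For the R2 question, a minimal way to compute it from the test predictions, hand-rolled so it is self-contained (for 1-D targets this matches the standard `sklearn.metrics.r2_score` definition):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score(y, y)                     # perfect prediction
baseline = r2_score(y, np.full(4, y.mean())) # mean predictor
```

A perfect predictor scores 1.0 and a constant mean predictor scores 0.0, which makes R2 easier to compare across state dimensions than a raw MSE.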
Are you using the offline dataset?
Yes I am using halfcheetah-medium-v2 offline dataset
@jnqian99 Please update the issue title. Isn't this one finished? If yes, change it to done.
Initial submission to GitHub:
https://github.com/jnqian99/CORL/tree/main
The two files added are:
https://github.com/jnqian99/CORL/blob/main/algorithms/offline/gru_brac.py
https://github.com/jnqian99/CORL/blob/main/configs/offline/gru_brac/halfcheetah/medium_v2.yaml
I have no GPU right now, so I ran a test with num_epochs: 100, num_updates_on_epoch: 100.
The results for ReBrac are:
https://wandb.ai/sfu-jnqian/ReBRAC/runs/427c8d31-befa-4fbc-88bb-4cbbd94081d6?nw=nwuserjnqian
for gru_brac: https://wandb.ai/sfu-jnqian/gru_brac/runs/2230ef07-5833-4608-b838-c0b92364727e?nw=nwuserjnqian
It seems gru_brac gains an advantage in normalized_score_mean toward the end.
CON: training speed is slower than ReBrac.
Next steps: bug fixing, adapting to different time-series steps, code cleanup, and more documentation.
Also need to test more gym configs with higher epochs and updates: num_epochs: 1000, num_updates_on_epoch: 1000.