stanfordnmbl / osim-rl

Reinforcement learning environments with musculoskeletal models
http://osim-rl.stanford.edu/
MIT License

Reward calculation for RunEnv #43

Closed fgvbrt closed 7 years ago

fgvbrt commented 7 years ago

Hi, I noticed that when you calculate the reward you don't update last_state to current_state. The only place where last_state is updated is the reset method. This means that you don't return a one-step reward, but the total reward accumulated up to the current episode step. I think you should fix it :smile:
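
A minimal sketch of the fix being described, for anyone following along -- simplified, not the actual osim-rl code; _simulate and _is_done are placeholder names:

class RunEnvSketch:
    STATE_PELVIS_X = 1  # placeholder index, not necessarily the real one

    def step(self, action):
        self.last_action = action
        self.last_state = self.current_state         # refresh on EVERY step, not only in reset()
        self.current_state = self._simulate(action)  # placeholder: advance the simulation
        return self.current_state, self.compute_reward(), self._is_done(), {}

    def compute_reward(self):
        # One-step reward: pelvis displacement since the previous step, not since reset()
        return self.current_state[self.STATE_PELVIS_X] - self.last_state[self.STATE_PELVIS_X]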

kidzik commented 7 years ago

It is updated here: https://github.com/stanfordnmbl/osim-rl/blob/master/osim/env/osim.py#L168 -- I agree it could be cleaner, but it is fine (for now).

fgvbrt commented 7 years ago

@kidzik, maybe I don't understand something, but:

  1. In your link you update last_action, but not last_state.
  2. I noticed that when the agent moves forward, it always receives a positive reward, even if it is staying in the same place or moving slightly backward. Here is a typical reward curve over time (the lowest value is zero): http://imgur.com/a/be8qF
  3. I explicitly printed env.last_state[env.STATE_PELVIS_X] and env.current_state[env.STATE_PELVIS_X], and the first value is always zero, whereas it should be the previous value of the second.

Am I missing something?

kidzik commented 7 years ago

Unfortunately, that seems right; sorry for the quick, incorrect assessment. Since a fix would quite significantly affect current results, and since the current objective function should still lead to 'moving as far as possible', we will try to choose the least harmful solution. Thanks a lot for the careful analysis and the report!

AdamStelmaszczyk commented 7 years ago

Wow, thanks so much fgvbrt for finding this.

I think this issue should stay open until it is fixed, so that people can spot it more easily. It may also be a good idea to add a "High priority" label or similar.

Since it would quite significantly affect current results and since the current objective function still should lead to 'moving as far as possible', we will try to choose the least harmful solution.

I understand that backward compatibility is a factor; however, I think it's far better to just fix it and roll out the next version of osim-rl. Reasons:

  1. The challenge says it's "Reinforcement learning environments with musculoskeletal models". The current implementation is not a reinforcement learning environment (by the definition in David Silver's course and on Wikipedia), because compute_reward doesn't give the "immediate reward after transition". I think all of the RL libraries I have seen rely on this definition.

  2. We still have 60 days of the competition left; only the first 30 have passed. Nobody has so far achieved a better result on the leaderboard than the baseline submission by the admin. So I don't feel the cost of non-backward-compatible changes is high enough to justify not fixing the env.

LiberiFatali commented 7 years ago

Since this task is computationally expensive, the agent requires a lot of episodes and timesteps just to learn how to walk. Unless we have enormous processing power at hand, any breaking change would force us to restart our training process from scratch, and currently trained agents would be wasted.

This is off topic, but I don't see obstacles in the spMohanty (Admin) submission. Is that really the baseline?

ViktorM commented 7 years ago

I would vote for the correct reward formulation, with the last state recorded at every step; 60 days until the end of the challenge is enough time to train and retrain an agent.

And I second the question about the obstacles.

kidzik commented 7 years ago

We decided to change the reward and I've prepared a tentative branch: https://github.com/stanfordnmbl/osim-rl/tree/iss43. Since the new and the old measures are correlated, we will try to discount the current scores to make them comparable. @LiberiFatali has a good point and we will try to address it fairly, yet for the sake of a more meaningful competition it seems important to correct the error.

We are testing the new environment now and we will post a new benchmark promptly. Sorry for the mess and thank you for your understanding and discussion!

kidzik commented 7 years ago

If you have tried the iss43 branch, please let us know about your experience (here or privately), so that we can roll it out sooner. Note that the difference is that the new reward function is the per-step increment (the "derivative") of the old reward function (modulo a small error on the ligaments). So for previously trained models, taking the cumulative sum of the new rewards as the reward should yield a similar experience, if your algorithm relies on that.
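
In case it helps anyone reuse a previously trained model, a sketch of that idea, assuming a Gym-style step/reset interface (this wrapper is not part of osim-rl):

class CumulativeRewardWrapper:
    """Report the running sum of the new one-step rewards, approximating the old reward."""

    def __init__(self, env):
        self.env = env
        self.total = 0.0

    def reset(self, *args, **kwargs):
        self.total = 0.0
        return self.env.reset(*args, **kwargs)

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        self.total += reward                         # cumulative sum of one-step rewards
        return observation, self.total, done, info   # report the old-style (cumulative) reward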

fgvbrt commented 7 years ago

Hi, I made a similar modification on my own and noticed that when I run my previously trained agents I always obtain a very high reward component related to lig_pen; it is actually higher than the reward for moving forward. Here are typical values of both components (the first is the "moving forward reward" and the second is "lig_pen"). Maybe you should look carefully at the scaling constant for these two parts of the reward; I am not sure it is possible to learn a good policy with this reward:

forward reward        lig_pen
0.00476280700601      0.0164990073588
0.00413903056655      0.0151963977149
0.00363156499301      0.0128450193735
0.00329636943162      0.0110873209941
0.00307123597201      0.0106107392919
0.00313117791239      0.0111621550313
0.00316452449907      0.0110113162483
0.00302489540878      0.00990658850094
0.00333417318044      0.0087098156985
0.00354273304158      0.00657035539484
0.00423457159953      0.00875325506608
0.00533576647573      0.0165944487833
0.0045510766111       0.0180316189574
0.00353006299279      0.01755132944
0.00277684779291      0.016527103372
0.00245134188634      0.0185393699774
0.00230928468998      0.0208992816518
0.0014924378922       0.0203374645968
0.000731370049012     0.0208180524229
0.000236296382949     0.022882241171

kidzik commented 7 years ago

Indeed, with the cumulative sum the ligament forces were negligible. Now they are not negligible anymore, and we also have some issues with returning our benchmark model. I will make them very small in the iss43 branch for testing (not an ideal solution, but we want the models to at least move forward...). Please let us know about any updates -- hopefully we can quickly go back to the normal operation mode of the challenge...

syllogismos commented 7 years ago

Using the info variable, can you give us a list of the different reward components with varying factors, so that we can compare rewards (under different multiplying factors) at the same time? From this, people could compute various total_rewards and distances travelled for their trained agents.

How did you come up with 0.0001 earlier? Ideally, we can't make the ligament factor so insignificant that it plays no role in training, because then there won't be any difference between a vanilla distance reward and this one. But what is the ideal scenario? Will it be okay if an agent walks a longer distance but has a lower total_reward?
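
A sketch of the kind of per-component logging this would enable, assuming the environment reported the raw components in the info dict -- the "forward_delta" and "lig_pen" keys, env, and agent.act() here are placeholders for illustration, not the real osim-rl API:

forward_rewards, lig_penalties = [], []

observation = env.reset()
done = False
while not done:
    observation, reward, done, info = env.step(agent.act(observation))
    forward_rewards.append(info["forward_delta"])  # assumed key
    lig_penalties.append(info["lig_pen"])          # assumed key

# Recompute the total reward offline for several candidate scaling factors
for factor in (1e-4, 1e-6, 1e-7):
    total = sum(forward_rewards) - factor * sum(lig_penalties)
    print(factor, total)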

kidzik commented 7 years ago

I agree that ideally we want to keep the ligament penalty fairly high (which should lead to more human-like movement), but we also need to reduce the chance of not getting any reasonable results at all, as we are in the middle of the challenge. It would also be good to keep the setting similar to what is now in the master branch, where ligament forces are negligible (yet they can still matter when the distances travelled are similar).

0.0001 came from other experiments in our lab in a different setting. We kept it as-is, since it was ok out-of-the-box (with the cumulative reward).

AdamStelmaszczyk commented 7 years ago

Just an idea. How about defining the reward as:

# (...) as in compute_reward; lig_pen is the ligament penalty term discussed above
delta_x = self.current_state[self.STATE_PELVIS_X] - self.last_state[self.STATE_PELVIS_X]

penalty = math.sqrt(lig_pen) / 1000  # requires `import math`

return delta_x - abs(delta_x) * penalty
# equivalent of:
# return delta_x * (1 - penalty) if delta_x >= 0 else delta_x * (1 + penalty)

On fgvbrt's data (I only flipped the sign of delta_x for the lower half, to have some negative values) it would look like this:

delta_x         √lig_pen / 10000    √lig_pen / 1000     abs(delta_x) * penalty  reward
---------------------------------------------------------------------------------------------
0.004762807     0.0164990074        0.1649900736        0.0007858159            0.0039769911
0.0041390306    0.0151963977        0.1519639771        0.0006289835            0.003510047
0.003631565     0.0128450194        0.1284501937        0.0004664752            0.0031650898
0.0032963694    0.011087321         0.1108732099        0.0003654791            0.0029308904
0.003071236     0.0106107393        0.1061073929        0.0003258808            0.0027453551
0.0031311779    0.011162155         0.1116215503        0.0003495069            0.002781671
0.0031645245    0.0110113162        0.1101131625        0.0003484558            0.0028160687
0.0030248954    0.0099065885        0.099065885         0.0002996639            0.0027252315
0.0033341732    0.0087098157        0.087098157         0.0002904003            0.0030437728
0.003542733     0.0065703554        0.0657035539        0.0002327702            0.0033099629
-0.0042345716   0.0087532551        0.0875325507        0.0003706629            -0.0046052345
-0.0053357665   0.0165944488        0.1659444878        0.000885441             -0.0062212075
-0.0045510766   0.018031619         0.1803161896        0.0008206328            -0.0053717094
-0.003530063    0.0175513294        0.1755132944        0.000619573             -0.004149636
-0.0027768478   0.0165271034        0.1652710337        0.0004589325            -0.0032357803
-0.0024513419   0.01853937          0.1853936998        0.0004544633            -0.0029058052
-0.0023092847   0.0208992817        0.2089928165        0.0004826239            -0.0027919086
-0.0014924379   0.0203374646        0.203374646         0.000303524             -0.0017959619
-0.00073137     0.0208180524        0.2081805242        0.000152257             -0.000883627
-0.0002362964   0.0228822412        0.2288224117        0.000054069             -0.0002903663
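
For anyone who wants to double-check the arithmetic, the first row can be reproduced like this (assuming the second column really is √lig_pen / 10000, as the header says):

delta_x = 0.004762807
penalty = 0.0164990074 * 10            # sqrt(lig_pen) / 1000
reward = delta_x - abs(delta_x) * penalty
print(abs(delta_x) * penalty, reward)  # ~0.0007858159 and ~0.0039769911, matching the table
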
kidzik commented 7 years ago

Thanks, @AdamStelmaszczyk! The reward definitely makes sense mathematically and is probably a good approximation of what the old reward was, but we will keep the simpler version, i.e. the current one on the iss43 branch. We are working on rolling it out and updating the scores.

ctmakro commented 7 years ago

Off topic: as I said before, whoever wrote the code should get more sleep...

The reward function is hazardous: an agent that has come to a place further than the origin will tend to just stay balanced rather than run forward, since by staying balanced it has less chance of falling and gets a higher expected future reward. This design makes the agent more reluctant (confirmed by everyone, I believe) the further it goes, so it should definitely be changed.

@LiberiFatali I am Qin Yongliang on the leaderboard. I know it takes days to train a performing agent, but in my honest opinion the environment should really correct its flaws, otherwise all the learning effort devoted to the competition is useless in the real world...

@kidzik It's a good opportunity, since we are changing so many things about the competition. So again, if possible, add two observations: the contact forces of the left and right foot. These could directly help a human agent to walk (your eyes might be fully covered, but you can still walk). Right now I believe most people have to derive them through a second-order derivative (which a traditional deep RL agent is not capable of calculating) or through the distance to the ground.

Also, please do not suggest again and again that "an agent fit for task X should fit task Y very quickly, since task Y is only slightly different from task X", because that's not how RL works. Since the expected reward is skewed (strongly correlated with position), there is no magic way of skewing it back (deleting individual neurons has a disastrous effect on the network) unless we discard our memory and resample the actions without bias. Humans, the best RL agents on earth so far, behave just like that: it's often harder for a person to learn a skill if they have already mastered an opposite skill and believe deeply in its usefulness over other skills.

kidzik commented 7 years ago

I completely agree. Thanks for understanding the true cause of these issues ;)

LiberiFatali commented 7 years ago

Thanks for the info. Will the number of timesteps be changed to 1000 (it is 500 now) in the grader as well?

kidzik commented 7 years ago

Yes, it will be 1000 steps.

fgvbrt commented 7 years ago

Hi, two small details:

  1. Here https://github.com/stanfordnmbl/osim-rl/blob/master/docs/README.md for version 1.4.1 you link to issue 31, which can be confusing for some people, because the actual issue is 43.
  2. This phrase should be corrected, because the forces are multiplied by 10^-7:

The penalty in the total reward is equal to the sum of forces generated by ligaments over the trial, divided by 10000.

marioyc commented 7 years ago

Also related to the evaluation: would it be possible to increase the number of runs an agent is evaluated on? As mentioned in this article, which discusses the subject of reproducibility in RL, 5 runs is already a stretch if we want a good idea of an agent's average result. I understand we could do more submissions to get a better estimate, but nothing in the rules gives participants an incentive to do so. Also, displaying the variance of each agent's results would give us a clearer idea of just how significant the difference between two agents' results is.

On a related topic: since the average of all submissions is currently calculated, this only gives a biased estimate of how the latest agent is performing, because participants will normally keep improving their solutions throughout the competition. Wouldn't it be better to keep something like an "average of the last K submissions"?
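
As a toy illustration of how noisy a small-sample average is (the numbers are made up, not real osim-rl returns):

import random
import statistics

def episode_return():
    # purely illustrative distribution of episode returns
    return random.gauss(20.0, 8.0)

for n_runs in (3, 5, 50):
    # spread of the "leaderboard score" (mean over n_runs) across 1000 simulated submissions
    means = [statistics.mean(episode_return() for _ in range(n_runs)) for _ in range(1000)]
    print(n_runs, round(statistics.stdev(means), 2))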

AdamStelmaszczyk commented 7 years ago

I think Mario made a very good point.

It seems like low-hanging fruit to increase the number of runs from 3 to, for example, 5 or 10. Adding a column to the leaderboard with the variance across runs would also help.

About incentives. As far as I know, the way it works now is that the leaderboard keeps your best score, where a score is the average return from 3 runs. I think the contestants who realize that there might be a large variance in scores have an incentive to submit multiple times, to move their score from "the average over runs" towards "close to the maximum of the averages over runs".

Which seems alright to me, because this is not how the final results are obtained. From the README:

In order to avoid overfitting to the training environment, the top 10 participants will be asked to resubmit their solutions in the second round of the challenge. Environments in the second round will have the same structure but they will be initialized with different seeds. The final ranking will be based on results from the second round.

As I understood it, the top 10 scores will be obtained by averaging 3 runs. This could be problematic: if re-run (with different seeds), it could give a different final top 10 ordering.

One could again increase this from 3 to, for example, 50 runs (only for the top 10, for the final results).

As far as I remember, submitting with 3 runs took about 6 minutes on my machine, so about 2 minutes per run*. So 10 runs should take about 20 minutes, which is fine for a submission, and about 100 minutes for the final submissions from the top 10. This assumes that crowdAI.org can handle up to 10 clients communicating with it constantly for about 100 minutes, likely at the same time.

*However, when I was submitting, the skeleton was quickly falling down, which stopped the run at probably fewer than 100 steps. With models using all 1000 steps, runs could probably be even 10x longer...

kidzik commented 7 years ago

@LiberiFatali thanks for pointing that out. @marioyc @AdamStelmaszczyk that definitely makes sense, but let's come back to this idea as soon as we fix the main issue from this thread. Btw, we know this change can seem chaotic -- it's actually quite stressful from the organizer's standpoint, so thank you all for the support and understanding :+1: We are getting there!

parilo commented 7 years ago

Also, it would be nice to test with different seeds at submission time, in order to avoid overfitting during the challenge. I mean 3 tests for each of 2 or 3 seeds, so it would be 6 or 9 episodes.

kidzik commented 7 years ago

The grader computes the right reward now. Rewards for solutions submitted before 13:19 CEST, July 29th have been adjusted.

LiberiFatali commented 7 years ago

When submitting to the new grader, I get this error: "502 server error bad gateway". It usually happens at timesteps > 500.

AdamStelmaszczyk commented 7 years ago

The README says the penalty is divided by 10^7 (10,000,000), whereas in the current run.py on master the forces are divided by 10^8 (https://github.com/stanfordnmbl/osim-rl/blob/master/osim/env/run.py#L76).

marioyc commented 7 years ago

@AdamStelmaszczyk it says 10e-8, so it's okay; a weird way to write it, though.

fgvbrt commented 7 years ago

@AdamStelmaszczyk 10e-8 == 10**-7, you can check it in Python. {number1}e{number2} in Python means number1 * 10**{number2}.
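
A quick check in a Python shell:

>>> 10e-8
1e-07
>>> 10e-8 == 1e-7
True
>>> 10e-8 == 1e-8
False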

AdamStelmaszczyk commented 7 years ago

Oh right, sorry, it says 10e-8, somehow I "saw" 1e-8, thanks.

Yep, that's a bit confusing, it's not standard form (normalized scientific notation).

kidzik commented 7 years ago

@LiberiFatali we got a few "Broken pipe" errors on our side. I've seen that some of your submissions went through. Is it better now? Is there any pattern? @AdamStelmaszczyk yes, that's correct and super confusing. We will work out these details later, though.

LiberiFatali commented 7 years ago

It seems ok now. Thank you :+1:

kidzik commented 7 years ago

Thanks, closing the issue for now. We will discuss the other requests that popped up here internally. Thank you all for reporting!