stanfordnmbl / osim-rl

Reinforcement learning environments with musculoskeletal models
http://osim-rl.stanford.edu/
MIT License

The videos SUCK, and here's why #52

Closed ctmakro closed 7 years ago

ctmakro commented 7 years ago

Edit: don't read this, scroll the page down

The osim-rl-grader repository is a fork of gym, so I cannot file an issue on it; I'll file it here instead.

https://github.com/kidzik/osim-rl-grader/blob/master/worker_dir/simulate.py#L32-L34

According to the code, you store each submission's action history in Redis, then generate the MP4s by reading the actions back from Redis and simulating them with the same environment again.

Except that you can't: using the same seed generates the same environment, but it does not guarantee the same dynamics given the same actions. Floating-point operations are not exact, and because the system is chaotic, those precision errors accumulate frame by frame and make the re-simulation drift further and further from the original run.

(Since the RunEnv is highly nonlinear and runs at a high FPS, this problem becomes especially apparent.)
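For a toy illustration of how quickly such errors can grow in a chaotic system, here is a generic chaotic map (not the RunEnv dynamics, just an analogy): a perturbation of 1e-12 in the initial value leaves nothing in common after 100 steps.

    # Toy illustration of chaotic error growth (a logistic map, not RunEnv):
    # two trajectories that start 1e-12 apart end up completely different.
    def simulate(x0, steps=100):
        x = x0
        for _ in range(steps):
            x = 3.9 * x * (1.0 - x)   # a chaotic update standing in for one physics step
        return x

    print(simulate(0.5))
    print(simulate(0.5 + 1e-12))      # after 100 steps the two runs bear no resemblance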

So the videos on the leaderboard apparently score much lower than the participants' scores suggest. (In case you wonder: it's NOT due to the participants' poor performance in the first run.)

The correct way to do this: generate PNGs on the fly during submission and convert them to MP4s later. If that causes too much overhead, consider logging the state of the armature and replaying that state for later image generation, instead of replaying the actions.

The videos would look much better that way, and might even make this competition exceed DeepMind's YouTube video of a walking agent in popularity, which is apparently what this whole thing is for.

ctmakro commented 7 years ago

Sorry for the language. Just so you know, I'm from an engineering background, and from my point of view what the code does is basically try to control a closed-loop system with an open-loop method while expecting no error to accumulate, which spells disaster in the context of control system design.

spMohanty commented 7 years ago

@ctmakro : Thanks for your insightful comments. And we would definitely love to generate the videos on the fly.

The reason we go with a replay-based approach is a critical bug in the simbody + opensim integration which makes video generation very unstable on headless servers. More about it here: https://github.com/simbody/simbody/issues/563

The grader and the video generation run on two completely different servers for very strong but frustrating reasons.

So if we do integrate the video generation into the actual grader, the grader will crash on every other submission. It's a weird broken-pipe error, which we haven't yet been able to solve even while working with the OpenSim team.

But the hacky way we work around it is by noticing that the error doesn't occur if we restart the node. (This sucks more, I know, but it works when we don't have an alternative solution!!)

Now, coming back to the videos scoring less than the leaderboard: you make a valid point that action replay is not the best approach because of non-determinism in floating-point operations, but the environment used to generate the videos is exactly the same as the grader's (grading server -> Create Image -> AMI -> Launch another instance), so that risk is hugely minimised. Plus, more importantly, the videos are in any case supposed to be approximations/simulations.

Which brings us to: why do the videos on the leaderboard not match the score? That happens when your model does not generalise well, performing really well in some cases and not so well in many others. The grader runs a total of 3 simulations, and the video generated covers just the first simulation. (As we speak, we are adapting the GIF generation process so that the video represents all 3 simulations.)

So, if the video doesn't represent your score, it means your model performs really badly on the first simulation and well on the other two, which brings the average up to the number you see on the leaderboard. Note that, in any case, models which don't generalise well won't go a long way, especially because in the second round of the challenge we will be testing the top performers against a much larger number of simulations. And at the end of the challenge we would be happy to release the seeds used to instantiate RunEnv for the current three simulations, so you can validate the performance of your models across all three.

Finally, the core of all the frustration lies in the bug in the OpenSim + Simbody integration referenced in the issue I mentioned earlier. Both code bases are open source, and we would really appreciate it if you want to help solve that specific bug (and hence make the video generation architecture much cleaner :D).

Also, if you have a better design in mind for the video generation process, we are all ears, and looking forward to your contributions.

Cheers, Mohanty

ctmakro commented 7 years ago

Edit: don't read this, read the following post

@spMohanty thanks for the reply. To clarify, I did score >15 points on EACH of the 3 runs when making the 16.9-point submission, because I wrote code on the client side that accumulates the reward and prints it out. I was watching my console, so I know. Please don't underestimate the effort I've spent on this competition :) If you check the submission history (if there is such a history), you'll find that our submissions (mostly 1000 steps x 3) are always longer than our videos (obviously less than 1000 steps each).

(BTW, to get an average score of 16 when one run scores 7, my other two runs would need to score 20.5 each, which is not currently possible for my agent - I know this because it's my agent :)

(And BTW, it's not just me: the videos of everybody at the top of the chart sucked. This should not be because the "first run is too difficult", since the environment shown in the video is a simple one with small obstacles.)

On the environment replay precision issue, there's another source of error. Actions are floating-point values. You serialize them, store them in Redis, then read them back from Redis and deserialize them. Can you make sure this process is perfect/lossless? JSON, for example, does not guarantee full precision of its floating-point representation.

A quick solution to this problem: on submission, don't apply the actions directly to the agent; instead, store them in Redis first, then read them back from Redis, then apply them.

Finally, can you tell us which action history you used for the empirical comparison? Did the simulation sustain all 1000 steps?

ctmakro commented 7 years ago

So I did check the code, and yes, you made a great mistake by using str() without thinking:

https://github.com/kidzik/osim-rl-grader/blob/master/gym_http_server.py#L150-L167

    def step(self, instance_id, action, render):
        env = self._lookup_env(instance_id)
        if isinstance( action, six.integer_types ):
            nice_action = action
        else:
            nice_action = np.array(action)
        if render:
            env.render()
        [observation, reward, done, info] = env.step(nice_action)
        obs_jsonable = env.observation_space.to_jsonable(observation)

        if env.trial == 1:
            rPush("CROWDAI::SUBMISSION::%s::trial_1_actions"%(instance_id), str(nice_action.tolist()))

        rPush("CROWDAI::SUBMISSION::%s::actions"%(instance_id), str(nice_action.tolist())) #WTF
        rPush("CROWDAI::SUBMISSION::%s::observations"%(instance_id), str(obs_jsonable))
        rPush("CROWDAI::SUBMISSION::%s::rewards"%(instance_id), str(reward))
        return [obs_jsonable, reward, done, info]

str(array_of_float.tolist()) is the problem: by converting floating-point numbers to a string and back, you lose accuracy.
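A quick way to see the loss (assuming Python 2.7, which the anaconda2 paths in the logs above suggest the servers run):

    # Under Python 2.7, str() keeps far fewer digits than the float actually has.
    x = 1.0 / 3.0            # a typical action value in [0, 1]
    s = str(x)               # what gets pushed to Redis
    print(s)                 # '0.333333333333' -- digits are dropped
    print(float(s) == x)     # False: the replayed action differs from the applied one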

A quick solution to this problem: on submission, don't apply the actions directly to the agent; instead, store them in Redis first, then read them back from Redis, then apply them.

spMohanty commented 7 years ago

Dear @ctmakro,

My apologies for my inexperience in programming,

Since you didn't write BitCoin, I will forgive you this time.

and many thanks for forgiving me.

you made a great mistake by using str() without thinking. str(array_of_float.tolist()) is the problem: by converting floating-point numbers to a string and back, you lose accuracy. On the environment replay precision issue, there's another source of error. Actions are floating-point values. You serialize them, store them in Redis, then read them back from Redis and deserialize them. Can you make sure this process is perfect/lossless? JSON, for example, does not guarantee full precision of its floating-point representation.

Indeed, I did make a great mistake by using str without thinking, and the precision loss during serialization and deserialization is a completely valid issue.

But I would point out that the osim-rl client actually already serializes the actions that you send, and the grading server deserializes them on receiving the POST request. The client does this through json.dumps (https://github.com/stanfordnmbl/osim-rl/blob/master/osim/http/client.py#L39), which in turn internally uses repr to serialize the floating-point values. Note that this design decision was inherited from the openai-gym-http-client (https://github.com/openai/gym-http-api/blob/master/gym_http_client.py#L37), which is also currently used for submissions to all openai-gym environments.

str in particular, in this use case (where all the values are in [0, 1]), will always have an effective precision of 11 significant digits, while repr will always have a precision of 17 significant digits. I have hence modified the grader to serialize values using repr instead of str.
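A quick check of the difference, assuming Python 2.7 (where repr already produces the shortest string that round-trips):

    x = 1.0 / 3.0
    print(float(str(x)) == x)    # False: str() drops digits, the value does not survive
    print(float(repr(x)) == x)   # True: repr() round-trips the exact double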

Now, how badly the loss of precision from 17 digits to 11 digits compounds the problem is something very internal to OpenSim, and I would check with @kidzik to see if it is indeed the cause of why the videos "suck". I would also welcome you to analyse the average number of significant digits in the action values of your submissions after serialization using repr 😉

The problem of the videos being only an approximation of the score still remains, and the perfect solution is to fix the OpenSim broken-pipe error that I referenced in my previous reply. I will give that another go this weekend, and I hope I do manage to fix it.

In the meantime, the key updates to the grading setup are: repr is now used for serialization, just to be safe and consistent, and we now generate the simulation for all three trials instead of just one, to better approximate the score on the leaderboard. All the videos will be regenerated over the next two days, and I do hope the videos suck less after that. (My own submission, with an average score of ~12, actually makes much more sense after viewing all three simulations together 😄)

A quick solution to this problem: on submission, don't apply the actions directly to the agent; instead, store them in Redis first, then read them back from Redis, then apply them.

I do not understand how I can store the actions in Redis first without applying them to the agent, especially when I can only obtain the observations (which your models require to produce the subsequent actions) by applying the actions to the agent. Did I get something wrong?

Wishing you a great weekend.

Cheers, Mohanty

syllogismos commented 7 years ago

OpenSim on OSX and Ubuntu behaved differently in the past. For example, an episode I generated on Ubuntu won't replay exactly the same on OSX. This might be another reason why the generated GIFs differ from the leaderboard scores. Just make sure the OpenSim builds on the scoring machine and the GIF-generation machine are the same.

So can you check whether you are able to replay an episode generated by the scoring machine on the GIF-generating machine? You can use the final reward to see whether the episode got replayed perfectly. Sometimes you won't even be able to replay all the steps, because it diverges within a few steps somewhere in the middle.
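Something like this could serve as the check (a sketch only; stored_actions and seed are assumed to come from the grader's Redis logs):

    # Replay a stored episode and report total reward and episode length; if the
    # scoring machine and the GIF machine disagree on either number, the episode
    # did not replay faithfully.
    from osim.env import RunEnv

    def replay(stored_actions, seed):
        env = RunEnv(visualize=False)
        env.reset(difficulty=2, seed=seed)
        total, steps = 0.0, 0
        for action in stored_actions:
            observation, reward, done, info = env.step(action)
            total += reward
            steps += 1
            if done:
                break
        return total, steps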

The other reason might be that the seed you are passing to reset is not behaving as expected, i.e., the env is not behaving the same way for the same actions. We can also check this on our local machines.

syllogismos commented 7 years ago
In [1]: from osim.env import RunEnv

In [2]: e = RunEnv(visualize=False)
Updating Model file from 30000 to latest format...
Loaded model gait9dof18musc_Thelen_BigSpheres.osim from file /Users/anil/anaconda2/lib/python2.7/site-packages/osim/env/../models/gait9dof18musc.osim

In [3]: e.reset(difficulty=2, seed=0)
Out[3]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 1,
 1,
 100,
 0,
 0]

In [4]: e.reset(difficulty=2, seed=0)
Out[4]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 0.9453866664450463,
 1.1282118971476272,
 1.8141399379930383,
 0.0077763347381071346,
 0.076325483191120966]

In [5]: e.reset(difficulty=2, seed=0)
Out[5]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 1.0766724743764324,
 0.9306695966026792,
 1.6446259893163733,
 0.0052692310549401489,
 0.077682965868789006]

In [6]: e.reset(difficulty=2, seed=0)
Out[6]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 1.037379207069778,
 1.0811978341981092,
 1.1143378747324282,
 -0.0099782368750401355,
 0.075488876894325846]

In [7]:

Is this the expected behaviour? When I reset the env with the same seed, it gives me different observations every time. It's clear if you look at the last elements of the list. The first time I reset it, it doesn't even have the difficulty-2 settings, because the psoas muscles are 1 and the obstacle is 100 m away. And on the other resets, we are not getting the same first observation for the same seed.

If reset is not giving you the same first observation, you can't really replay the episodes.

syllogismos commented 7 years ago

I think reset first resets the environment with the old difficulty settings, and only then applies the new difficulty settings.

So that is why, when you create the environment and reset it for the first time, you get observations as if the difficulty setting were 0. But once you act after resetting, you will get the psoas muscles and obstacles you expect in subsequent steps.

kidzik commented 7 years ago

Thanks, yes, there is an issue with the first observation after the reset, and it should be fixed. It is, however, consistent between the replay and the grader.

spMohanty commented 7 years ago

@syllogismos : Yeah, I did try to keep both environments as close to each other as possible, by using the image of the grading env to create the env for replay. And regarding the issue with reset, as mentioned by @kidzik, the behaviour, even if unexpected, is consistent between the grader and the replay servers.

Coming to the rewards, and divergence in general: there is indeed a small difference in rewards too. At the moment we are regenerating the videos for all the submissions across all the trials, and in the process also keeping a log of the observations and associated rewards. Once this is complete, we will have a more quantitative estimate of the divergence.

ctmakro commented 7 years ago

I do not understand how I can store the actions in Redis first without applying them to the agent, especially when I can only obtain the observations (which your models require to produce the subsequent actions) by applying the actions to the agent. Did I get something wrong?

Let's look at the code again:

    def step(self, instance_id, action, render):
        env = self._lookup_env(instance_id)
        if isinstance( action, six.integer_types ):
            nice_action = action
        else:
            nice_action = np.array(action)
        if render:
            env.render()
        [observation, reward, done, info] = env.step(nice_action)
        obs_jsonable = env.observation_space.to_jsonable(observation)

        if env.trial == 1:
            rPush("CROWDAI::SUBMISSION::%s::trial_1_actions"%(instance_id), str(nice_action.tolist()))

        rPush("CROWDAI::SUBMISSION::%s::actions"%(instance_id), str(nice_action.tolist())) #WTF
        rPush("CROWDAI::SUBMISSION::%s::observations"%(instance_id), str(obs_jsonable))
        rPush("CROWDAI::SUBMISSION::%s::rewards"%(instance_id), str(reward))
        return [obs_jsonable, reward, done, info]

As seen above, on each call to step(id, action, render), the code does the following:

  1. assign action to nice_action, do numpy conversion if necessary
  2. send nice_action to env.step(), let the osim run for a while, then get [o,r,d,i] in return
  3. convert nice_action to a string, and store it into Redis

then when generating the video,

  1. nice_action is read back from Redis, with low precision, so let's call it bad_action
  2. send bad_action to env.step(), and generate a frame

As you can see, on the first run the env took nice_action, but on the second run it took bad_action, hence the divergence.

Assuming the two environments are identical and deterministic (i.e., the situation mentioned by syllogismos doesn't happen), the quick solution is to apply bad_action in both runs. That way the results of the two runs will no longer diverge.

Therefore, you should change the submission code to:

  1. assign...(same)
  2. convert nice_action to a string, and store it into Redis
  3. read nice_action back from Redis. It has now become bad_action, and since serialization & deserialization are deterministic (yes, lossy, but deterministic), you can be sure that the bad_actions in the two runs will be identical.
  4. send bad_action to env.step(), let the osim run for a while...(rest are the same)

Since your rPush() calls don't rely on each other, it really doesn't matter that you store the actions first and the observations later.

You might wonder: won't this degrade the agent's performance in the submission run? In short: no, it won't make any significant difference, because during the submission run the system operates in closed-loop mode. It would work even if your str() had only 8-bit precision. For a 1000-step simulation you should never rely on precision; the chaotic nature of the system will magnify the slightest difference. But you can rely on determinism.
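To see why, here is a toy example (a 1-D unstable plant, nothing to do with RunEnv): in closed loop, coarsely quantizing the applied action barely changes the outcome because the controller keeps correcting from the observed state, while an open-loop replay with a tiny per-step difference blows up.

    def quantize(u, step=1e-3):                # crude stand-in for lossy serialization
        return round(u / step) * step

    def closed_loop(quantize_actions):
        x, cost = 1.0, 0.0
        for _ in range(1000):
            u = -0.6 * x                       # feedback computed from the *current* state
            if quantize_actions:
                u = quantize(u)
            x = 1.1 * x + u                    # unstable plant: errors grow without feedback
            cost += x * x
        return cost

    print(closed_loop(False), closed_loop(True))   # the two costs are very close

    # Open-loop replay: record the actions, then replay them with a tiny perturbation.
    x, actions = 1.0, []
    for _ in range(1000):
        u = -0.6 * x
        actions.append(u)
        x = 1.1 * x + u

    x_replay = 1.0
    for u in actions:
        x_replay = 1.1 * x_replay + (u + 1e-9)     # e.g. a serialization round-off
    print(abs(x_replay - x))                       # enormous: the replay has diverged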

Interestingly,

  1. this is a lazy solution, exploiting the deterministic nature of the ser/deserialization process ("lazy" in the same way that choosing str() was lazy).
  2. but even if you find a way to serialize/deserialize floats with very high precision, you should still {store, then read, then apply} to make sure the two runs are fed the same actions.
  3. but once you {store, then read, then apply}, there seems to be no need for a high-precision ser/deserializer.
ctmakro commented 7 years ago

I took a good read through @spMohanty's broken-pipe issue. To solve it I will have to start a Linux machine and run the grading server; I may also need to start Redis. Could you provide a minimalist Python file, say test_record.py, of less than 50 lines, without database or Amazon S3 code, that recreates the problem? Consider publishing it somewhere so more people can try their hand at it. You used a modified version of simbody, so do we have to / where can we get access to your compiled binary? Is the currently running video-generation server running your modified version, or the original one?

Or, you could just take the "bad_action" solution as mentioned above. It will work as long as the ser/deserialization process is deterministic.

syllogismos commented 7 years ago
In [8]: e1 = RunEnv(visualize=False)
Updating Model file from 30000 to latest format...
Loaded model gait9dof18musc_Thelen_BigSpheres.osim from file /Users/anil/anaconda2/lib/python2.7/site-packages/osim/env/../models/gait9dof18musc.osim

In [9]: e1.reset(difficulty=2, seed=0)
Out[9]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 1,
 1,
 100,
 0,
 0]

In [10]: e1.reset(difficulty=2, seed=0)
Out[10]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 1.1081613453261496,
 0.9253750205054676,
 1.6395004072840016,
 0.027480258975811217,
 0.11113264404012851]

In [11]: e1.reset(difficulty=2, seed=0)
Out[11]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 0.8512827065964473,
 0.7447020273894278,
 1.8788279969308195,
 0.0072943603149147494,
 0.052553553171769879]

In [12]: e1.reset(difficulty=2, seed=0)
Out[12]:
[-0.05,
 0.0,
 0.91,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.06973405523475405,
 0.9707656285552124,
 0.0,
 0.0,
 0.007169537779780744,
 1.5365721883823606,
 0.0,
 0.91,
 -0.09650084892621281,
 0.9964310485677471,
 0.007987580127344573,
 -0.027441466796053905,
 0.007987580127344573,
 -0.027441466796053905,
 -0.11968333174236659,
 0.022952398528571172,
 -0.11968333174236659,
 0.022952398528571172,
 0.9089392567145634,
 0.78377120979066,
 1.1522105103950677,
 -0.034746749602716093,
 0.15965886027881709]

I'm not sure you can say the current buggy reset is consistent. I created a different environment, and only the first reset gave the same observation every time; the subsequent resets again vary. The only way to replay properly is to make sure the first observation is the same. Since reset is inconsistent, maybe you can force osim-rl to take whatever first observation you have in the stored episodes (I don't know if that's possible); only then can you really generate the corresponding videos for the current submissions. Currently on the leaderboard, submissions scoring 13-18 are only reaching 4-6 in their videos. I don't think the precision fix will improve this situation as much as fixing the inconsistent initial states, because I believe precision didn't cause many problems earlier.

ctmakro commented 7 years ago

Coming to the rewards, and divergence in general: there is indeed a small difference in rewards too. At the moment we are regenerating the videos for all the submissions across all the trials, and in the process also keeping a log of the observations and associated rewards. Once this is complete, we will have a more quantitative estimate of the divergence.

Scientific methods are overkill here. Just compare the episode lengths; a diverged agent survives far fewer steps.

spMohanty commented 7 years ago

@ctmakro : Note that the serialization is happening using repr. Meaning: We have repr(nice_action.tolist()) and not str(nice_action.tolist())

The numpy array, or list of floating-point values, that your agent spits out is sent over a POST request by osim-rl after serialisation through json.dumps. Meaning that irrespective of the precision of the action you generate, the server always receives actions with a floating-point precision of 17 significant digits.

I am convinced that the problem we are facing is not caused by the loss of precision, as the precision of the actions received by the grading server and the replay server is now exactly the same: 17 significant digits.

There is also the issue reported by @syllogismos of env.reset behaving in an unexpected way, but @kidzik confirms that it is consistent between the grading server and the replay server.

The problem we are facing is that the grading server and the replay server, even if created from exactly the same image, do not respond in exactly the same way, and the compounding of those errors possibly leads to a divergence in some cases. So if we send the same action to both the grading server and the replay server, they will mostly respond with slightly different observations. I am now recording the difference for all submissions while regenerating the videos for all three trials, and would be happy to report the divergence for all the submissions here.

If we can find a way around the inability to deterministically (or with a reasonable amount of non-determinism) "seed" these environments, then I believe we can guarantee that the videos are exactly the same.

spMohanty commented 7 years ago

@ctmakro : It's not actually very difficult to replicate the broken-pipe error, and you do not even need the modified version of simbody that we use. In any case, it is accessible at: https://github.com/simbody/simbody/issues/563

The best way to replicate the issue is to create a new instance on AWS with Ubuntu 16.04 and set up the usual osim-rl environment from anaconda.

Then set up your favourite fake display server, xpra or xvfb, and try to run a sample script (https://github.com/stanfordnmbl/osim-rl/blob/master/scripts/basic.py) while ensuring visualisation=True.

The first try should work fine. If you try to run the same script again, either referencing the same fake display or a new one, you will see the famed Broken Pipe error. In some cases, it runs for a few tries, but nevertheless fails sooner or later. The fix has been to restart the server, and then it deterministically runs again.
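Roughly, a minimal reproduction script would look something like this (a sketch only, loosely mirroring scripts/basic.py; the visualize=True flag and the 18-element action come from the transcripts and the model name earlier in this thread):

    # test_record.py -- run once against xvfb/xpra (should work), then run it again
    # against the same or a fresh fake display (should hit the broken pipe).
    from osim.env import RunEnv

    env = RunEnv(visualize=True)
    env.reset(difficulty=2, seed=0)
    for _ in range(100):
        observation, reward, done, info = env.step([0.5] * 18)
        if done:
            break
    print("finished without a broken pipe")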

Here is another issue reported on the same issue : https://github.com/stanfordnmbl/osim-rl/issues/13

Let me know if you have trouble replicating it. I will hang out on the gitter channel tomorrow morning when I try to work on this again.

ctmakro commented 7 years ago

Another simple solution: since you guys wrote the environment, you have full access to the model. You can just log the positions of every joint per frame, and replay those (directly setting the values via Python) to generate the video. It will be much (~500x) faster than generating it through simulation.
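Something in this spirit (purely hypothetical: get_pose / set_pose stand in for whatever OpenSim calls actually read and write the joint coordinates; they are not real osim-rl methods):

    import json

    def get_pose(env):
        # placeholder: pull the joint coordinates out of the model / observation
        return list(env.current_pose)

    def set_pose(env, pose):
        # placeholder: write the coordinates straight back into the OpenSim state
        env.current_pose = pose

    def record_frame(redis_client, instance_id, env):
        redis_client.rpush("CROWDAI::SUBMISSION::%s::poses" % instance_id,
                           json.dumps(get_pose(env)))

    def render_replay(redis_client, instance_id, env):
        for raw in redis_client.lrange("CROWDAI::SUBMISSION::%s::poses" % instance_id, 0, -1):
            set_pose(env, json.loads(raw))
            env.render()   # pure rendering, no physics stepping, so nothing can diverge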

ctmakro commented 7 years ago

The numpy array, or list of floating-point values, that your agent spits out is sent over a POST request by osim-rl after serialisation through json.dumps. Meaning that irrespective of the precision of the action you generate, the server always receives actions with a floating-point precision of 17 significant digits.

I may have to be more precise here.

When talking about precision, you might think: ah, since the precision is already lost during JSON ser/des, it doesn't matter if we str() or repr() it once more. It can't lose more precision, can it?

Let's say my agent generated the action 0.5678901234. After JSON ser/des, the grader got 0.5678901075.

But after you repr() it, it becomes very precise, say 0.56780107502 (the trailing 2 is a result of approximating binary with decimal). You store that in Redis. After you deserialize it from Redis, you get 0.56780107502. The precision is still higher than the JSON version, but there is a difference of 0.00000000002. Risk of divergence confirmed.

So the correct way to put it: the absolute precision doesn't matter. To produce perfectly identical results, the numbers don't have to be precise, they just have to be identical. That's why I suggest storing, then reading, then applying.

kidzik commented 7 years ago

@syllogismos that's right, there is an inconsistency, and I was just trying to figure it out. The reason is silly -- we have a condition here https://github.com/stanfordnmbl/osim-rl/blob/master/osim/env/run.py#L229-L230 and it fails with seed == 0. Thanks for catching this one.
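For anyone following along, the general shape of that kind of bug is the classic falsy-zero check (a hypothetical illustration only, not the actual run.py code):

    import numpy as np

    def seeded_rng(seed=None):
        # `if seed:` is False for seed == 0, so seed 0 silently falls through
        # to the unseeded branch and the obstacles differ on every reset.
        if seed:                            # buggy check
            return np.random.RandomState(seed)
        return np.random.RandomState()

    def seeded_rng_fixed(seed=None):
        if seed is not None:                # the usual fix
            return np.random.RandomState(seed)
        return np.random.RandomState()

    print(seeded_rng(0).rand(), seeded_rng(0).rand())              # differ
    print(seeded_rng_fixed(0).rand(), seeded_rng_fixed(0).rand())  # identical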

ctmakro commented 7 years ago

My solution, in code:

    def step(self, instance_id, action, render):
        env = self._lookup_env(instance_id)
        if isinstance( action, six.integer_types ):
            nice_action = action
        else:
            nice_action = np.array(action)
        if render:
            env.render()

        if env.trial == 1:
            rPush("CROWDAI::SUBMISSION::%s::trial_1_actions"%(instance_id), str(nice_action.tolist()))

        rPush("CROWDAI::SUBMISSION::%s::actions"%(instance_id), str(nice_action.tolist()))
        bad_action = rRead("CROWDAI::SUBMISSION::%s::actions"%(instance_id), -1) 
        # read the last item from the list formed by rPush()
        # this rRead operation should be identical to the video generation code.
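        # (rRead is assumed to also deserialize the stored string back into an
        #  action array before returning it, so env.step() gets the same type as before.)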

        [observation, reward, done, info] = env.step(bad_action)
        obs_jsonable = env.observation_space.to_jsonable(observation)

        rPush("CROWDAI::SUBMISSION::%s::observations"%(instance_id), str(obs_jsonable))
        rPush("CROWDAI::SUBMISSION::%s::rewards"%(instance_id), str(reward))
        return [obs_jsonable, reward, done, info]
spMohanty commented 7 years ago

@ctmakro : The rPush followed by rRead is redundant, I believe. This is what I do now to guarantee that the exact same action is applied on both servers.
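That is, roughly (a sketch; ast.literal_eval just stands in for whatever parsing the replay side actually does):

    import ast
    import numpy as np

    # Serialize the action, parse that exact string back, and apply the parsed value,
    # so the grading run and the replay run feed the environment bit-identical inputs.
    nice_action = np.random.uniform(0, 1, 18)      # whatever the agent sent
    serialized = repr(nice_action.tolist())        # the string that gets stored for replay
    applied_action = np.array(ast.literal_eval(serialized))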

ctmakro commented 7 years ago

@spMohanty that's exactly what I mean (and yes, there's no need to read it back from Redis; just deserializing would be enough), cool

parilo commented 7 years ago

In my last submission my agent never fell in any of the 3 episodes, but in the video I see instant falls at the start of all 3 episodes. So after you apply the deterministic serialization fix mentioned above, will I get a score of about 1-2 instead of 21 (and the same for the other top competitors)? Is it then reasonable to include such a trick in the training environment too, so the agent can adapt to the ser/des issue during training? I think the client training env must have exactly the same behaviour as the server-side grader (the client must be able to apply exactly the same actions locally and on the server). Reproducibility between the client side and the server side is important, isn't it?

In addition: since all top solutions show significantly worse behaviour in the videos, does it mean that applying this repr/eval trick would add some really big bias bug into the grader and break all the top solutions? I mean, it is strange that it would affect all solutions. My last best solution scored 17 and had poor performance in the video. Before submission I tested it on 30 different seeds and it gave an average score of about 18, and I think it is strange that such a trick could change it to a score of about 5-6.

spMohanty commented 7 years ago

Woot Woot !!

This issue seems to have been fixed!! And the fix was embarrassingly simple: https://github.com/spMohanty/osim-rl-grader/commit/b1b68c5cff59d056b455d80790737969929d73ee

A BIG shoutout to @kidzik for pointing out the silly bug !!

Some interesting observations:

Finally, more W00ts again for the awesome teamwork !!

spMohanty commented 7 years ago

@parilo : Can you check your video now? If your best solution has a score of 17, it should already have been regenerated. And as I mentioned in the previous comment, the serialization -> deserialization doesn't actually affect the grading results, but it theoretically minimises the risk of divergence, especially because your agent will now adapt during grading to any noise created by floating-point loss, and the same adaptations will also apply during the replay.

When you reported the issue, the videos were still being regenerated. At least now, all the videos at the top of the leaderboard make much more sense! Let me know if yours makes sense now; otherwise shoot me an email (sharada.mohanty@epfl.ch) with your crowdAI username, and I will check the divergence between the grader and the replay server for your submission.

Cheers, Mohanty

syllogismos commented 7 years ago

@spMohanty can you make the leaderboard front end load a video only when someone is interested and clicks on it? Currently all the videos load at the same time; it takes up all the memory on my computer and slows everything down.

spMohanty commented 7 years ago

@syllogismos : Agreed, I know about this issue. Here is the corresponding GitHub issue: https://github.com/crowdAI/crowdai/issues/239

Nudging @seanfcarroll too :wink:

PS: The code for the platform is open source, we really love pull requests too 👼

LiberiFatali commented 7 years ago

glad I could help by doing... nothing :).

I knew that the video only showed one of the 3 runs, and since some top videos ended so soon, I assumed those agents had performed better in the remaining 2 runs. Now I see this is a bug.

So does this only affect the visualization (video) ? Is there anything we need to know to train the agent better?

spMohanty commented 7 years ago

@LiberiFatali : No, this doesn't affect the client or your training in any way. The videos on the leaderboard are being updated, and you will just get a much better approximation of your score in the leaderboard videos now.

parilo commented 7 years ago

Firefox may solve your problem with videos in the browser

On 21 Aug 2017 at 1:02 AM, "Adam Stelmaszczyk" <notifications@github.com> wrote:

A quick and easy win would be to not display videos, but links to download them.

I believe now lots of people (including me) sees something similar to:

[image: image] https://user-images.githubusercontent.com/733573/29498824-d6125360-8603-11e7-9d75-4e53f460806a.png


AdamStelmaszczyk commented 7 years ago

Thanks, they indeed show up with Firefox.

(If you are wondering which message parilo replied to: this one. By mistake I initially posted it in this thread instead of the crowdAI one; when I realized, I removed it from here and posted it there, but people with email notifications had already been notified instantly, sorry.)