Closed ViktorM closed 7 years ago
What you suggest is basically "the observation does not represent the state". Please consider reading https://www.crowdai.org/topics/tutorial-getting-the-skeleton-to-walk/comments/new (requires login) about the "Markovianess" of the observation. There are a lot of tricks to convert a series of observations into states.
The only problem now is that the agent can become "blind": if a very small obstacle is placed in front of him, he cannot see any huge obstacle behind it, which might be located just centimeters further than the smaller one.
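One common trick of this kind is to concatenate the last few observations into a single state vector. Below is a minimal sketch of that idea, not anything from osim-rl: the class name `ObservationStacker` and the history length `k` are made up for illustration, and the observation is assumed to be a flat list of numbers.

```python
from collections import deque


class ObservationStacker:
    """Hypothetical helper: builds a state from the last k observations."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def reset(self, first_obs):
        # At episode start there is no history yet, so repeat the first frame.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(list(first_obs))
        return self.state()

    def step(self, obs):
        # Push the newest observation and return the stacked state.
        self.frames.append(list(obs))
        return self.state()

    def state(self):
        # Flatten the history, oldest observation first.
        return [value for frame in self.frames for value in frame]
```

The policy would then be fed the output of `reset` and `step` instead of the raw observation, so it can infer velocities and recent changes from consecutive frames.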
Regarding the past obstacles, indeed there are ways to keep this in memory as @ctmakro suggested. At the very least, one can transform observations manually in some pre-processing step.
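As a rough sketch of such a pre-processing step, the helper below remembers every obstacle that has been reported so far and always returns the two furthest ones. It is only an illustration: `ObstacleMemory` is a made-up name, and it assumes each obstacle arrives as an `(x, y, radius)` triple with an absolute x coordinate (if the environment reports x relative to the pelvis, the pelvis x would have to be added first).

```python
class ObstacleMemory:
    """Hypothetical pre-processor that remembers obstacles already seen."""

    def __init__(self):
        self.seen = []  # distinct obstacles, kept sorted by x

    def update(self, obstacle):
        # Assumption: obstacle is (x, y, radius) with absolute x.
        obstacle = tuple(obstacle)
        if obstacle not in self.seen:
            self.seen.append(obstacle)
            self.seen.sort()
        # Return (previous, latest); previous is None until two are known.
        if len(self.seen) == 1:
            return None, self.seen[0]
        return self.seen[-2], self.seen[-1]
```

Because the result depends only on which obstacles have been seen, it does not flip when the raw observation jumps back and forth between two neighbouring obstacles.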
There is a problem of blindness, but let's just treat it as a part of the environment... It's a little bit like walking in the dark -- the agent should learn to be careful :)
All in all, it does not seem to be an issue big enough to justify changing the environment in the last month of the challenge.
At the moment we get the next obstacle's position and radius as observations: https://github.com/stanfordnmbl/osim-rl/blob/master/osim/env/run.py#L110
I can see a few potential issues with this approach that can affect learning:
1) When the agent does not walk well yet and is learning to get over an obstacle, its body can swing and the pelvis X position can oscillate back and forth a bit above the sphere. The next-obstacle observation can then change a lot during this swinging, jumping between the current sphere's position and the next sphere's position several times. That is not good for training, and such large jumps in observation values are undesirable in general when the character's position barely changes.
2) When the pelvis X coordinate becomes larger than the current obstacle's position, the observation jumps to the next obstacle and the agent "forgets" about the obstacle he has just passed. One leg can still be behind this old obstacle, and missing the position and radius of that obstacle does not help in learning a good locomotion policy.
This is potentially a quite large and breaking change, but maybe the obstacle observation could be updated? For example, send information about the 2 nearest obstacles?
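To make the proposal concrete, here is a small sketch of what a "2 nearest obstacles" observation could look like. It is not the osim-rl implementation: the function name `nearest_obstacles`, the `(x, y, radius)` obstacle layout, and the padding value used when fewer obstacles exist are all assumptions for illustration.

```python
def nearest_obstacles(obstacles, pelvis_x, n=2, pad=(100.0, 0.0, 0.0)):
    """Return the n obstacles closest to pelvis_x as (dx, y, radius).

    obstacles: list of (x, y, radius) with absolute x (assumed layout).
    pad: hypothetical filler used when fewer than n obstacles exist.
    """
    # Pick the n obstacles with the smallest horizontal distance.
    closest = sorted(obstacles, key=lambda o: abs(o[0] - pelvis_x))[:n]
    # Order them by position so the slots stay consistent over time.
    closest.sort(key=lambda o: o[0])
    # Express x relative to the pelvis: negative means already passed.
    result = [(x - pelvis_x, y, r) for x, y, r in closest]
    while len(result) < n:
        result.append(pad)
    return result


obstacles = [(1.0, 0.0, 0.1), (2.0, 0.0, 0.2), (5.0, 0.0, 0.3)]
before = nearest_obstacles(obstacles, 0.75)  # pelvis just before the first sphere
after = nearest_obstacles(obstacles, 1.25)   # pelvis just past the first sphere
```

In this example both calls report the same two spheres, and the relative x values shift only by the distance the pelvis moved, so swinging above the first sphere does not make the observation jump, and the sphere that was just passed remains visible with a negative offset.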