Glad you got it sorted. I agree that the functions envs.get_environment vs. envs.create are not the most descriptive in terms of telling you what they're actually doing.
Just between you and me (and the rest of the internet), this whole envs.register / envs.create business is a bit of an overwrought abstraction. I've found it simpler to just import the env, instantiate it, and wrap it myself.
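For what it's worth, a minimal sketch of that approach; the import paths and class names below are assumptions and tend to shift between brax versions:

```python
# Instantiate and wrap the env by hand instead of going through envs.create.
# Import paths and class names are assumptions; they vary across brax versions.
from brax.envs.half_cheetah import Halfcheetah
from brax.envs.wrappers.training import AutoResetWrapper, EpisodeWrapper

env = Halfcheetah()
env = EpisodeWrapper(env, episode_length=1000, action_repeat=1)
env = AutoResetWrapper(env)
```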
I was using the sac train.py function that is available in brax. I took a look at the full return of the unrolled scan for the evaluation, i.e. I removed the [0] index at the end of the unroll function.
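For reference, a minimal sketch of what I mean; env, policy, and the cut-down Transition container are simplified stand-ins for what brax's evaluator actually passes around:

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp

# Cut-down stand-in for brax's Transition, which also carries observations,
# actions, and extras.
class Transition(NamedTuple):
    reward: jnp.ndarray
    discount: jnp.ndarray

def generate_unroll(env, env_state, policy, key, unroll_length):
    """Rolls out `policy` in `env` for `unroll_length` steps via lax.scan."""
    def step_fn(carry, _):
        state, current_key = carry
        current_key, next_key = jax.random.split(current_key)
        action = policy(state.obs, current_key)
        next_state = env.step(state, action)
        transition = Transition(reward=next_state.reward,
                                discount=1.0 - next_state.done)
        return (next_state, next_key), transition

    (final_state, _), data = jax.lax.scan(
        step_fn, (env_state, key), None, length=unroll_length)
    # The evaluator keeps only the final state (hence the trailing [0]);
    # returning the full tuple also exposes the stacked per-step data, whose
    # fields have shape (unroll_length, num_envs, ...).
    return final_state, data
```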
Now the evaluation returns the eval state (which includes the eval metrics used for logging, i.e. eval_state.info['eval_metrics']) and the data of the full scan. For example, I can look at the full discounts of the episode. My settings were an episode length of 1000 with an action repeat of 1, so the rollout length is 1000 (the first dimension of data), and I used 10 envs (the second dimension of data). Then I took a look at the discounts:
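Roughly like this, assuming data exposes a discount field as in the sketch above:

```python
discounts = data.discount
print(discounts.shape)                # (1000, 10): (unroll_length, num_envs)
print((discounts == 0).any(axis=0))   # which envs hit a terminal step at all
```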
That made sense at first glance: the episode terminated after 1000 steps (I was using half cheetah). However, it also terminated in between, i.e. some envs show a zero discount well before step 1000.
This is an issue, because the summed rewards then do not make sense either. For example, if I want to compute the full (undiscounted) return of each episode, I would use:
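Something like the following, with data.reward as in the sketch above:

```python
# Undiscounted return per environment: sum rewards over the time axis.
episode_returns = data.reward.sum(axis=0)  # shape (num_envs,)
```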
Compare this with the cumulative reward that is already computed in the eval metrics:
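The episode_metrics access below is my best guess at the layout of brax's EvalMetrics container:

```python
# Per-env cumulative reward as logged by the evaluator; the
# episode_metrics['reward'] key is an assumption about EvalMetrics.
eval_returns = eval_state.info['eval_metrics'].episode_metrics['reward']
```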
They are different, but we can recover the logged values by summing the reward only up to the 500th time step:
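Concretely:

```python
# Summing only the first 500 steps reproduces the logged cumulative reward,
# which is what pointed to the episodes terminating early.
truncated_returns = data.reward[:500].sum(axis=0)
```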
So I am not sure whether this is a bug or intended behavior. If it is intended, what is the reason for it?
Edit:
This bug was on my side. The issue was that I was creating the env using envs.create(env_name) instead of envs.get_environment(env_name), which works fine. The first method already wraps the env, so calling the wrap function again, as is done in the SAC training pipeline, creates a double wrapping that leads to unwanted behaviors.
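For illustration, a sketch of the two variants; the sac import path and the train arguments are assumptions that may differ between brax versions:

```python
from brax import envs
from brax.training.agents.sac import train as sac  # path is version-dependent

# envs.create already applies the training wrappers (episode handling,
# auto-reset, ...), and sac.train wraps its environment argument again,
# so this double-wraps the env:
#   env = envs.create('halfcheetah', episode_length=1000)

# Passing the raw environment lets the training pipeline do the (single)
# wrapping itself:
env = envs.get_environment('halfcheetah')
make_inference_fn, params, metrics = sac.train(
    environment=env,
    num_timesteps=1_000_000,
    episode_length=1000,
    action_repeat=1,
    num_envs=10,
)
```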