isaac-sim / IsaacGymEnvs

Isaac Gym Reinforcement Learning Environments

Test accuracy question for IndustReal #182

Closed ismarou closed 7 months ago

ismarou commented 8 months ago

Hello,

During training, the Sampling-Based Curriculum (SBC) is initialized at +0.01 (the easiest case) and progressively moves to -0.01 (the hardest case). After training, the policy network is saved. If we then run train.py with test=True and that checkpoint, we see that the curriculum value is not restored at the hardest point of curr_disp_max=-0.01, but is instead loaded at the easiest one of +0.01.

Is this on purpose? Is this the curriculum value with which the reported insertion accuracy results were produced (both in sim and real)?

Or should I load the hardest value when testing? Because when I did so, the testing accuracy dropped to ~70%, far from what is reported in the tables.

Thank you in advance.

Best, Isidoros

bingjietang718 commented 8 months ago

Hi Isidoros,

During testing, the curriculum difficulty needs to be set in IndustRealTaskPegsInsert.yaml, in this line.

The curriculum difficulty is not saved with the policy weights. In IndustReal, we run train.py with test=True and task.rl.initial_max_disp=-0.01. The success rate reported in our paper is not an average over all environments for 1 episode, but an average over 1000 episodes. In the real world, the policy was tested with height randomization (the distance from the plug bottom to the top of the socket) of [10, 20] mm, which is even harder than the most difficult curriculum setting. The real-world randomization parameters are listed in the paper appendix, TABLE XII.
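
If it helps, here is a minimal sketch of such an evaluation command, assuming the standard IsaacGymEnvs train.py Hydra overrides; the checkpoint path below is only a placeholder:

```bash
# Evaluate a trained pegs-insertion policy at the hardest curriculum setting.
# The checkpoint path is a placeholder; point it at your own run directory.
python train.py task=IndustRealTaskPegsInsert \
    test=True \
    checkpoint=runs/IndustRealTaskPegsInsert/nn/last_checkpoint.pth \
    task.rl.initial_max_disp=-0.01
```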

Best, Bingjie

ismarou commented 8 months ago

Hi Bingjie,

First of all, thank you for the clarification.

Does that mean that this average over 1000 episodes (by the way, do you use the same seed or a different one per episode?) is also conducted during training, i.e. did you modify the RL Games library for that? Because if you run train.py and select the policy with the highest insertion rate at the hardest curriculum using the public RL Games library, the training results cannot be reproduced in a new testing scenario, again at the hardest curriculum, since RL Games picks the best policy based on a single averaged sample, not multiple.

bingjietang718 commented 8 months ago

Hi Isidoros,

We did not modify RL Games. RL Games saves the policy with the highest reward (which does not necessarily correspond to the highest success rate). Since our reward is scaled based on the current curriculum difficulty, the saved policy weights with the highest reward correspond to the highest curriculum difficulty.
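
To make the logic explicit, here is a schematic version of the argument (the actual scaling function used in IndustReal is defined in the paper and the task code; the factor below is only a placeholder):

$$\tilde{r} = r \cdot \lambda(d), \qquad \lambda(d_{\text{easy}}) < \lambda(d_{\text{hard}})$$

Because the scaling factor grows with the curriculum difficulty d, the largest scaled rewards are only attainable at the hardest setting, so the checkpoint RL Games keeps as "best reward" comes from the hardest curriculum stage the policy has reached.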

Best, Bingjie

ismarou commented 8 months ago

Hi Bingjie,

Oh, I see. But again, you save the policy with the highest (scaled) reward based on one episode averaged over all environments, right? So when you say in the paper that you average over multiple seeds, do you train with one seed and test with multiple ones?

Best, Isidoros

bingjietang718 commented 8 months ago

Hi Isidoros,

When we run the evaluation for IndustReal in simulation, for the saved highest-reward policy we evaluate with 5 different seeds in total, and for each seed we evaluate over 1000 episodes.
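
As a rough sketch of that protocol (assuming the standard IsaacGymEnvs Hydra overrides; the checkpoint path is a placeholder, and the per-seed episode count depends on how the evaluation run is configured):

```bash
# Sketch: evaluate the same highest-reward checkpoint with 5 different seeds,
# then average the per-seed success rates (each measured over ~1000 episodes) offline.
CKPT=runs/IndustRealTaskPegsInsert/nn/last_checkpoint.pth   # placeholder path
for SEED in 0 1 2 3 4; do
    python train.py task=IndustRealTaskPegsInsert \
        test=True \
        checkpoint=$CKPT \
        task.rl.initial_max_disp=-0.01 \
        seed=$SEED
done
```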

Best, Bingjie

ismarou commented 8 months ago

Hi Bingjie,

Ok, I believe this clarifies things. Greatly appreciated.

Best, Isidoros