rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.

Benchmark test for each algo #234

Open ryanjulian opened 6 years ago

ryanjulian commented 6 years ago

We need a long-running benchmark test for every algorithm. It should run on the hardest problems that algorithm can solve in <1M timesteps. In most cases, the Mujoco1M suite from OpenAI gym is appropriate.

2 cases:

  1. There exists a high-quality baseline implementation of the algorithm in openai/baselines or elsewhere. In that case, run the baseline algorithm in parallel with ours and verify that we achieve at least the baseline performance on each task, with the same rise time (see the sketch after this list).
  2. There does not exist a high-quality baseline implementation of the algorithm. In that case, run our algorithm for the appropriate number of steps on each problem and verify that it achieves the expected final reward with the expected rise time.
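
A minimal sketch of what the case-1 check could look like, assuming both runs log per-timestep average returns to CSV files with `timestep` and `average_return` columns; the loader, the column names, and the 90%-of-final-return rise-time heuristic are illustrative assumptions, not garage's actual benchmark code.

```python
# Sketch only: the CSV format and the rise-time heuristic are assumptions.
import numpy as np


def load_curve(csv_path):
    """Load (timestep, average_return) pairs from an assumed progress CSV."""
    data = np.genfromtxt(csv_path, delimiter=',', names=True)
    return data['timestep'], data['average_return']


def rise_time(timesteps, returns, fraction=0.9):
    """First timestep at which the curve reaches `fraction` of its final return."""
    target = fraction * returns[-1]
    return timesteps[np.argmax(returns >= target)]


def check_against_baseline(garage_csv, baseline_csv, tolerance=0.05):
    g_t, g_r = load_curve(garage_csv)
    b_t, b_r = load_curve(baseline_csv)
    # Final performance: garage should match or beat the baseline (within tolerance).
    assert g_r[-1] >= (1 - tolerance) * b_r[-1]
    # Rise time: garage should not be substantially slower to reach 90% of its final return.
    assert rise_time(g_t, g_r) <= (1 + tolerance) * rise_time(b_t, b_r)
```
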
cheng-kevin commented 5 years ago

@ryanjulian for case 2, what are we considering the "expected final reward with the expected rise time"?

ryanjulian commented 5 years ago

@CatherineSue can you help explain?

cheng-kevin commented 5 years ago

(Screenshot attached: 2019-06-03 14-44-14)

Are there other "high-quality implementations" besides rlkit, baselines, and tf_agents? These do not have implementations of ERWR, REPS, TNPG, or VPG.

ryanjulian commented 5 years ago

I think that list is basically my universe of "high-quality implementations". You might want to check out stable-baselines.

The list you made is of more "classic" RL algorithms, which explains the lack of other implementations. We should be happy to duplicate others' results from their papers for those.

CatherineSue commented 5 years ago

For case 2, if there are no high-quality implementations we can compare to, we can compare against the performance reported in the papers for those algorithms. Usually the papers include experiments with final reward metrics.
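
A hedged sketch of that comparison, assuming a hand-maintained table of paper-reported final returns; the dictionary keys and placeholder entries below are hypothetical, not actual published numbers.

```python
# Sketch only: PAPER_FINAL_RETURNS would be filled in by hand from the original
# papers; the entries shown are placeholders, not real published results.
PAPER_FINAL_RETURNS = {
    ('VPG', 'HalfCheetah-v2'): None,   # fill in from the paper
    ('REPS', 'Swimmer-v2'): None,      # fill in from the paper
}


def check_against_paper(algo, env, final_return, tolerance=0.1):
    expected = PAPER_FINAL_RETURNS[(algo, env)]
    if expected is None:
        raise ValueError(f'No paper result recorded for {algo} on {env}')
    # Require the measured final return to be within tolerance of the paper's number.
    assert final_return >= (1 - tolerance) * expected
```
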

ryanjulian commented 4 years ago

So I think the clear path to closing this is similar to our strategy for examples -- we can't run each one in full on the CI path, but we can ensure that each algorithm has a relevant benchmark script and that the script can be executed (for perhaps a single epoch).

This ensures that those benchmark scripts can always be run out-of-band from the CI, even as the code changes.
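
A sketch of the kind of CI smoke test this suggests, assuming the benchmark scripts live under a hypothetical `benchmarks/` directory and accept an `--epochs` flag; neither the layout nor the flag is confirmed by this thread.

```python
# Sketch only: the benchmarks/ directory and --epochs flag are assumptions,
# not garage's actual benchmark interface.
import pathlib
import subprocess
import sys

import pytest

BENCHMARK_DIR = pathlib.Path('benchmarks')  # hypothetical location


@pytest.mark.parametrize('script', sorted(BENCHMARK_DIR.glob('*.py')),
                         ids=lambda p: p.name)
def test_benchmark_script_runs_one_epoch(script):
    """Smoke test: every benchmark script should start and finish a single epoch."""
    result = subprocess.run(
        [sys.executable, str(script), '--epochs', '1'],
        capture_output=True,
        timeout=600,
    )
    assert result.returncode == 0, result.stderr.decode()
```
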