Angramme opened this issue 4 months ago
Code in this branch: https://github.com/nilscrm/stackelberg-ml/blob/sample_efficiency/graph.ipynb
I added some config options for sample efficiency measurements. The measurement code itself lives directly inside train_mal; it can probably be adapted to the other approaches. All in all, the sample efficiency does not look great compared to the other approaches.
Data was generated by running train_contextualized_MAL
with different configs, notably different alpha values for mixing the agent reward into the model reward, and different seeds. The details are in the graph Jupyter notebook linked above.
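For reference, the alpha mixing described above could look roughly like this. This is an illustrative sketch only: the function name and signature are assumptions, and the actual computation lives in train_mal.

```python
import numpy as np

def mixed_model_reward(model_reward, agent_reward, alpha):
    """Blend the agent's reward into the model's reward.

    `alpha` is the mixing coefficient varied across configs:
    alpha=0 keeps the pure model reward, alpha=1 uses only the
    agent reward. (Hypothetical helper, not the code in train_mal.)
    """
    return alpha * np.asarray(agent_reward) + (1 - alpha) * np.asarray(model_reward)
```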
Here is a graph with 40 sample points for sample counts below 2_000 (one point every 25 env samples).
Added 4 more runs per alpha for the first graph (so 16 runs total),
then increased the smoothing factor.
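The smoothing applied to the curves is presumably an exponential moving average in the style of TensorBoard; a minimal sketch, assuming that's what the notebook does (the function name and exact formula are assumptions, not the notebook's code):

```python
import numpy as np

def smooth(values, factor=0.9):
    """Exponential moving average over a reward curve.

    `factor` is the smoothing factor: closer to 1 means smoother curves
    that lag the raw data more. (Illustrative sketch only.)
    """
    out, last = [], values[0]
    for v in values:
        last = factor * last + (1 - factor) * v
        out.append(last)
    return np.array(out)
```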
In this graph, alpha is the constant mixing ratio of random trajectories, as we did before. 100% random performs the best.
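The constant mixing of random trajectories could be sketched as follows; the function name, arguments, and sampling details are assumptions for illustration, not the actual training code:

```python
import random

def mix_trajectories(policy_trajs, random_trajs, alpha, rng=random):
    """Build a training batch where a fraction `alpha` of the
    trajectories are random rollouts (alpha=1.0 -> 100% random).

    Hypothetical sketch of the constant-mixing scheme.
    """
    n = len(policy_trajs)
    n_random = round(alpha * n)
    batch = rng.sample(random_trajs, n_random) + rng.sample(policy_trajs, n - n_random)
    rng.shuffle(batch)
    return batch
```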
It still does not reach max reward with only 20_000 samples. Even less sample efficient than Yannic's...
By training on hypothetical world models, we might need less data from the original environment. But does our algorithm actually need fewer samples than typical RL run directly on the real environment? Use the following class to compare Yannic's algorithm (which uses the policy's cumulative reward as the reward for the model agent) against simply training the policy on real-environment samples.
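The class referenced above is not included in the comment. As a placeholder, a comparison harness along these lines could work; every name here is an assumption, and this is only a sketch of the bookkeeping such a class would need:

```python
from dataclasses import dataclass, field

@dataclass
class SampleEfficiencyComparison:
    """Hypothetical harness for comparing sample efficiency of two methods.

    Records (env_samples, eval_reward) pairs per method so the curves can
    be plotted against each other and compared at a target reward.
    """
    curves: dict = field(default_factory=dict)

    def record(self, method: str, env_samples: int, eval_reward: float):
        """Log one evaluation point for `method`."""
        self.curves.setdefault(method, []).append((env_samples, eval_reward))

    def samples_to_reach(self, method: str, target_reward: float):
        """First env-sample count at which `method` reaches `target_reward`,
        or None if it never does."""
        for n, r in self.curves.get(method, []):
            if r >= target_reward:
                return n
        return None
```

With this, one could record points for "MAL" and a "real-env PPO" baseline and compare `samples_to_reach` at the max reward.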