Angramme opened this issue 4 months ago
Code in this branch: https://github.com/nilscrm/stackelberg-ml/blob/sample_efficiency/graph.ipynb
I added some config options for sample efficiency measurements. The measurement code itself lives directly inside train_mal; it can probably be adapted to the other approaches. All in all, the sample efficiency does not look great compared to the other approaches.
Data was generated by running train_contextualized_MAL
with different configs, notably different alpha values for mixing the agent reward into the model reward, and different seeds. The details are in the graph Jupyter notebook linked above.
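For reference, the alpha mixing described above could look roughly like this. This is an illustrative sketch only: the function name and signature are assumptions, and the actual computation lives in train_mal.

```python
import numpy as np

def mixed_model_reward(model_reward, agent_reward, alpha):
    """Blend the agent's reward into the model's reward.

    `alpha` is the mixing coefficient varied across configs:
    alpha=0 keeps the pure model reward, alpha=1 uses only the
    agent reward. (Hypothetical helper, not the code in train_mal.)
    """
    return alpha * np.asarray(agent_reward) + (1 - alpha) * np.asarray(model_reward)
```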
Here is a graph with 40 sample points for sample counts below 2_000 (one point every 25 env samples).
Added 4 more runs per alpha for the first graph (so 16 runs total),
then increased the smoothing factor.
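The smoothing applied to the curves is presumably an exponential moving average in the style of TensorBoard; a minimal sketch, assuming that's what the notebook does (the function name and exact formula are assumptions, not the notebook's code):

```python
import numpy as np

def smooth(values, factor=0.9):
    """Exponential moving average over a reward curve.

    `factor` is the smoothing factor: closer to 1 means smoother curves
    that lag the raw data more. (Illustrative sketch only.)
    """
    out, last = [], values[0]
    for v in values:
        last = factor * last + (1 - factor) * v
        out.append(last)
    return np.array(out)
```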
In this graph, alpha is the constant mixing ratio of random trajectories, as we did before. 100% random performs the best.
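The constant mixing of random trajectories could be sketched as follows; the function name, arguments, and sampling details are assumptions for illustration, not the actual training code:

```python
import random

def mix_trajectories(policy_trajs, random_trajs, alpha, rng=random):
    """Build a training batch where a fraction `alpha` of the
    trajectories are random rollouts (alpha=1.0 -> 100% random).

    Hypothetical sketch of the constant-mixing scheme.
    """
    n = len(policy_trajs)
    n_random = round(alpha * n)
    batch = rng.sample(random_trajs, n_random) + rng.sample(policy_trajs, n - n_random)
    rng.shuffle(batch)
    return batch
```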
It still does not reach max reward with only 20_000 samples. Even less sample efficient than Yannic's...
By training on hypothetical world models, we might need less data from the original environment. But does our algorithm actually need fewer samples than typical RL run directly on the real environment? Use the following class to compare Yannic's algorithm (which uses the policy's cumulative reward as the reward for the model agent) against simply training the policy on real-environment samples.
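The class referenced above is not included in the comment. As a placeholder, a comparison harness along these lines could work; every name here is an assumption, and this is only a sketch of the bookkeeping such a class would need:

```python
from dataclasses import dataclass, field

@dataclass
class SampleEfficiencyComparison:
    """Hypothetical harness for comparing sample efficiency of two methods.

    Records (env_samples, eval_reward) pairs per method so the curves can
    be plotted against each other and compared at a target reward.
    """
    curves: dict = field(default_factory=dict)

    def record(self, method: str, env_samples: int, eval_reward: float):
        """Log one evaluation point for `method`."""
        self.curves.setdefault(method, []).append((env_samples, eval_reward))

    def samples_to_reach(self, method: str, target_reward: float):
        """First env-sample count at which `method` reaches `target_reward`,
        or None if it never does."""
        for n, r in self.curves.get(method, []):
            if r >= target_reward:
                return n
        return None
```

With this, one could record points for "MAL" and a "real-env PPO" baseline and compare `samples_to_reach` at the max reward.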