nilscrm / stackelberg-ml


Investigate sample efficiency. #30

Open Angramme opened 4 months ago

Angramme commented 4 months ago

By training on hypothetical world models, we may need less data from the original environment. Does our algorithm actually need fewer samples than a typical RL algorithm trained on the real environment? Use the following class to compare Yannic's algorithm (which uses the policy's cumulative reward as the reward for the model agent) against simply training the policy on real-environment samples.
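Roughly, the comparison I have in mind looks like the sketch below: record the evaluation reward against the number of real-environment samples for both methods and check which one reaches a target reward with fewer samples. The curve data and helper names here are placeholders, not the actual project code.

```python
import numpy as np
import matplotlib.pyplot as plt

def samples_to_threshold(env_samples, rewards, threshold):
    """First number of real-env samples at which the eval reward reaches `threshold`."""
    for n, r in zip(env_samples, rewards):
        if r >= threshold:
            return n
    return None

# Placeholder learning curves: (cumulative real-env samples, eval reward) pairs
# that would come out of the two training procedures being compared.
mal_samples, mal_rewards = np.arange(0, 20_000, 500), np.zeros(40)
base_samples, base_rewards = np.arange(0, 20_000, 500), np.zeros(40)

plt.plot(mal_samples, mal_rewards, label="MAL (reward mixing)")
plt.plot(base_samples, base_rewards, label="model-free RL on real env")
plt.xlabel("real environment samples")
plt.ylabel("evaluation reward")
plt.legend()
plt.show()

print(samples_to_threshold(mal_samples, mal_rewards, threshold=0.9))
```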

Angramme commented 3 months ago

MAL + Reward

(graph image)

Code in this branch: https://github.com/nilscrm/stackelberg-ml/blob/sample_efficiency/graph.ipynb

I added some config options for the sample efficiency measurements. The measurement code itself sits directly inside train_mal; it can probably be adapted to the other approaches. All in all, the sample efficiency does not look great compared to the alternatives.
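The actual measurement lives in train_mal on that branch; as a rough sketch of the idea (all names below are made up for illustration, not the real code), it boils down to wrapping the training iterations and logging (real env samples used, eval reward) pairs at a fixed interval:

```python
import numpy as np

def train_with_sample_logging(train_step, evaluate, eval_every=25, max_env_samples=2_000):
    """Hypothetical wrapper: `train_step()` runs one training iteration and
    returns how many *real* environment samples it consumed, `evaluate()`
    returns the current policy's average return (eval rollouts are not
    counted against the budget). Logs one point roughly every `eval_every`
    real-environment samples."""
    log, env_samples, next_eval = [], 0, 0
    while env_samples < max_env_samples:
        env_samples += train_step()
        if env_samples >= next_eval:
            log.append((env_samples, evaluate()))
            next_eval += eval_every
    return np.array(log)
```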

The data was generated by running train_contextualized_MAL with different configs, notably different alpha values for mixing the agent's reward into the model's reward, and different seeds. You can see the details in the Jupyter notebook linked above.
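For reference, the sweep is essentially a grid over alpha and seed, and I understand the mixing as a convex combination of the two reward signals. The alpha values, number of seeds, and config keys below are placeholders; the real ones are in the notebook.

```python
from itertools import product

def mixed_model_reward(model_reward, policy_return, alpha):
    """Assumed form of the mixing: the model agent's reward is a convex
    combination of its own reward signal and the policy's cumulative reward."""
    return (1 - alpha) * model_reward + alpha * policy_return

alphas = [0.0, 0.25, 0.5, 1.0]   # placeholder values
seeds = range(4)                 # placeholder number of seeds
configs = [{"alpha": a, "seed": s} for a, s in product(alphas, seeds)]
# each config corresponds to one run of train_contextualized_MAL
```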

Angramme commented 3 months ago

Here is a graph with 40 sample points for sample counts below 2,000 (measured every 25 env samples):

(graph image)

Angramme commented 3 months ago

Added 4 more runs per alpha value to the first graph (so 16 runs in total):

(graph image)

Then I increased the smoothing factor:

(graph image)
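A sketch of the kind of smoothing I mean, assuming an exponential moving average over the plotted curve (the exact method and factor used for the plot are in the notebook):

```python
import numpy as np

def smooth(values, factor=0.9):
    """Exponential moving average; a higher factor gives a smoother curve."""
    out, running = [], values[0]
    for v in values:
        running = factor * running + (1 - factor) * v
        out.append(running)
    return np.array(out)
```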

Angramme commented 3 months ago

MAL + Noise

(graph image)

Here alpha is the constant mixing ratio of random trajectories, as we did before. 100% random performs best (see the sketch below).

It still does not reach the max reward with only 20,000 samples. Even less sample efficient than Yannic's...
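Concretely, I read the mixing as: a fraction alpha of the rollouts is collected by a uniformly random policy instead of the current one. A rough sketch under that assumption (classic Gym step API; helper names are illustrative only):

```python
import random

def collect_trajectory(policy, random_policy, env, alpha):
    """With probability `alpha`, collect the whole rollout with a random policy
    instead of the current policy (assumed meaning of 'mixing of random
    trajectories'); a classic Gym-style env is assumed."""
    actor = random_policy if random.random() < alpha else policy
    obs, done, traj = env.reset(), False, []
    while not done:
        action = actor(obs)
        next_obs, reward, done, _ = env.step(action)
        traj.append((obs, action, reward, next_obs))
        obs = next_obs
    return traj
```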

Angramme commented 3 months ago

MAL + Noise + Reward

(two graph images)

Angramme commented 3 months ago

MAL + Noise + Reward + tuned hyperparams

(three graph images)