Closed nilscrm closed 3 months ago
First analysis on 10 random world models. We pre-trained our contextualized policy and a world-model-specific PPO baseline for each model, 10000 training steps each.
Scores shown are average episode reward ± standard deviation.
| Random model | Env-specific baseline | Our contextualized policy |
|---|---|---|
| 0 | 3.695 ± 4.253 | 2.896 ± 3.141 |
| 1 | 2.989 ± 2.750 | 1.668 ± 1.308 |
| 2 | 3.782 ± 4.530 | 3.725 ± 4.146 |
| 3 | 1.345 ± 1.121 | 1.277 ± 0.735 |
| 4 | 1.273 ± 0.674 | 1.229 ± 0.581 |
| 5 | 2.221 ± 2.834 | 2.043 ± 1.778 |
| 6 | 0.927 ± 0.142 | 0.934 ± 0.152 |
| 7 | 1.448 ± 0.952 | 1.481 ± 1.176 |
| 8 | 4.765 ± 6.591 | 2.209 ± 3.618 |
| 9 | 1.160 ± 0.993 | 0.943 ± 0.080 |
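For reference, the (mean, standard deviation) tuples above come from rolling out each policy in the sampled world model and aggregating episode returns. A minimal sketch of that evaluation loop, using a hypothetical `ToyEnv` stand-in and a placeholder policy (the real code uses the project's world-model environments and trained PPO policies):

```python
import random
import statistics

class ToyEnv:
    """Hypothetical stand-in for a sampled world model.

    Episodes last 10 steps and emit random rewards; it only exists so the
    evaluation loop below is runnable on its own.
    """
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.steps += 1
        reward = self.rng.random()
        done = self.steps >= 10
        return 0.0, reward, done

def evaluate(policy, env, episodes=200):
    """Return (mean, std) of total episode reward, matching the
    (average reward, standard deviation) pairs reported above."""
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)

# Placeholder policy: always take action 0.
mean, std = evaluate(lambda obs: 0, ToyEnv(seed=0))
print(mean, std)
```

The same `evaluate` call, run once with the env-specific baseline and once with the contextualized policy on each of the 10 sampled world models, produces the table above.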
Will investigate the differences (especially models 1 and 8) further.
The policy now learns to play optimally in all world models.
Next step: evaluate the pre-trained policy on various environments and check whether it is optimal there as well.