luwo9 / bomberman_rl

Reinforcement learning for Bomberman: Machine Learning Essentials lecture 2024, final project

Strategies Overview #4

Open luwo9 opened 3 months ago

luwo9 commented 3 months ago

An overview of all techniques/strategies mentioned:

  1. Policy exploration/exploitation
    • $\epsilon$-greedy
    • Softmax
  2. Update Q function
    • SARSA
    • (k-step) temporal difference
  3. Update Q function based on batch
    • All games (only a few from the last game)
    • Look where Q differs from return
  4. Sparse Q
    • Exploit symmetries/ data augmentation
      • Equivalent state $\Rightarrow$ equivalent action
    • Feature engineering
    • Regression
      • Vector/Scalar output
      • OLS, RF, NN
      • SARSA etc.
  5. Reward shaping during training
    • $\tilde r_{t+1} = r_{t+1} + \psi(s_{t+1}) - \psi(s_t)$
    • First and last steps have no shaping.
    • Don't reward actions
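The two exploration policies from item 1 can be sketched as follows. This is a minimal illustration over a vector of Q-values; the function names and the `temperature` parameter are my own, not from the project code.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a uniformly random action,
    otherwise take the greedy (highest-Q) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature):
    """Sample an action with probability proportional to exp(Q / T).
    High T -> near-uniform exploration, low T -> near-greedy."""
    z = np.asarray(q_values) / temperature
    z -= z.max()  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=probs))
```

Softmax has the advantage that clearly bad actions (very low Q) are almost never explored, while $\epsilon$-greedy explores all non-greedy actions with equal probability.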
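For item 2, a tabular sketch of the SARSA update and a k-step TD target, assuming a Q-table indexed by (state, action); the hyperparameter values are placeholders.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA: move Q(s, a) toward r + gamma * Q(s', a'),
    where a' is the action the current policy actually took in s'."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def k_step_td_target(rewards, q_bootstrap, gamma=0.99):
    """k-step temporal-difference target: discounted sum of the next k
    rewards plus a bootstrapped gamma^k * Q(s_{t+k}, a_{t+k})."""
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    return g + gamma**len(rewards) * q_bootstrap
```

The same targets carry over to the regression setting in item 4: instead of updating a table entry, the regressor (OLS, random forest, or NN) is fit toward the SARSA or k-step target.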
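The symmetry augmentation in item 4 ("equivalent state $\Rightarrow$ equivalent action") could look like this for the four board rotations. The action encoding and the rotation direction are assumptions for illustration; the exact action permutation depends on how the project maps array axes to screen coordinates.

```python
import numpy as np

# Hypothetical action encoding: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT, 4=WAIT, 5=BOMB.
# One counterclockwise 90-degree rotation permutes the move actions
# (UP -> LEFT -> DOWN -> RIGHT -> UP); WAIT and BOMB are unchanged.
ROT90_ACTION = [3, 0, 1, 2, 4, 5]

def augment(board, action):
    """Yield the 4 rotations of a (board, action) training pair.
    Equivalent state => equivalent action, so one recorded experience
    becomes four training samples."""
    for k in range(4):
        yield np.rot90(board, k), action
        action = ROT90_ACTION[action]
```

Reflections double this again to 8 samples per experience, with a corresponding swap of the left/right (or up/down) actions.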
RuneRost commented 3 months ago

We discussed: rewarding events should be fine, as an event can be seen as a difference between the previous and current state (e.g. "bomb placed" is the difference between a bomb being present at coordinates x, y in state t and not being present at those coordinates in state t-1).
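That argument can be made concrete with the shaping formula from the list above (taken with discount $\gamma = 1$, as written there). The potential function `psi_bomb` and the dict-based toy state are my own illustration, not the project's state representation.

```python
def shaped_reward(r, s_prev, s_next, psi):
    """Potential-based shaping with gamma = 1:
    r~_{t+1} = r_{t+1} + psi(s_{t+1}) - psi(s_t)."""
    return r + psi(s_next) - psi(s_prev)

def psi_bomb(state):
    # Toy potential: 1 if our agent's bomb is on the board, else 0.
    return 1.0 if state.get("own_bomb") else 0.0

# Placing a bomb: no bomb in the previous state, a bomb in the next one.
r_tilde = shaped_reward(0.0, {"own_bomb": False}, {"own_bomb": True}, psi_bomb)
# r_tilde == 1.0: the "bomb placed" event emerges as a potential difference
```

So an event bonus is equivalent to a potential difference, which is exactly why rewarding events does not break the shaping scheme, whereas rewarding actions directly would.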