thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

What are the exact meanings of epoch and step_per_epoch? #151

Closed familyld closed 4 years ago

familyld commented 4 years ago

In supervised learning, the meanings of epoch and step_per_epoch (# of data samples/batch_size) are clear, but in reinforcement learning, it seems vague since we don't have a fixed dataset. Some algorithms are trained per episode, e.g., REINFORCE and some are trained per step, e.g., DQN. As a result, I am quite confused about the exact meanings of epoch and step_per_epoch here. Would you please better explain the connections between these concepts?

Besides, I am also interested in how you evaluate algorithms. Do you evaluate them every few learning steps or something else? I see the number of test episodes is 100 as one of the arguments but what is the evaluation frequency?

Thank you.

Trinkle23897 commented 4 years ago

> In supervised learning, the meanings of epoch and step_per_epoch (# of data samples/batch_size) are clear, but in reinforcement learning, it seems vague since we don't have a fixed dataset. Some algorithms are trained per episode, e.g., REINFORCE and some are trained per step, e.g., DQN. As a result, I am quite confused about the exact meanings of epoch and step_per_epoch here. Would you please better explain the connections between these concepts?

I think @danagi 's definition in #108 is quite clear: https://github.com/thu-ml/tianshou/pull/108#issuecomment-653433518

> A global step, or algorithmic step, consists of a collect step and an update step. In other words, an epoch with n steps, each containing m collect steps and t update steps, would collect n*m frames in total and update the network n*t times in total.

But in the current code, a step means one policy network update. You can refer to https://tianshou.readthedocs.io/en/latest/api/tianshou.trainer.html#tianshou.trainer.offpolicy_trainer
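To make the bookkeeping concrete, here is a minimal pure-Python sketch of how these quantities multiply out under the definition above. This is an illustration, not Tianshou's actual trainer code, and the function and parameter names are made up for the example:

```python
def count_frames_and_updates(max_epoch, step_per_epoch,
                             collect_per_step, update_per_step=1):
    """Illustrative accounting for an off-policy training loop.

    One "step" is one collect-then-update cycle: it gathers
    `collect_per_step` frames and performs `update_per_step`
    gradient updates.
    """
    frames_collected = 0
    gradient_updates = 0
    for _epoch in range(max_epoch):
        for _step in range(step_per_epoch):
            frames_collected += collect_per_step   # collect step
            gradient_updates += update_per_step    # update step
    return frames_collected, gradient_updates

# One epoch of n=1000 steps with m=10 frames per collect and t=1
# update per step collects n*m = 10000 frames and does n*t = 1000 updates.
frames, updates = count_frames_and_updates(1, 1000, 10)
```

So `step_per_epoch` bounds the number of update cycles per epoch, while `collect_per_step` controls how much new experience each cycle adds to the buffer.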

> Besides, I am also interested in how you evaluate algorithms. Do you evaluate them every few learning steps or something else? I see the number of test episodes is 100 as one of the arguments but what is the evaluation frequency?

Currently, it evaluates the algorithm when it reaches the given reward threshold in the training phase. In other words, if the collector collects some episodes with the current policy during training, and the mean reward of those episodes is above the threshold (this threshold is a prior), it then tests the policy over 100 episodes to check whether it is really above the reward threshold.
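A small sketch of that gating logic, again as an illustration rather than Tianshou's real code (the function names here are hypothetical):

```python
def maybe_evaluate(train_episode_rewards, reward_threshold,
                   run_test_episode, n_test_episodes=100):
    """Evaluate only when training rewards suggest the threshold is met.

    `run_test_episode` is a callable returning one test episode's
    return. Returns the mean test reward, or None if the training
    rewards did not reach the threshold.
    """
    mean_train = sum(train_episode_rewards) / len(train_episode_rewards)
    if mean_train < reward_threshold:
        return None  # keep training; no evaluation this round
    test_rewards = [run_test_episode() for _ in range(n_test_episodes)]
    return sum(test_rewards) / len(test_rewards)
```

So evaluation is not run on a fixed schedule; it is triggered by the training-phase reward crossing the threshold, and the 100 test episodes then give a lower-variance estimate of the policy's true performance.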

familyld commented 4 years ago

@Trinkle23897 Thanks for your quick response. I read the demo `test/discrete/test_dqn.py` and got a better understanding of it. `step_per_epoch` means the number of global steps per epoch, and `collect_per_step` means how many frames are collected in one global step. However, `update_per_step` is not explicitly passed from `test_dqn.py`; it appears in `offpolicy.py` as a parameter that defaults to 1.

Trinkle23897 commented 4 years ago

You are right. update_per_step is related to #99 and #108. I would like to change the code to align with this definition.

familyld commented 4 years ago

Cool.