sjtu-marl / malib

A parallel framework for population-based multi-agent reinforcement learning.
https://malib.io
MIT License

Ugly implementation of evaluation control #17

Closed by KornbergFresnel 2 years ago

KornbergFresnel commented 3 years ago

The changes from PR #12 added a new feature for policy evaluation, but it was left unpolished. The following places need cleanup:

settings https://github.com/sjtu-marl/malib/blob/9efda1b6c9db555f1360aa9759a717349c1fe32d/malib/settings.py#L95

parameter https://github.com/sjtu-marl/malib/blob/9efda1b6c9db555f1360aa9759a717349c1fe32d/malib/rollout/rollout_worker.py#L38

some related logic
https://github.com/sjtu-marl/malib/blob/9efda1b6c9db555f1360aa9759a717349c1fe32d/malib/rollout/base_worker.py#L162
https://github.com/sjtu-marl/malib/blob/9efda1b6c9db555f1360aa9759a717349c1fe32d/malib/rollout/base_worker.py#L104
https://github.com/sjtu-marl/malib/blob/9efda1b6c9db555f1360aa9759a717349c1fe32d/malib/rollout/base_worker.py#L315

zbzhu99 commented 3 years ago

Actually, I have some thoughts about the evaluation worker, which should work quite differently from a normal rollout worker.

First, the evaluation worker should not sample data in an asynchronous manner, since that wastes computing resources. Instead, it should wait for the training manager to send signals along with the policy parameters to be evaluated. It may also need to maintain a local parameter buffer to handle new signals that arrive while an evaluation is already running.
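
A minimal sketch of what such a signal-driven worker could look like; all names here (`EvaluationWorker`, `EvalRequest`, `submit`) are hypothetical, not malib's actual API:

```python
import queue
import time
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class EvalRequest:
    """Hypothetical payload pushed by the training manager."""

    epoch: int  # training epoch the parameters belong to
    parameters: Dict[str, Any]  # policy parameters to evaluate


class EvaluationWorker:
    """Signal-driven evaluation worker (sketch).

    Unlike a rollout worker, it stays idle until the training manager
    pushes an EvalRequest; requests arriving mid-evaluation queue up
    in a local buffer instead of being dropped.
    """

    def __init__(self) -> None:
        self._buffer: "queue.Queue[EvalRequest]" = queue.Queue()

    def submit(self, request: EvalRequest) -> None:
        # Called by the training manager; never blocks the trainer.
        self._buffer.put(request)

    def run(self) -> None:
        while True:
            # Block until a signal arrives; no speculative sampling,
            # so no compute is wasted between evaluations.
            request = self._buffer.get()
            metrics = self._evaluate(request.parameters)
            self._log(request.epoch, metrics)

    def _evaluate(self, parameters: Dict[str, Any]) -> Dict[str, float]:
        # Placeholder for the actual evaluation rollouts.
        time.sleep(0.1)
        return {"episode_reward_mean": 0.0}

    def _log(self, epoch: int, metrics: Dict[str, float]) -> None:
        # Key metrics by the *training* epoch, not a local counter.
        print(f"[eval @ training epoch {epoch}] {metrics}")
```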

Second, the signal from the training manager should include the corresponding (training) epoch number, so the evaluation worker can log the evaluation metrics against the training epoch rather than its own local sample epoch.
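
Building on the hypothetical sketch above, the manager side could stamp each request with its own epoch, so the logged metrics line up with the training timeline even if evaluation lags behind:

```python
import threading
import time

worker = EvaluationWorker()
threading.Thread(target=worker.run, daemon=True).start()

# The training manager tags each request with its own epoch number,
# so metrics are logged against the training timeline.
for epoch in range(3):
    params = {"policy_0": f"weights@epoch{epoch}"}  # stand-in for real tensors
    worker.submit(EvalRequest(epoch=epoch, parameters=params))

time.sleep(1)  # give the worker time to drain its buffer
```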

Thank you!