thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

BasePolicy requires replay buffer, restricting modularity. #898

Open llewynS opened 1 year ago

llewynS commented 1 year ago

Not really a coding issue. You assert in your paper that the codebase is highly modular, but the algorithms are very strongly tied to your implementation of the replay buffer: all of the static methods in the base policy require the replay buffer.

It would be non-trivial to decouple the algorithmic implementations from the replay buffer.
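For concreteness, here is a rough sketch of what that coupling looks like from the caller's side. The parameter names are approximate and may differ between versions, and target_q is a hypothetical callable, but the point stands: the return-computation helpers cannot run on a standalone batch, they need the ReplayBuffer and the indices it was sampled from.

# Sketch only (parameter names approximate, target_q is a placeholder callable):
# the static return-computation helpers require the buffer and sampled indices.
batch, indices = buffer.sample(256)
batch = BasePolicy.compute_nstep_return(
    batch, buffer, indices,      # buffer is a mandatory argument here
    target_q_fn=target_q,
    gamma=0.99, n_step=3,
)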

I would like to recommend renaming learn to _learn. learn without the underscore implies that it is a standalone method that can be called publicly, but, at least for DDPG and its child classes, learn requires everything that surrounds it in update:

if buffer is None:
    return {}
# sample a batch together with the buffer indices it came from
batch, indices = buffer.sample(sample_size)
self.updating = True
# pre-processing needs the buffer and indices, e.g. to compute n-step returns
batch = self.process_fn(batch, buffer, indices)
result = self.learn(batch, **kwargs)
# post-processing also needs the buffer, e.g. to update prioritized-replay weights
self.post_process_fn(batch, buffer, indices)
if self.lr_scheduler is not None:
    self.lr_scheduler.step()
self.updating = False
return result

The docstring of learn also does not mention this limitation.
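As a rough illustration (assuming a DDPG-style policy instance named policy and an already-filled buffer), anyone wanting to call learn on its own today has to reproduce the surrounding steps of update themselves, otherwise the targets that learn expects are simply missing from the batch:

# Minimal sketch of what a caller must replicate to invoke learn directly;
# `policy` and `buffer` are assumed to exist already.
batch, indices = buffer.sample(64)
batch = policy.process_fn(batch, buffer, indices)   # fills in the targets learn expects
result = policy.learn(batch)                        # only safe after process_fn
policy.post_process_fn(batch, buffer, indices)      # e.g. prioritized-replay weight update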

MischaPanch commented 1 year ago

The learn method will be changed significantly in the upcoming releases, tightening its interface and simplifying the way in which it is called by the trainer. The docstrings will change accordingly. Thanks for raising the question about modularity and independence from the buffer implementation! I will keep it in mind.