Confused about the Implementation of QREPS Agent

Hey, I am confused about how the dual is implemented in the QREPS agent. The code in question is in qreps_algorithm in the QREPS class, specifically:

# Calculate weights.
weights_td = self.eta() * td  # type: torch.Tensor
if weights_td.ndim == 1:
    weights_td = weights_td.unsqueeze(-1)
dual = 1 / self.eta() * torch.logsumexp(weights_td, dim=-1)
dual += (1 - self.gamma) * value.squeeze(-1)
return Loss(dual_loss=dual.mean(), td_error=td)

As far as I understand, a trailing dimension of size 1 is always added to weights_td, so the logsumexp over that dimension does nothing. Could you help me understand this? Or is there a difference between the version in the paper and the implementation here?
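To make the concern concrete, here is a minimal sketch in plain Python (standard library only, not the repository's torch code). The values of eta and td are made up for illustration, and the logsumexp helper is a hypothetical stand-in for torch.logsumexp. It compares reducing each sample over its own size-1 axis, as the snippet appears to do for a 1-D td, with reducing over the whole batch. Which of the two the paper intends is the open question here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


eta = 2.0               # hypothetical temperature, stands in for self.eta()
td = [0.5, -1.0, 0.25]  # hypothetical TD errors for a batch of 3 samples

# Mirrors the snippet for a 1-D td: after unsqueeze(-1) every sample sits
# alone on the last axis, so logsumexp reduces over a single element.
per_sample = [logsumexp([eta * t]) / eta for t in td]

# Alternative: reduce over the batch axis, so the samples are actually mixed.
over_batch = logsumexp([eta * t for t in td]) / eta

print(per_sample)  # the TD errors, up to floating-point rounding
print(over_batch)  # a single scalar, strictly above max(td) for this batch
```

With a single element, (1 / eta) * log(exp(eta * td)) collapses to td, so in the per-sample variant the dual reduces to the mean TD error plus the (1 - gamma) * value term, and eta drops out of that term entirely.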

The current implementation seems to perform well only with the fixed seed 0. With any other seed, learning breaks down completely.

I hope you can guide me in understanding this.
