lifelong-learning-systems / rlblocks

Reinforcement Learning Blocks for Researchers
MIT License
0 stars 0 forks source link

Task memory agent performs differently on first block #26

Open ginoperrotta opened 2 years ago

ginoperrotta commented 2 years ago

The current implementation of DqnTaskMemory results in different RNG than the other agents, even given the same seed values. This may not be important, but ensuring the agents have identical performance on the first block helps to highlight what the lifelong learning modules do (and don't do).

cash commented 2 years ago

Can you we close this as being fixed?

ginoperrotta commented 2 years ago

I don't think it has been changed since the issue was raised. It can be closed as unimportant if desired, but it is not fixed.

coreylowman commented 2 years ago

I think this is because when the first task starts we initialize a new sampler with a new random seed. https://github.com/lifelong-learning-systems/rlblocks/blob/master/minigrid_crossing_experiment/dqn.py#L217

        key = (task_name, variant_name)
        if key not in self.buffers:
            self.buffers[key] = TransitionDatasetWithMaxCapacity(self.max_size)
            self.samplers[key] = UniformRandomBatchSampler(
                self.buffers[key], seed=self.rng.bit_generator.random_raw()
            )

We do the same thing with the random seed for the normal agent, but i think this code snippet above pulls a 2nd number for the first sampler.

coreylowman commented 2 years ago

A potential fix if we really care about this is to use the already initialized self.replay_buffer, self.replay_sampler for the first task:

if key not in self.buffers:
    self.buffers[key] = ... if self.replay_buffer is not None else self.replay_buffer
    self.samplers[key] = ... if self.replay_sampler is not None else self.replay_sampler
    self.replay_buffer = None
    self.replay_sampler = None    

Edit: Not sure if the above snippet will actually work, self.replay_sampler is still populated with the meta sampler, so the logic may have to be slightly different than shown above