lifelong-learning-systems / rlblocks

Reinforcement Learning Blocks for Researchers

Request for comments: #38 Adding new replay buffer #41

Closed coreylowman closed 2 years ago

coreylowman commented 2 years ago

Wanted to get some comments on this before I start integrating it into our agents

coreylowman commented 2 years ago

@ginoperrotta I'll try integrating the coverage priority in a bit and post it here. The main thing is that it just needs access to the buffer to sample other transitions, right? It might make sense to give the priority function in DropLowestPriorityTransition access to the buffer... maybe something like

class PriorityFn:
    def get_priority(self, buffer: RLBuffer[T], item_id: int) -> float:
        ...

?
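
For concreteness, a coverage-style priority could then look roughly like this. This is only a sketch: get, sample, and observation are placeholder names for whatever accessors RLBuffer ends up exposing, not the real API.

import numpy as np


class CoveragePriority:
    """Sketch: priority = distance from a transition's observation to its
    nearest neighbor among a random sample of other buffered transitions."""

    def __init__(self, num_comparisons: int = 32) -> None:
        self.num_comparisons = num_comparisons

    def get_priority(self, buffer, item_id: int) -> float:
        item = buffer.get(item_id)                    # placeholder accessor
        others = buffer.sample(self.num_comparisons)  # placeholder accessor
        # Transitions far from everything else improve coverage, so they get
        # high priority and are dropped last.
        return min(
            float(np.linalg.norm(np.asarray(item.observation) - np.asarray(other.observation)))
            for other in others
        )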

coreylowman commented 2 years ago

Also, another comment/question for @ginoperrotta: I feel like we should somehow support both static priority functions (e.g. transition age) that are assigned when a transition is added and dynamic priority functions (e.g. coverage maximization, random) that are recalculated repeatedly. Thoughts?
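
To make the distinction concrete, something like the following is what I have in mind. Purely illustrative: on_insert is a hypothetical hook name, and nothing here is the actual API.

import random
import time


class InsertionTimePriority:
    """Static: stamped once when the transition is added and never recomputed.
    Older transitions keep the lowest priority, so dropping the lowest-priority
    item behaves like FIFO eviction."""

    def on_insert(self, buffer, item_id: int) -> float:  # hypothetical hook
        return time.monotonic()


class RandomPriority:
    """Dynamic: recomputed every time the buffer asks, so the eviction order
    shifts between calls (reservoir-like behavior)."""

    def get_priority(self, buffer, item_id: int) -> float:
        return random.random()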

ginoperrotta commented 2 years ago

Writing down a few thoughts that we covered in our meeting:

Random and coverage priorities are computed when a sample is added to the buffer, just like the rest of the methods. They could be re-computed dynamically, but computing them once saves some computation with little drawback.

The updated buffer system may be more flexible, but it is also more confusing. We should also implement the typical cases (FIFO, reservoir) as their own ready-to-go "blocks."
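
From the user's perspective, the FIFO case might end up looking something like this. This is a self-contained sketch of the behavior only; the real block would be built on RLBuffer and DropLowestPriorityTransition rather than a bare deque.

import random
from collections import deque


class FIFOReplayBuffer:
    """Sketch of a ready-to-go FIFO block: a bounded buffer that always evicts
    the oldest transition and samples uniformly."""

    def __init__(self, max_size: int) -> None:
        self._items = deque(maxlen=max_size)  # deque drops the oldest item for us

    def add(self, transition) -> None:
        self._items.append(transition)

    def sample_batch(self, batch_size: int):
        return random.sample(list(self._items), min(batch_size, len(self._items)))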

ginoperrotta commented 2 years ago

I pushed the surprise metric replay to the working design branch you made. https://github.com/lifelong-learning-systems/rlblocks/commit/a561352dc2ee7829d6fbbda06e0891d3cc1a06ee

It is a very wasteful implementation, since all of the priority calculations require extra network calls rather than updating priorities when the samples already pass through the network. Does the refactor here permit a better version of this?

That isn't a make-or-break factor on this PR, but it has been my litmus test on the discussed changes so far.

coreylowman commented 2 years ago

@ginoperrotta

This still seems to be using the old PriorityFn instead of the new replay_v3.PriorityAssigner. I don't expect much difference between the two implementations, though.

In order for it to not be wasteful, I think the output of the network would have to be added as data to the transitions. Then you could do transition.q_values instead of invoking the network. However, I think that would add a lot of other complexity, as we've noted with custom transition subclasses.
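
Roughly, that custom-subclass route would look like the following. The field names are illustrative, and Transition here is a stand-in rather than the library's actual class.

from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Transition:
    """Stand-in for the library's transition type."""
    observation: np.ndarray
    action: int
    reward: float
    next_observation: np.ndarray
    done: bool


@dataclass
class TransitionWithQValues(Transition):
    """Carries the network output alongside the data so a priority function can
    read transition.q_values instead of re-running the network."""
    q_values: Optional[np.ndarray] = None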

coreylowman commented 2 years ago

I also missed that you are updating priorities after updating models. I think with the new RLBuffer this would be done by adding a new field to the transition to hold the priority. Then you could call the update_items method, but you'll have to override the default update_model and set the criterion reduction to none:

        self.dqn_loss_fn = DoubleQLoss(
-            self.model, self.target_model, criterion=nn.SmoothL1Loss()
+            self.model, self.target_model, criterion=nn.SmoothL1Loss(reduction="none")
        )

    def update_model(self, n_iter: int = 4):
        for _ in range(n_iter * self.num_envs):
            batch = collate(self.replay_sampler.sample_batch(batch_size=64))
            batch.reward += self.reward_signal_per_step
-            loss = self.compute_loss(batch)
+            unreduced_loss = self.compute_loss(batch)
+            buffer.update_items(batch.ids, {"surprise": unreduced_loss.abs()})
            self.optimizer.zero_grad()
-            loss.backward()
+            unreduced_loss.mean().backward()
            self.optimizer.step()

I think right now the batch.ids field is not populated either... may need to figure out changes for that

coreylowman commented 2 years ago

Adding a field to the transition for surprise seems way more complicated than is needed in the current version. It would enable reusing that surprise field for another component (e.g. a prioritized experience sampler could use the surprise field), though. I'm skeptical that that's a price we're willing to pay.
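
For what it's worth, the reuse would look something like this. Sketch only: the surprise field follows the naming in the diff above, but all_items is a hypothetical accessor, not something the buffer currently provides.

import numpy as np


class SurpriseWeightedSampler:
    """Samples transitions with probability proportional to the surprise field
    written by update_items in the loss loop above."""

    def __init__(self, buffer) -> None:
        self.buffer = buffer

    def sample_batch(self, batch_size: int):
        items = self.buffer.all_items()  # hypothetical accessor
        # Treat surprise as a scalar weight; clamp to avoid zero-probability items.
        weights = np.array([max(float(item.surprise), 1e-6) for item in items])
        probs = weights / weights.sum()
        idx = np.random.choice(len(items), size=batch_size, p=probs)
        return [items[i] for i in idx]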

coreylowman commented 2 years ago

I think the more we discuss this, the more I feel like we should move toward users just defining their own buffers lol. There are so many one-off cases and slight variations that supporting all of them seems too complex to be user-friendly or easy to maintain.

cash commented 2 years ago

In a situation like this, I tend to provide a built-in buffer that covers 60% or 80% of use cases while making it clear how users can define their own.

ginoperrotta commented 2 years ago

Yeah, I wrote the inefficient version of surprise priority in the old format to show it works. I like your suggested path to the better implementation.

So far, do you feel that the refactor is adding capability or simplifying use over the previous version?

coreylowman commented 2 years ago

I think this refactor (maybe) simplifies the implementation of some of the components, and makes future components easier to implement (surprise via updating fields, prioritized sampling).

I don't think it makes it easier for the user unless we add that additional layer you brought up yesterday to provide some general buffers.

coreylowman commented 2 years ago

@ginoperrotta I think cash makes a good point, and now I'm wondering: does the existing buffer implementation in master already cover 60-80% of use cases? Do we need to redesign this if we are going for that coverage?

Maybe all we need to do is show how a user can define their own buffer if the one in master doesn't meet their needs.
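
Something like the snippet below in the docs might be all that's needed. Purely illustrative: the add / sample_batch method names are just a guess at whatever contract the agents end up expecting, not a confirmed interface.

import random


class MyReservoirBuffer:
    """Example a user could copy and adapt: fixed-capacity reservoir sampling,
    keeping each transition seen so far with equal probability."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._items = []
        self._seen = 0

    def add(self, transition) -> None:
        self._seen += 1
        if len(self._items) < self.max_size:
            self._items.append(transition)
        else:
            # Standard reservoir sampling: replace a random slot with
            # probability max_size / number seen so far.
            j = random.randrange(self._seen)
            if j < self.max_size:
                self._items[j] = transition

    def sample_batch(self, batch_size: int):
        return random.sample(self._items, min(batch_size, len(self._items)))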