pytorch / rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
https://pytorch.org/rl

[Feature Request] Ability to skip transitions based on masks #1079

Open Mayankm96 opened 1 year ago

Mayankm96 commented 1 year ago

Motivation

Certain simulators (such as Isaac Gym and Isaac Sim) create several environment instances as part of the same stage. Unlike CPU-based vectorization, where each environment instance has its own stepping, in GPU-based parallelized environments the stepping is common. This becomes an issue when certain environments need dummy steps to let the scene come to rest. For instance, we often spawn deformable bodies (such as cloths) from a height and let them fall onto the ground and settle before we start learning.

There isn't a clear mechanism for such dummy steps in the simulators, and this functionality probably won't be added since it ties deeply into the whole parallelization philosophy (i.e., all environments live in one big block of GPU data). Thus, this issue would be best handled by the learning framework itself.

Solution

A possible way to deal with this would be the ability to mask transition tuples. For instance, if out of N environments the indices (2, 5, 10) were last reset, we enable their masking (an additional quantity in the "infos" dictionary) for a certain number of time steps. If the mask evaluates to true, the trajectory or replay buffer doesn't add the transitions for those indices to memory.
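For concreteness, a minimal sketch of what this could look like on the consumer side, assuming the environment writes a boolean "mask" entry alongside the transition data (the key name and shapes here are illustrative, not an existing torchrl convention):

```python
import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, ReplayBuffer

buffer = ReplayBuffer(storage=LazyTensorStorage(max_size=100_000))

# Fake batch of N transitions from one synchronous step of the vectorized env;
# "mask" is True for the post-reset settling steps we want to skip.
N = 16
batch = TensorDict(
    {
        "observation": torch.randn(N, 4),
        "action": torch.randn(N, 2),
        "mask": torch.zeros(N, dtype=torch.bool),
    },
    batch_size=[N],
)
batch["mask"][[2, 5, 10]] = True  # indices that were last reset

# Only write the transitions whose mask is False.
buffer.extend(batch[~batch["mask"]])
```

The same filtering could of course live inside the buffer itself, which is what the transform idea discussed below amounts to.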

Alternatives

I couldn't really think of any other alternatives here but would be interested to hear your thoughts on this.


vmoens commented 1 year ago

Thanks for posting this issue; it's certainly worth a discussion. @matteobettini @btx0424, thoughts?

So if I understand correctly, the main issue is that some steps during the execution of simulations with multiple envs may not be worth storing in the replay buffer. Is there anything more general than that? In the sense: one solution here could be to design a transform for the replay buffer that simply filters out data based on a boolean mask stored in the output tensordict.

If it's more general than that we can also look for a more consistent change in the API.
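For illustration, a rough sketch of such a filtering transform, assuming that replay-buffer transforms have their inv() pass applied to data given to extend()/add() and forward() applied to sampled data (as torchrl buffer transforms are documented to behave); the MaskFilter name and the "mask" key are made up for the example:

```python
from tensordict import TensorDictBase
from torchrl.data import LazyTensorStorage, ReplayBuffer
from torchrl.envs.transforms import Transform


class MaskFilter(Transform):
    """Drop entries whose boolean mask is True before they reach the storage."""

    def __init__(self, mask_key: str = "mask"):
        super().__init__(in_keys=[], out_keys=[])
        self.mask_key = mask_key

    def _inv_call(self, tensordict: TensorDictBase) -> TensorDictBase:
        # inv() is called on the data passed to extend()/add().
        mask = tensordict.get(self.mask_key)
        return tensordict[~mask]


buffer = ReplayBuffer(
    storage=LazyTensorStorage(max_size=100_000),
    transform=MaskFilter(mask_key="mask"),
)
```

With something like this, the masked rows passed to buffer.extend would simply never be written, and sampling would remain untouched.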

matteobettini commented 1 year ago

Yeah, it seems like the env should output this mask key, and the buffer could have a transform that masks out data given a str key.

PS @vmoens another example of a util key output by the env which will go in obs_spec.

btx0424 commented 1 year ago

I agree that if the amount of dummy steps is not significant, filtering them out afterwards is a reasonable solution.

But more generally I think this is related to the structure of the buffers in torchrl.

The first type is the rollout buffer allocated by a Collector; it has fixed batch (e.g., number of envs) and time dimensions. Since the underlying environments are stepped synchronously, no memory saving is possible here, with or without the mask.

The second type is the "real" replay buffers. If we can maintain the temporal structure of the data collected in the rollout buffer and perform filtering upon insertion, it is possible to store only "valid" steps.

IIUC, torchrl's replay buffers currently do not keep an [envs, time_steps] structure. Random sub-sequence sampling is achieved by reshaping before insertion and applying a RandomCrop to the sampled data. This is somewhat unnatural because the trajectories become segments, and no sequence spanning multiple segments can be sampled.
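To make that pattern concrete, roughly (a sketch; the exact transform name and arguments, e.g. RandomCropTensorDict(sub_seq_len=...), may differ slightly across torchrl versions):

```python
import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, TensorDictReplayBuffer
from torchrl.envs.transforms import RandomCropTensorDict

N, T, L = 8, 64, 16  # envs, collector time dim, desired sub-sequence length

# Each [T]-long row from the collector becomes one stored segment; a random
# L-long window is cropped out of every sampled segment.
buffer = TensorDictReplayBuffer(
    storage=LazyTensorStorage(max_size=1_000),
    transform=RandomCropTensorDict(sub_seq_len=L),
)

rollout = TensorDict(
    {"observation": torch.randn(N, T, 4), "action": torch.randn(N, T, 2)},
    batch_size=[N, T],
)
buffer.extend(rollout)     # stored as N independent segments of length T
sample = buffer.sample(4)  # 4 segments, each cropped to length L
```

Once the rollout is split into segments like this, the relationship between segments is lost, which is why a sub-sequence can never cross a segment boundary.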

So I think it might be helpful to have a type of RB (besides the existing ones) that keeps [envs, time_steps] so that

  1. we have better control over what to store;
  2. sub-sequence sampling is more convenient.

vmoens commented 1 year ago

I'm not 100% sure I understand everything you're saying here, so I'll do my best to lay out what you can currently do and what you can't:

Storing trajectories

What you can do is store trajectories that you pull from the collector. The user just needs to make sure that

What you cannot do

One thing that is currently not possible is to "extend" an existing trajectory in the buffer. If I call

buffer.extend(data0)
buffer.extend(data1)

and data1 is just a bunch of data that follows data0, there is no way (currently) of sampling a sub-trajectory that spans data0 and data1. Is that something we want? I'm asking because IMO one could just ask the replay buffer for "longer" trajectories if this is needed.

Given this, what I'd like to understand is: what is missing for better control over what is being stored, and what would make sequence sub-sampling more convenient?

btx0424 commented 1 year ago

Apologies for the unclear description.

Yes, I meant extending the existing trajectories with data that follows in subsequent time steps.

  1. I think what Mayank wants is to extend the buffer with only the valid samples, so that the buffer is more memory-efficient. Say the data returned by iterating the collector has shape [N_env, T], but some of the envs have dummy steps, so the valid data do not all have a consistent length T. The solution I can think of here is to allow extending each row of the buffer independently.
  2. Yes, I agree that just storing and asking for longer trajectories addresses the issue to some extent. But there are still "barriers" that a sub-sequence cannot cross, slightly limiting the samples we can generate (e.g., if an episode has length 2T, we cannot generate a sub-sequence from T-k to T+(L-k), where L is the desired sub-sequence length).

These are really not big problems. It's just that when using a GPU-based simulator such as Isaac Sim we generally want to keep everything on the GPU, so with more limited memory we may want to store only valid samples and generate as many samples from them as possible.

vmoens commented 1 year ago

If I understand the setting, we have N environments and only N (we know that in advance). We want to store data for each env such that we can sample sub-trajectories of any length. Currently we store the data as [M * T], where T is predefined (the time dimension of the collector) and M is the total number of items we can sample from. What we want here is to build a buffer of size [N, L] where L >> T. When we call buffer.extend, we would expect as input a tensordict of size [N, T] and write those T items along the second dimension for each of the N rows. Is that right?

It's not covered yet, but I guess it can be done. Currently we consider dimension 0 as the one over which we circle, but we could set that to 1. I'm not sure how to make this work with prioritized sampling though; it's already a mess with trajectories...
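For concreteness, here is a hypothetical sketch (plain tensors, not an existing torchrl API) of what circling over the time dimension of an [N, L] buffer could look like; prioritized sampling is ignored and only an observation field is stored:

```python
import torch


class TimeDimBuffer:
    """Hypothetical [N, L] buffer: extend() writes an [N, T] block along the
    time dimension and wraps around once the capacity L is reached."""

    def __init__(self, n_envs: int, capacity: int, obs_dim: int):
        self.obs = torch.zeros(n_envs, capacity, obs_dim)
        self.capacity = capacity
        self.cursor = 0   # next write position along the time dimension
        self.size = 0     # number of valid time steps currently stored

    def extend(self, obs_block: torch.Tensor) -> None:
        # obs_block: [N, T, obs_dim], same T for every env in a given call.
        t = obs_block.shape[1]
        idx = (self.cursor + torch.arange(t)) % self.capacity
        self.obs[:, idx] = obs_block
        self.cursor = (self.cursor + t) % self.capacity
        self.size = min(self.size + t, self.capacity)

    def sample_subseq(self, batch_size: int, length: int) -> torch.Tensor:
        # Contiguous sub-sequences that can span several extend() calls.
        # For brevity this ignores the seam at the write cursor once the
        # buffer has wrapped around.
        env_idx = torch.randint(self.obs.shape[0], (batch_size, 1))
        start = torch.randint(self.size - length + 1, (batch_size, 1))
        time_idx = start + torch.arange(length)
        return self.obs[env_idx, time_idx]  # [batch_size, length, obs_dim]
```

Extending each env's row independently (e.g., to skip masked settling steps) would then just mean keeping a per-env cursor instead of a single shared one.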