pytorch / rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
https://pytorch.org/rl
MIT License

[Feature Request] Examples Suggestion #861

Closed smorad closed 7 months ago

smorad commented 1 year ago

Motivation

This is somewhat related to https://github.com/pytorch/rl/issues/849. I realize TorchRL is quite new and I expect the documentation and examples will improve over time, but I believe the current examples are a bit difficult to follow for newcomers like myself. It is very tempting to briefly scan the TorchRL examples and think "this is too complex, I'll go use CleanRL/StableBaselines". The current examples seem to fall into two buckets:

  1. Abstracting everything
  2. Everything from scratch

For type 1 examples, there seems to be a lot of extra setup/config that clutters the example (example). IMO, while things like observation normalization and reward scaling are useful, I don't think they belong in the simple examples; perhaps they could be separate entries in some "advanced examples" category. Many of the type 1 examples seem like they were intended as configurable algorithm implementations, which makes them abstract and harder to follow. I think it might be better to call these "algorithm implementations" instead of "examples".

On the other extreme, we have type 2 examples (example) which are bare metal. I find these more useful than type 1, but they do not demonstrate all the cool things in TorchRL. For example, we explicitly compute the loss instead of drawing from your large set of loss function implementations!

Solution

TL;DR: Env normalization and configuration code should be removed from the basic examples. Rename torchrl/examples to algorithm_implementations. Add a third class of examples that utilize TorchRL methods while still giving the user control of the training loop.

I think that some of the researchers using TorchRL would benefit from a third category of examples, somewhere in between the first and second categories. I think these examples should make use of all the useful functions of TorchRL while still leaving the train loop up to the user. I personally would also prefer if these called constructors instead of using helper functions (e.g. TensorDictModule vs make_dqn_model). That way, we can inherit to implement our own custom model/loss/buffer/etc. E.g.

env = ...
# q-network that writes an "action_value" entry into the tensordict
model = TensorDictModule(
  nn.LazyLinear(env.action_spec.shape[-1]),
  in_keys=["observation"], out_keys=["action_value"],
)
train_policy = QValueActor(model, spec=env.action_spec)
explore_policy = EGreedyWrapper(train_policy)  # shares train_policy's weights
loss_module = DQNLoss(value_network=train_policy)
opt = ...
buffer = ...

for epoch in range(100):
  # Collect
  data = env.rollout(..., policy=explore_policy)
  buffer.extend(data)

  # Train
  train_data = buffer.sample()
  loss = loss_module(train_data)["loss"]
  loss.backward()
  logger.log(train_data)
  opt.step()
  opt.zero_grad()


cc @matteobettini as I think he has some thoughts too.

vmoens commented 1 year ago

Thanks for this! I really appreciate the feedback. Do you think a "getting started" section in the doc with tutorials that you could accomplish in - say - 2h would satisfy this?

vmoens commented 1 year ago

Also: I'm happy to rename the examples, and/or to move them to a separate repo. @giadefa @albertbou92 @BY571 might have some thoughts to share about this.

I was thinking of doing a getting started like this:

Getting started with torchrl

  • Creating an environment # a concrete example of loading an environment from one of the supported libs, + transforms + parallel execution
  • Writing your model # how to use TensorDictModule or SafeModule, how to write a probability distribution (?)
  • Collecting and storing data # how to use the data collectors and replay buffers
  • Writing your loss module
  • Recording and logging
  • Putting things together

Each "chapter" would be a separate notebook that can be read through in 10-20 mins. Must be executed on CPU.

Questions

  • Should this "getting started" be oriented towards a specific goal, like "at the end of this getting started, you should be able to code DDPG" or smth like that? Getting started would just show how to code one specific algo.
  • Other option: each "chapter" of getting started is independent.
  • Other option: getting started goes over multiple routes (ie pixels vs state-based, etc).
  • I would see this as super "concrete", ie getting a hands-on knowledge of a specific problem. If you feel like starting with more general knowledge like "what is tensordict?" might be useful I'm open to it, but I feel this kind of stuff might be like explaining what is electricity to someone who wants to learn how to use a toaster...

vmoens commented 1 year ago

cc @NKay5

smorad commented 1 year ago

Perhaps "getting started" should be even simpler for pea-brains like myself. Here's the one from coax, which I really like

import os
import gym
import jax
import coax
import haiku as hk
import jax.numpy as jnp
from optax import adam

# the name of this script
name = 'dqn'

# env with preprocessing
env = gym.make('PongNoFrameskip-v4', render_mode='rgb_array')  # AtariPreprocessing will do frame skipping
env = gym.wrappers.AtariPreprocessing(env)
env = coax.wrappers.FrameStacking(env, num_frames=3)
env = gym.wrappers.TimeLimit(env, max_episode_steps=108000 // 3)
env = coax.wrappers.TrainMonitor(env, name=name, tensorboard_dir=f"./data/tensorboard/{name}")

def func(S, is_training):
    """ type-2 q-function: s -> q(s,.) """
    seq = hk.Sequential((
        coax.utils.diff_transform,
        hk.Conv2D(16, kernel_shape=8, stride=4), jax.nn.relu,
        hk.Conv2D(32, kernel_shape=4, stride=2), jax.nn.relu,
        hk.Flatten(),
        hk.Linear(256), jax.nn.relu,
        hk.Linear(env.action_space.n, w_init=jnp.zeros),
    ))
    X = jnp.stack(S, axis=-1) / 255.  # stack frames
    return seq(X)

# function approximator
q = coax.Q(func, env)
pi = coax.EpsilonGreedy(q, epsilon=1.)

# target network
q_targ = q.copy()

# updater
qlearning = coax.td_learning.QLearning(q, q_targ=q_targ, optimizer=adam(3e-4))

# reward tracer and replay buffer
tracer = coax.reward_tracing.NStep(n=1, gamma=0.99)
buffer = coax.experience_replay.SimpleReplayBuffer(capacity=1000000)

# DQN exploration schedule (stepwise linear annealing)
epsilon = coax.utils.StepwiseLinearFunction((0, 1), (1000000, 0.1), (2000000, 0.01))

while env.T < 3000000:
    s, info = env.reset()
    pi.epsilon = epsilon(env.T)

    for t in range(env.spec.max_episode_steps):
        a = pi(s)
        s_next, r, done, truncated, info = env.step(a)

        # trace rewards and add transition to replay buffer
        tracer.add(s, a, r, done or truncated)
        while tracer:
            buffer.add(tracer.pop())

        # learn
        if len(buffer) > 50000:  # buffer warm-up
            metrics = qlearning.update(buffer.sample(batch_size=32))
            env.record_metrics(metrics)

        if env.T % 10000 == 0:
            q_targ.soft_update(q, tau=1)

        if done or truncated:
            break

        s = s_next

    # generate an animated GIF to see what's going on
    if env.period(name='generate_gif', T_period=10000) and env.T > 50000:
        T = env.T - env.T % 10000  # round to 10000s
        coax.utils.generate_gif(
            env=env, policy=pi, resize_to=(320, 420),
            filepath=f"./data/gifs/{name}/T{T:08d}.gif")

This "getting started" will get them hooked because it is concise, it is clear what each line is doing, and it leaves room for the user to modify the script as they see fit.

I find learning by example works well for me, not sure about others. I think your proposed "getting started" would probably read best as separate chapters after the user has looked at a few examples. They come after the examples, but before you get to reading the autogenerated API docs.

Creating an environment # a concrete example of loading an environment from one of the supported libs, + transforms + parallel execution

Separate chapters/sections on envs/specs, transforms, rollouts/collectors, replay buffers, etc. These pieces are pretty modular and algorithm-agnostic, so I think it is ok to separate these. Then perhaps a chapter on "advanced sampling" where you put everything together and demonstrate all the bells and whistles in a single script.
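For concreteness, a rough sketch of what I'd hope this env chapter boils down to (env name, worker count and transform purely illustrative, and it assumes the Gym backend is installed):

from torchrl.envs import Compose, ParallelEnv, StepCounter, TransformedEnv
from torchrl.envs.libs.gym import GymEnv

def make_env():
    # a single transformed environment
    return TransformedEnv(GymEnv("Pendulum-v1"), Compose(StepCounter()))

env = ParallelEnv(4, make_env)        # 4 copies stepping in parallel
rollout = env.rollout(max_steps=100)  # a TensorDict of stacked transitions
print(rollout)                        # inspect batch_size, keys and device
env.close()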

Writing your model # how to use TensorDictModule or SafeModule, how to write a probability distribution (?)

These sound good as separate chapters.
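For the model chapter, a sketch of the kind of snippet I'd hope it centers on (sizes and key names made up, TanhNormal just as an example distribution):

import torch.nn as nn
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torchrl.modules import ProbabilisticActor, TanhNormal

# the network emits 2 * action_dim values, split into loc/scale by NormalParamExtractor
net = nn.Sequential(nn.LazyLinear(64), nn.Tanh(), nn.LazyLinear(2), NormalParamExtractor())
module = TensorDictModule(net, in_keys=["observation"], out_keys=["loc", "scale"])
policy = ProbabilisticActor(
    module,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    return_log_prob=True,
)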

Collecting and storing data # how to use the data collectors and replay buffers

Yes, this could be folded in with the env stuff
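Something like this would already show most of it, I think (a sketch re-using the env and policy from the snippets above; sizes arbitrary):

from torchrl.collectors import SyncDataCollector
from torchrl.data import LazyTensorStorage, ReplayBuffer

collector = SyncDataCollector(env, policy, frames_per_batch=256, total_frames=10_000)
buffer = ReplayBuffer(storage=LazyTensorStorage(10_000))

for batch in collector:             # each batch is a TensorDict of 256 frames
    buffer.extend(batch.reshape(-1))
    sample = buffer.sample(64)      # a 64-frame TensorDict ready for the optimizer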

writing your loss module

Yes! Separate chapter.
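Here too a short sketch would go a long way, e.g. PPO-flavoured, re-using the policy/collector names from the snippets above (hyperparameters made up):

import torch
import torch.nn as nn
from torchrl.modules import ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

value_module = ValueOperator(
    nn.Sequential(nn.LazyLinear(64), nn.Tanh(), nn.LazyLinear(1)),
    in_keys=["observation"],
)
advantage = GAE(gamma=0.99, lmbda=0.95, value_network=value_module)
loss_module = ClipPPOLoss(actor=policy, critic=value_module)
optim = torch.optim.Adam(loss_module.parameters(), lr=3e-4)

for batch in collector:
    advantage(batch)                          # writes "advantage"/"value_target" entries
    loss_td = loss_module(batch.reshape(-1))  # a TensorDict of named loss terms
    loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]
    loss.backward()
    optim.step()
    optim.zero_grad()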

Recording and logging

Yes! Separate chapter.
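Even a couple of lines would do here (logger backend and names purely illustrative):

from torchrl.record.loggers import CSVLogger

logger = CSVLogger(exp_name="getting_started", log_dir="./logs")
for i, batch in enumerate(collector):
    logger.log_scalar("reward", batch["next", "reward"].mean().item(), step=i)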

Putting things together

If things are modular, this shouldn't really be necessary after the user has read through the "getting started".

Each "chapter" would be a separate notebook that can be read through in 10-20 mins. Must be executed on CPU.

I think if you make these separate from "getting started" examples, you can go a bit more complex. You can write a paragraph or two about what exactly a collector is. You can have 3-4 separate code blocks that are different types of collectors, or for different situations (e.g. multiagent).

should this "getting started" be oriented towards a specific goal, like "at the end of this getting started, you should be able to code DDPG" or smth like that. Getting started would just show how to code one specific algo.

My proposed "getting started" from above has a goal: Write the simplest-possible agent being trained on a single specific environment using a for-loop, giving the users the high-level structure of what their code should look like. I defer to the coax getting started which I find quite readable and concise. After reading a DQN, DDPG, and PPO example the user should have a rudimentary idea of how collectors, replay buffers, TensorDictModules, etc work.

Other option: each "chapter" of getting started is independent (IMO this makes it harder to follow)

I think once the user reads the "getting started" and gleans some structure/naming conventions (e.g. collector == rollout worker set), they are better equipped to select and read specific chapters. I imagine a typical user might leave ~80% of the "getting started" example as is, and read whichever chapter they are interested in messing with (e.g. collectors, models, distributions, etc).

Other option: getting started goes over multiple routes (ie pixels vs state-base, etc)

For "getting started", maybe have one algorithm use pixels? Otherwise I think state should be sufficient. Let them grasp the main idea, then throw all the extra tools at them. I don't think you want to overload them with everything TorchRL can do yet.

I would see this as super "concrete", ie getting a hands-on knowledge of a specific problem. If you feel like starting with more general knowledge like "what is tensordict?" might be useful I'm open to it, but I feel this kind of stuff might be like explaining what is electricity to someone who wants to learn how to use a toaster...

Yeah, you need to set some prerequisites. Perhaps assume users have read Sutton & Barto and know what Q-learning and policy gradients are, as well as having some familiarity with nn.Module. To me, TensorDict seems to be the most straightforward part of TorchRL. I don't think you need to spend too much time on it: just cover the basics and link to the TensorDict library docs for the rest.
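Something as small as this is probably all the TensorDict intro needs (a rough sketch, shapes made up):

import torch
from tensordict import TensorDict

td = TensorDict(
    {"observation": torch.randn(4, 3), "action": torch.zeros(4, 1)},
    batch_size=[4],
)
td["reward"] = torch.ones(4, 1)  # set a new entry
obs = td["observation"]          # read an entry
first_two = td[:2]               # index along the batch dimension like a tensor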

matteobettini commented 1 year ago

Yes, from me it is a +1 to @smorad's comment. I think what we currently miss is a short notebook (1 page or so) that a user who has never seen torchrl before could glance at as a first approach and understand the most important parts of the sampling-training pipeline in a few minutes.

Something close to the example that Steven proposed, where you have 2 lines initialising an env, a parallel env and a step_counter wrapper, a few lines to make the actor and critic model parts, and then the loop with the loss (which can be swapped out for other losses, e.g. PPO for DDPG).

I would also prefer bare classes instead of helper functions, as those make you wonder what logic they are actually wrapping.

vmoens commented 1 year ago

We have something like that with DQN and DDPG:

https://pytorch.org/rl/tutorials/coding_ddpg.html

https://pytorch.org/rl/tutorials/coding_dqn.html

Are you saying they're too elaborate?

matteobettini commented 1 year ago

These two are the first notebooks that everyone from our lab looked at when we first got interested in the library, and to us they seemed a little daunting (because of their length and level of detail).

There are in-depth sections for each component, but as the very first approach something short that can be read in 2 mins, like the coax one, might be better.

Those two are still helpful to have for further understanding.

vmoens commented 1 year ago

Got it!

Let's identify a simple problem that can be nailed in a few lines of code then!

matteobettini commented 1 year ago

Just boiling down something like those two to the bone might be an option, where the simplest env is chosen with a simple transform and all hyperparameter and training choices are already made.

Yeah, for example a bare-bones CartPole PPO not using pixels.

vmoens commented 1 year ago

Cool. Another thing about these two tutorials: I was told that "loss modules are fine but for those who don't know anything about RL they look like black boxes", so I unpacked their content in the script. Now, if what we want is a view of what the code base can do, I might use them, but at the cost of having this sort of black box.

matteobettini commented 1 year ago

In this bare-bones version my personal opinion would be to use them, but maybe nothing as abstract as make_trainer: https://github.com/pytorch/rl/blob/main/examples/dqn/dqn.py#L153

vmoens commented 1 year ago

Good! Will do! Thanks for the suggestions, guys! Wonderful to see such actionable feedback; it really improves things.

vmoens commented 1 year ago

@matteobettini @viktor-ktorvi @smorad @shagunsodhani @NKay5

I have written a PPO tutorial

https://deploy-preview-2156--pytorch-tutorials-preview.netlify.app/intermediate/reinforcement_ppo.html

Wdyt? Is it simple enough? It's roughly the same length as the DQN tutorial. It is going to be the first of the series of tutorials.

matteobettini commented 1 year ago

Looks very good to me! This will help a lot

I wonder if alongside this we could provide a TL;DR version of the same tutorial, e.g.:

# Env
env = TransformedEnv(
    GymEnv(
        "InvertedDoublePendulum-v4",
        device=device,
        frame_skip=frame_skip,
    ),
    Compose(
        # normalize observations
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(in_keys=["observation"]),
        StepCounter(),
    ),
)

# Policy
policy_module = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
        NormalParamExtractor(),
    ),
    in_keys=["observation"],
    out_keys=["loc", "scale"],
)

# Value function
value_module = ValueOperator(
    module=nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(1, device=device),
    ),
    in_keys=["observation"],
)

# Collector
collector = SyncDataCollector(
    env,
    policy_module,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
    split_trajs=False,
)

# Replay buffer
replay_buffer = ReplayBuffer(
    storage=LazyTensorStorage(frames_per_batch),
    sampler=SamplerWithoutReplacement(),
)

# Loss
advantage_module = GAE(
    gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
)

loss_module = ClipPPOLoss(
    actor=policy_module,
    critic=value_module,
    advantage_key="advantage",
    clip_epsilon=clip_epsilon,
    entropy_bonus=bool(entropy_eps),
    entropy_coef=entropy_eps,
)

optim = torch.optim.Adam(loss_module.parameters(), lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optim,
    total_frames // frames_per_batch,
    0.0,
)

# Training
logs = defaultdict(list)
for i, data in enumerate(collector):
    for k in range(num_epochs):
        advantage_module(data)
        data_view = data.reshape(-1)
        replay_buffer.extend(data_view)
        for j in range(frames_per_batch // batch_size):
            subdata, *_ = replay_buffer.sample(batch_size)
            loss_vals = loss_module(subdata)
            loss_value = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )

            # Optim
            loss_value.backward()
            torch.nn.utils.clip_grad_norm_(
                loss_module.parameters(), max_grad_norm
            )
            optim.step()
            optim.zero_grad()

    collector.update_policy_weights_()
    scheduler.step()

collector.shutdown()
del collector

This would allow users to see a one-page recap of the contents that they can copy and paste in one action.

P.S. I think you might have left a comment that belongs to another tutorial: in the epoch iteration you talk about the mask and padding, but we are not splitting trajs.

viktor-ktorvi commented 1 year ago

Huge improvement in understandability!

I've written down my thoughts from last night while going through the tutorial. I think this will give you an insight into how someone whose first contact with the library was two days ago is thinking, which is, I'd say, the tutorial's target audience.

Intro

Great, understandable.

Well done.

Hyperparameters

  • Not sure about the meaning of frame_skip.

I think this is because I'm unfamiliar with the terminology that's probably being used in Gym. I'm still not 100% sure about the relationship between total_frames, frames_per_batch and frame_skip, so that might be a source of confusion which is the user's fault, but I'm not sure.

Normalization

  • What do in_keys do?
  • What types are specs? Is that all I need to provide when making an Env? How do I make an Env exactly?

I'm now pretty sure that in_keys index TensorDicts. The second point goes to show that your idea of making a tutorial on how to make an env is on point.
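If I understand it correctly, it is something like this (made-up shapes):

import torch
import torch.nn as nn
from tensordict import TensorDict
from tensordict.nn import TensorDictModule

module = TensorDictModule(nn.Linear(3, 2), in_keys=["observation"], out_keys=["action"])
td = TensorDict({"observation": torch.randn(4, 3)}, batch_size=[4])
module(td)                 # reads td["observation"], writes td["action"] in place
print(td["action"].shape)  # torch.Size([4, 2])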

Policy

  • Why does the NN output 2 * D_action? What are loc and scale? Ooooh, it's the mean and std of the Gaussian, is it? Took me too long to get that. Now I get why it's 2 * D_action.
  • Back to in and out keys. Is the user free to name those keys and is just responsible for keeping the usage consistent or do they have to be named like that?
  • I like how the "talking" through the TensorDictModule was explained.

I should have gotten this from the algorithm itself, but it did confuse me for a couple of minutes there.

Data collector

  • I'm having a bit of trouble understanding frames_per_batch. I guess I just need to study the specific terminology.
  • Not sure what the collector does. It somehow manages the environment. What is its relation to a DataLoader? Is there an analogy?

Same issues again. Reading through the tutorial again, I think I see where the issue is: I'm having trouble differentiating between a batch in the context of the data collector and a batch in the context of sampling from a replay buffer; the second one is clear to me, but I don't quite get what the first concept represents. Maybe it's my lack of terminology knowledge.
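My best guess at the distinction, though I might be wrong (numbers made up, names borrowed from the TL;DR snippet above):

from torchrl.collectors import SyncDataCollector

collector = SyncDataCollector(
    env,
    policy_module,
    frames_per_batch=1000,  # frames gathered per collector iteration ("batch" no. 1)
    total_frames=50_000,
)
for data in collector:                  # data has batch_size [1000]
    replay_buffer.extend(data.reshape(-1))
    subdata = replay_buffer.sample(64)  # 64 = the SGD minibatch ("batch" no. 2)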

Replay buffer

  • Not sure what the storage part specifies.

I see that it's not too important but it tickles my interest to know more about it.

Training loop

  • Not sure what for i, data in enumerate(collector): represents. Is it one timestep for the env? No, it's gotta be multiple timesteps. What's the shape of data?

  • for k in range(epochs) is how many update steps to take

  • for j in range(frames_per_batch // batch_size): is looping through the batch

  • I like that the usage of the loss module is shown.

  • In the # Optim part I guess that's gradient clipping; I'm not too familiar with that. It should be explained if it's part of the algorithm or if that's an optional thing to do just to avoid confusion.

  • then I see some optional logging

  • then evaluation. I have to see what rollout does exactly; I guess it steps through the environment n times.

  • collector.update_policy_weights_() has me a bit confused. I thought we updated the weights already with optim.step()

This part is pretty good as well, there's just the confusion from before.

Those were my thoughts from last night. I see there were some changes made afterwards but the gist of it is probably the same. I hope this gives the perspective of a new user.

Overall great improvement! I really appreciate it!

vmoens commented 1 year ago

this is very useful thanks! I will clarify it all

vmoens commented 1 year ago

I edited the tuto based on your feedback. It is very verbose right now (perhaps too much?).

@matteobettini regarding your question, the doc originates from a .py file that you will be able to get by clicking on the "view on GitHub" button on top. That being said, you'll still get all the comments etc. Not sure it is what you want.

viktor-ktorvi commented 1 year ago

The in and out keys are clear as day. frame_skip as well. I still had doubts about the frames_per_batch parameter at the beginning but by the end it was also clear as day. I think that I now get how the collector works perfectly.

It's very verbose but I'd rather have that information than not. I imagine that most people will focus on the code at first and then consult the text when they need to.

I really like the tutorial now, although I am biased having read it multiple times over. Thanks for taking the feedback!

shagunsodhani commented 1 year ago

Great work! Couple of minor suggestions below (only read till PPO section, will get back to rest later):

parametric policy network to solve the Ant task from

Tutorial is using the inverted pendulum task

Key learning items: - How to create an environment in TorchRL, transform its outputs, and collect data from this env; - How to make your classes talk to each other using TensorDict; - The basics of buildin

Indentation is messed up

We will cover six crucial components of TorchRL: environments, transforms, models (policy and value function), loss modules, data collectors and replay buffers.

Maybe add hyperlinks for each

elaborated version of REINFORCE

Maybe say "sophisticated version of REINFORCE"

Next, We will design

"Next, we will design"

Throughout this tutorial, we’ll be using the tensordict library.

Add link to tensordict

execute the policy on CUDA

"execute the policy on GPU"

vmoens commented 1 year ago

Thanks for the feedback everyone! @viktor-ktorvi @matteobettini @smorad, do you think that all the scripts under examples should look like this? I'm looking at them now and because we wanted them to cover multiple tasks etc they have become unreadable. I'd be happy to simplify them to something like what Matteo was suggesting.

viktor-ktorvi commented 1 year ago

Do you mean, should the code be split up or all together? I think the split with text in between is good. The "view on GitHub" button does the trick if you want to copy-paste the entire thing.

viktor-ktorvi commented 1 year ago

By the way, I've been trying to run the ppo example and I get an error:

observation_spec: Traceback (most recent call last):
  File "C:\Users\viktor\PycharmProjects\torchrl control\gym_example.py", line 56, in <module>
    print("observation_spec:", env.observation_spec)
  File "C:\Users\viktor\anaconda3\envs\control\lib\site-packages\torchrl\data\tensor_specs.py", line 1767, in __repr__
    sub_str = [
  File "C:\Users\viktor\anaconda3\envs\control\lib\site-packages\torchrl\data\tensor_specs.py", line 1768, in <listcomp>
    indent(f"{k}: {str(item)}", 4 * " ") for k, item in self._specs.items()
  File "C:\Users\viktor\anaconda3\envs\control\lib\site-packages\torchrl\data\tensor_specs.py", line 416, in __repr__
    space_str = "space=" + str(self.space)
  File "C:\Users\viktor\anaconda3\envs\control\lib\site-packages\torchrl\data\tensor_specs.py", line 129, in __repr__
    min_str = f"minimum=Tensor(shape={self.minimum.shape}, device={self.minimum.device}, dtype={self.minimum.dtype}, contiguous={self.maximum.is_contiguous()})"
AttributeError: 'int' object has no attribute 'shape'
print("observation_spec:", env.observation_spec)

It has something to do with the StepCounter() transform.

matteobettini commented 1 year ago

Thanks for the feedback everyone! @viktor-ktorvi @matteobettini @smorad, do you think that all the scripts under examples should look like this? I'm looking at them now and because we wanted them to cover multiple tasks etc they have become unreadable. I'd be happy to simplify them to something like what Matteo was suggesting.

According to my personal taste, yes, while still keeping as much generality as you can. There is always this trade-off between simplicity and generality, right? I personally lean more towards simplicity. I would make the top-level files in the examples folder simple and without helper functions. Then you can create a directory tree where you have examples for each extension (e.g. the main ppo could be like the one I proposed, then you would have other files for multi-task ppo, multi-agent ppo, and so on).

vmoens commented 1 year ago

observation_spec:", env.observation_spec

I can't reproduce this, does it happen with the nightly?

vmoens commented 1 year ago

According to my personal taste, yes, while still keeping as much generality as you can. There is always this trade-off between simplicity and generality, right? I personally lean more towards simplicity. I would make the top-level files in the examples folder simple and without helper functions. Then you can create a directory tree where you have examples for each extension (e.g. the main ppo could be like the one I proposed, then you would have other files for multi-task ppo, multi-agent ppo, and so on)

@matteobettini Helper functions bring very little actually, and people sort of always take them as one of the main components of the library. Removing them altogether would be a plus for torchrl.

viktor-ktorvi commented 1 year ago

I can't reproduce this, does it happen with the nightly?

Traceback (most recent call last):
  File "C:\Users\viktor\PycharmProjects\control\nightly_torchrl\ppo.py", line 8, in <module>
    from torchrl.collectors import SyncDataCollector
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\__init__.py", line 35, in <module>
    import torchrl.collectors
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\collectors\__init__.py", line 6, in <module>
    from .collectors import (
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\collectors\collectors.py", line 28, in <module>
    from torchrl.data import TensorSpec
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\data\__init__.py", line 7, in <module>
    from .replay_buffers import (
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\data\replay_buffers\__init__.py", line 6, in <module>
    from .replay_buffers import (
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\data\replay_buffers\replay_buffers.py", line 16, in <module>
    from .samplers import PrioritizedSampler, RandomSampler, Sampler
  File "C:\Users\viktor\anaconda3\envs\nightly_torchrl_env\lib\site-packages\torchrl\data\replay_buffers\samplers.py", line 13, in <module>
    from torchrl._torchrl import (
ImportError: DLL load failed while importing _torchrl: The specified procedure could not be found.

Process finished with exit code 1

Can't even run the nightly because of the C++ extensions problem so I can't check.

smorad commented 1 year ago

Thanks for the feedback everyone! @viktor-ktorvi @matteobettini @smorad, do you think that all the scripts under examples should look like this? I'm looking at them now and because we wanted them to cover multiple tasks etc they have become unreadable. I'd be happy to simplify them to something like what Matteo was suggesting.

Yeah. I like looking at the code in a single continuous format, then going through the explanations for each block to clarify what I don't understand.

According to my personal taste, yes, while still keeping as much generality as you can. There is always this trade-off between simplicity and generality, right? I personally lean more towards simplicity. I would make the top-level files in the examples folder simple and without helper functions. Then you can create a directory tree where you have examples for each extension (e.g. the main ppo could be like the one I proposed, then you would have other files for multi-task ppo, multi-agent ppo, and so on)

@matteobettini Helper functions bring very little actually, and people sort of always take them as one of the main components of the library. Removing them altogether would be a plus for torchrl.

That would be great!

neo-alex commented 8 months ago

I love the modular spirit of TorchRL, very much appreciated for RL research! And seeing the list of features growing quickly is also very encouraging for the future of this library.

However, coming back to this issue after one year, and despite all the efforts already deployed to improve it, I feel that documentation & examples are still the main weak spots of TorchRL (probably slowing down its adoption).

If this issue becomes a priority again, I would recommend taking some inspiration from the very readable documentation & examples (both agent stubs and specific implementations... even for advanced distributed algorithms like Ape-X) from coax, another modular RL library - unfortunately not too much maintained anymore.

In any case, thanks again for all the great work that went into TorchRL already; this comment is just about making it more accessible and attractive to newcomers 😃

matteobettini commented 8 months ago

Yes, coming back to this after a while, there are still some points that I feel can be improved.

One that we talked a lot about, and that I think is worth considering, is changing the name of the examples folder. This folder currently contains algorithm implementations, not easily approachable examples for newcomers. I have a few propositions for the new name from my side.

It would then be useful to have an examples folder with bare-bones examples for the various tasks that users would use torchrl for (e.g., a basic training loop, basic env creation), similar to the snippet mentioned earlier in this thread. These examples would be code-first (not much text) and short (aiming at < 100 lines).

vmoens commented 8 months ago

documentation could benefit from a "Getting Started" section

Agreed! What format should it take, in your opinion? A section on the main page with a basic example? A page-by-page tutorial where one learns how to code a basic algo? I can get that done super quickly; I'm just wondering what format would make most people happy.

neo-alex commented 8 months ago

Wrt. the Getting Started, I would personally love a minimal example (like the one from your TorchRL paper in Figure 1). That would be a great start! Of course, reworking/adding examples in general would be more difficult & time-consuming...

vmoens commented 8 months ago

Cool, I'll work on that and ping you with it!

vmoens commented 7 months ago

I've started working on this here: #1886

Preview: https://docs-preview.pytorch.org/pytorch/rl/1886/tutorials/getting-started-0.html

The rendering is terrible and there are countless things to fix, but is this format ok?

I was thinking of having one on envs, another on writing a policy, another on losses and optim, one on collecting and storing data and finally one on logging (wandb and such).

neo-alex commented 7 months ago

Thank you for taking care of this so quickly.

I initially imagined something shorter than 5 new notebooks as a Getting Started - but having read them carefully, I actually liked how you explained every core concept step-by-step and felt like every bit of information here was valuable. Great work, I believe this will be a very nice addition to the documentation!

Just a few little things I noticed on the various notebooks:

1. [typo] We now need to pass this action tp the environment.

2. [formatting] To simplify the incorporation of Actor, # ProbabilisticActor, # ActorValueOperator or # ActorCriticOperator.

3. [typo] In TorchRL, we try to tread optimization as it is custom to do in PyTorch

4. [formatting] The most basic data collector is the :clas:~torchrl.collectors.SyncDataCollector
   [formatting] Second, it uses :classL:~tensordict.MemoryMappedTensor as a backend
   [typo] You can have look at other multirpocessed collectors

5. [reformulation] We’ll be making a regular, deterministic version to be used within the loss module and during evaluation, and one augmented by an exploration module for inference.

Thanks again!

vmoens commented 7 months ago

I think I fixed them all except for the last one!

I also specified that you can start with the last tutorial and quickly see what a training loop looks like before backtracking to the other tutos. Hope that makes sense and is closer to what you had in mind!