google-research / seed_rl

SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. Implements IMPALA and R2D2 algorithms in TF2 with SEED's architecture.

Missing learning curve data for `Defender` `Surround` #78

Closed vwxyzjn closed 1 year ago

vwxyzjn commented 2 years ago

Hi, thanks for making the learning curves data available! I am using it for my study but noticed the csv file in https://github.com/google-research/seed_rl/blob/master/docs/r2d2_atari_training_curves.md lacks the data for Defender and Surround. Would you mind looking into it?

E.g., the Surround learning curve should come after StarGunner

[screenshot: columns of the training-curve file, with Surround missing after StarGunner]
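For reference, this is roughly how I spotted the gap (a minimal sketch; the CSV filename and the `game` column are placeholders for however the published curves get exported, not the actual layout of the repo file):

```python
import pandas as pd

# Small illustrative subset of the standard Atari-57 game list.
EXPECTED = {"Asteroids", "Defender", "StarGunner", "Surround", "Zaxxon"}

curves = pd.read_csv("r2d2_atari_training_curves.csv")  # hypothetical export
present = set(curves["game"].unique())
print("missing:", sorted(EXPECTED - present))  # Defender and Surround show up here
```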

Thank you!

brieyla1 commented 2 years ago

The frames are similar to IMPALA, which uses V-trace. In my opinion, your best bet is to run it yourself with GPUs; the TPU setup is very hard and still being discussed.

vwxyzjn commented 2 years ago

Hi @brieyla1, thanks for the suggestion. I was hoping to use the TPU learning curves for a paper, and it would probably be an inconsistent setup if I replicated the experiments on GPU or used a different algorithm like IMPALA :/

brieyla1 commented 2 years ago

I have followed the project since its early days, but unfortunately never got a TPU to run on it.

The TPU path doesn't really work as expected; from what I know, it was developed against early nightly builds that no longer work.

@lespeholt from Google may know a bit more.

I consider this project deprecated as of now.

lespeholt commented 2 years ago

@vwxyzjn There shouldn't be a big difference between GPU and TPU, or between IMPALA and batched A2C for that matter, if you stick to a small-scale setup.

We always wanted a nice public TPU version. However, because the full TPU release on Cloud didn't become available in time, the effort died out a bit. And yeah, other priorities took over :-)

vwxyzjn commented 2 years ago

@lespeholt Thank you for the response! On a related note, any chance you have the raw learning-curve data for IMPALA? We would like to know the median human-normalized score achieved within the first hour of training.

In the IMPALA paper, it was mentioned that the "shallow IMPALA (Atari) experiment completes training over 200 million frames in less than one hour", but we are wondering 1) how long the deep IMPALA experiment took and 2) exactly how long the shallow IMPALA run took. The closest paper with this information is R2D2, but its x-axis has a much larger scale, as shown below:

[screenshot: learning-curve figure from the R2D2 paper, whose x-axis spans a much longer wall-time scale]

We are working on distributed RL in a simplified setup that only does rollouts asynchronously, and we would love to compare against prior state-of-the-art work such as yours. Because we are limited by computational resources, we can only run experiments for up to an hour per game per seed, which makes it hard to compare against past work without the raw learning curves.

I would be happy to provide more info / chat more if you are interested. Your work is awesome, and I would love to connect and learn from you :)

lespeholt commented 2 years ago

Unfortunately I don't have access to the IMPALA curves.

1) ". Using 8 learner GPUs instead of 1 further speeds up training of the deep model by a factor of 7 to 210K frames/sec, up from 30K frames/sec." so it's roughly 200/30 = 6.67x slower with one GPU, and slightly faster with 8 GPUs (so not quite linear scale due to communication between GPUs). SEED and Cloud TPUs scales better.

2) See Table 1. Assuming batch size 32, that is 200,000,000 frames / 200,000 frames/sec ≈ 1,000 seconds ≈ 17 minutes. With SEED you can actually get under 1 minute if necessary.
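Spelled out as a rough back-of-the-envelope calculation (using only the throughput figures above, not exact measurements):

```python
# Back-of-the-envelope wall-time estimates from the quoted throughput figures.
FRAMES_TOTAL = 200_000_000  # standard 200M-frame Atari budget

shallow_fps_1gpu = 200_000  # shallow IMPALA, Table 1 (batch size 32)
deep_fps_1gpu = 30_000      # deep IMPALA, single learner GPU
deep_fps_8gpu = 210_000     # deep IMPALA, 8 learner GPUs

print(FRAMES_TOTAL / shallow_fps_1gpu / 60)   # ~16.7 min (the "17 minutes" above)
print(FRAMES_TOTAL / deep_fps_1gpu / 3600)    # ~1.9 hours on a single GPU
print(FRAMES_TOTAL / deep_fps_8gpu / 60)      # ~15.9 min on 8 GPUs
print(shallow_fps_1gpu / deep_fps_1gpu)       # ~6.7x slower for the deep model on 1 GPU
```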

Best, Lasse

vwxyzjn commented 2 years ago

Thank you, Lasse! I might need to bother you with some additional questions...

Both Table 1 and Figure 6 report the SPS (steps per second) numbers on DMLab-30. Would those numbers be the same for Atari environments? Does DMLab run at the same speed as Atari?

> See Table 1. Assuming batch size 32, that is 200,000,000 frames / 200,000 frames/sec ≈ 1,000 seconds ≈ 17 minutes. With SEED you can actually get under 1 minute if necessary.

Does this mean IMPALA (shallow) reaches a 93.2% median human-normalized score (HNS) after 17 minutes of training? And would it imply that SEED RL can reach 93.2% median HNS in under 1 minute of training?

When plotting SEED RL's R2D2 learning curves, I find that reaching 100% median HNS (omitting Defender and Surround) took about an hour.

[screenshot: median HNS vs. training wall-time, computed from SEED RL's published R2D2 learning curves]
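For reference, the plot above was produced along these lines (a minimal sketch; the column names `game`, `training_time_hours`, `episode_return` and the CSV filename are placeholders for however the published curves get exported, not the actual layout):

```python
import pandas as pd

# Placeholder per-game reference scores; the real values come from the
# standard Atari human/random score tables (only two games shown here).
HUMAN = {"Breakout": 30.5, "Pong": 14.6}
RANDOM = {"Breakout": 1.7, "Pong": -20.7}

def median_hns(curves: pd.DataFrame, hours: float) -> float:
    """Median human-normalized score across games at a wall-time cutoff."""
    scores = []
    for game, df in curves.groupby("game"):
        # Only score games with reference values; Defender and Surround
        # are omitted anyway because their curves are missing.
        if game not in HUMAN:
            continue
        before_cutoff = df[df["training_time_hours"] <= hours]
        if before_cutoff.empty:
            continue
        ret = (before_cutoff.sort_values("training_time_hours")
                            ["episode_return"].iloc[-1])
        scores.append((ret - RANDOM[game]) / (HUMAN[game] - RANDOM[game]))
    return float(pd.Series(scores).median())

curves = pd.read_csv("r2d2_atari_training_curves.csv")  # hypothetical export
print(median_hns(curves, hours=1.0))
```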

Here is a direct analysis of Figure 6 in the SEED RL paper.

[screenshot: annotated view of Figure 6 from the SEED RL paper]

It seems that sample efficiency can drop as throughput increases. Setting throughput aside, I am interested in wall-time performance: which published algorithm needs the fewest training wall-time hours to reach 100% median HNS?