openai / coinrun

Code for the paper "Quantifying Transfer in Reinforcement Learning"
https://blog.openai.com/quantifying-generalization-in-reinforcement-learning/
MIT License

Hardware requirements #7

Closed: a7b23 closed this issue 5 years ago

a7b23 commented 5 years ago

Hi, I was running the code on my Mac, and a single parameter update takes around 2 minutes, which means the entire training run would take around a month. What hardware did you train on, and how long did training take?

kcobbe commented 5 years ago

On a machine with 20 cpus and 8 GTX 1080 Ti graphics cards, training for 256M total timesteps (32M on each of 8 MPI processes) takes about 6 hours.
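As a rough sanity check, the throughput implied by those figures can be worked out directly. This is only arithmetic on the numbers quoted above; it assumes the 6 hours is wall-clock time and that all 8 MPI processes run concurrently.

```python
# Throughput implied by the figures quoted above.
# Assumption: 6 hours of wall-clock time, all 8 MPI processes running concurrently.
total_timesteps = 256_000_000
num_processes = 8
hours = 6

seconds = hours * 3600
total_fps = total_timesteps / seconds
per_process_fps = total_fps / num_processes

print(f"total: {total_fps:.0f} steps/s")              # total: 11852 steps/s
print(f"per process: {per_process_fps:.0f} steps/s")  # per process: 1481 steps/s
```

That is roughly 11,850 steps per second overall, or about 1,480 per process, which gives a reference point for the fps numbers reported on other hardware.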

maximilianigl commented 5 years ago

I think it would be great to include information on how to start such an experiment in the main README file. I stumbled across the RCALL_NUM_GPU variable after digging around the code for a bit. I believe that's how you can tell it to use exactly one GPU for each of your MPI threads (i.e. use the same number of GPUs and threads)? Using more than one thread per GPU didn't work for me due to memory.

maximilianigl commented 5 years ago

Also: Do I understand correctly that the number of steps used as the x-axis, e.g. in TensorBoard, is the steps per thread, i.e. I need to multiply that number by the number of threads I have?

KaiyangZhou commented 4 years ago

Hi @maximilianigl, did you manage to run the code using multiple gpus? What's the correct command?

maximilianigl commented 4 years ago

Yes, it worked for me. My run command is RCALL_NUM_GPU=4 mpiexec -n 4 python3 -m coinrun.train_agent <options>. However, I'm working with a modified codebase (link) and I no longer remember whether that's transferable one-to-one.

Have you seen this repo? They incorporated coinrun into an entire suite of different environments with a nicer interface as well.

KaiyangZhou commented 4 years ago

@maximilianigl I see. Thanks!

I'm using CUDA_VISIBLE_DEVICES=2,3 mpiexec -n 2 python -m coinrun.train_agent, but it turned out that the code only used one GPU (DEVICE=2) for training, and the other GPU's utilization was zero. It loaded twice the memory, but training is very slow, even much slower than with just CUDA_VISIBLE_DEVICES=2 python -m coinrun.train_agent. Any idea what could be the cause?

p.s. my mpiexec version is 2.1.1

[Screenshot 2020-05-21 at 12.52.13]

(I'm only using gpu2,3)

maximilianigl commented 4 years ago

Have you tried to set RCALL_NUM_GPU=2 as well?

KaiyangZhou commented 4 years ago

Yes, but RCALL_NUM_GPU is eventually converted to CUDA_VISIBLE_DEVICES, as shown here: https://github.com/openai/coinrun/blob/master/coinrun/main_utils.py#L111, and I ended up using other people's devices lol

maximilianigl commented 4 years ago

Ah, I see. It worked for me because I was running it inside a Docker container that only has access to the GPUs I want it to use. What about modifying the function you linked to so that it picks one of several GPUs that you specify via command-line options, based on which local_rank the corresponding process has? I.e. having something like --use-gpus 3,4 and then making sure local_rank==0 gets CUDA_VISIBLE_DEVICES=3 and local_rank==1 gets CUDA_VISIBLE_DEVICES=4?
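A minimal sketch of that idea, kept separate from the coinrun code so it can be run on its own. The `--use-gpus` flag and the helper names `parse_gpu_list` and `pick_gpu` are hypothetical and do not exist in the repository; in a real run the chosen id would be written to CUDA_VISIBLE_DEVICES before TensorFlow initializes CUDA.

```python
import argparse


def parse_gpu_list(text):
    """Turn '3,4' into ['3', '4'], ignoring stray whitespace and empty entries."""
    return [part.strip() for part in text.split(',') if part.strip()]


def pick_gpu(gpus, local_rank):
    """Return the device id for a process with the given local rank (round-robin)."""
    if not gpus:
        raise ValueError("no GPUs specified")
    return gpus[local_rank % len(gpus)]


parser = argparse.ArgumentParser()
# Hypothetical flag: not part of coinrun's actual command-line options.
parser.add_argument('--use-gpus', type=parse_gpu_list, default=[])
args = parser.parse_args(['--use-gpus', '3,4'])

for local_rank in range(2):
    # In a real process: os.environ['CUDA_VISIBLE_DEVICES'] = pick_gpu(...)
    print(local_rank, '->', pick_gpu(args.use_gpus, local_rank))
```

With `--use-gpus 3,4`, local rank 0 maps to device 3 and local rank 1 to device 4. Taking the modulus over the length of the list means extra processes wrap around instead of raising an IndexError.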

There might very well be a cleaner solution, but I'm not enough of a TF expert to know it :)

KaiyangZhou commented 4 years ago

@maximilianigl Thanks! I'll try your solution.

I'm not a TF guy either :D

KaiyangZhou commented 4 years ago

@maximilianigl Your solution works.

This is what I changed in coinrun/main_utils.py:

import os
import platform
from mpi4py import MPI

def setup_mpi_gpus():
    if 'RCALL_NUM_GPU' not in os.environ:
        return
    num_gpus = int(os.environ['RCALL_NUM_GPU'])
    node_id = platform.node()
    nodes = MPI.COMM_WORLD.allgather(node_id)
    # local_rank = number of lower-ranked processes running on the same node
    local_rank = len([n for n in nodes[:MPI.COMM_WORLD.Get_rank()] if n == node_id])
    # original line:
    # os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)

    # AVAI_DEVICES lists the physical GPU ids to use, e.g. AVAI_DEVICES=0,1
    # (it must contain at least RCALL_NUM_GPU entries)
    avai_devices = os.environ['AVAI_DEVICES'].split(',')
    os.environ['CUDA_VISIBLE_DEVICES'] = avai_devices[local_rank % num_gpus]

And the command line is

AVAI_DEVICES=2,3 RCALL_NUM_GPU=2 mpiexec -np 2 python -m coinrun.train_agent
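To see which device each process ends up with, the local-rank logic can be simulated without launching MPI. This is a sketch: the hostnames are made up, and the `nodes` list stands in for what `MPI.COMM_WORLD.allgather(platform.node())` would return on a two-machine run.

```python
def local_rank_of(rank, nodes):
    """Count lower-ranked processes that share this process's node."""
    node_id = nodes[rank]
    return len([n for n in nodes[:rank] if n == node_id])


# Stand-in for MPI.COMM_WORLD.allgather(platform.node()); hostnames are hypothetical.
nodes = ['hostA', 'hostA', 'hostB', 'hostB']
avai_devices = '2,3'.split(',')  # same format as AVAI_DEVICES=2,3
num_gpus = 2                     # same role as RCALL_NUM_GPU=2

assignment = {
    rank: avai_devices[local_rank_of(rank, nodes) % num_gpus]
    for rank in range(len(nodes))
}
print(assignment)  # {0: '2', 1: '3', 2: '2', 3: '3'}
```

Each node hands out devices 2 and 3 to its own first and second process, so the mapping depends on the rank within a machine, not the global rank.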

Now the code is successfully running on GPUs 2 and 3.

[Screenshot 2020-05-21 at 15.47.47]

However, the training speed does not seem to benefit from using two GPUs. I printed the estimated time to completion, and it shows the job will finish in around 28 hours, which is similar to training on one GPU. Am I missing something?

p.s. each step takes around 3.1275 seconds, is this too slow?
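For reference, a step time can be converted into frames per second and an estimated run time once the batch size is known. The sketch below assumes 32 environments, 256 steps per environment per update, and 256M timesteps per process; none of these values are stated in this thread, so check them against your own configuration.

```python
def frames_per_second(seconds_per_update, num_envs, num_steps):
    """Frames collected per second by one process."""
    return num_envs * num_steps / seconds_per_update


def eta_hours(seconds_per_update, total_timesteps, num_envs, num_steps):
    """Estimated run time in hours for one process."""
    updates = total_timesteps // (num_envs * num_steps)
    return updates * seconds_per_update / 3600


# Assumed values, not taken from this thread.
NUM_ENVS = 32
NUM_STEPS = 256
TOTAL_TIMESTEPS = 256_000_000

fps = frames_per_second(3.1275, NUM_ENVS, NUM_STEPS)
eta = eta_hours(3.1275, TOTAL_TIMESTEPS, NUM_ENVS, NUM_STEPS)
print(f"{fps:.0f} fps per process, about {eta:.1f} hours")
```

Under these assumptions, 3.1275 seconds per update works out to roughly 2,600 fps per process and a run time of about 27 hours, in the same range as the 28-hour estimate above.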

maximilianigl commented 4 years ago

Not sure. I've got about 4000 fps when I run it. Maybe your V100s are fast enough that they aren't the bottleneck when running two threads on one GPU?

KaiyangZhou commented 4 years ago

I'm running mpiexec -np 8 with 2 GPUs and I get around 1000 fps, which may be too low? Hmm, I need to find the reason.

maximilianigl commented 4 years ago

I'd say with 8 threads you'd probably need at least 4 GPUs, even if they're V100s, at least when you're using the IMPALA architecture. Not sure how many resources the Atari architecture needs.

KaiyangZhou commented 4 years ago

Yeah, makes sense. I'll just stick to one GPU then, and maybe just chill while it takes 1 day to run.

maximilianigl commented 4 years ago

Just to point out: All the values are per thread, i.e. running on more threads with the same number of GPUs makes it slower, even though you might overall consume more frames.
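In other words, a logged per-thread value has to be multiplied by the number of MPI processes to get the aggregate. A small sketch using numbers from this thread, on the assumption that the 1000 fps figure was a per-process reading:

```python
def aggregate(per_process_value, num_processes):
    """Scale a per-process metric (fps, timesteps) to the whole MPI job."""
    return per_process_value * num_processes


# 1000 fps reported with `mpiexec -np 8`, assumed to be a per-process figure.
print(aggregate(1000, 8))        # 8000

# 32M steps per process on the x-axis, across 8 processes.
print(aggregate(32_000_000, 8))  # 256000000
```

So a lower per-thread fps with more threads can still mean more frames consumed overall, even though each individual thread, and therefore the run as logged, progresses more slowly.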

KaiyangZhou commented 4 years ago

I see, thanks!