keiohta / tf2rl

TensorFlow2 Reinforcement Learning
MIT License

The apex implementation seems quite slow #117

Open sorryformyself opened 3 years ago

sorryformyself commented 3 years ago

In the learner process of Ape-X, most of the time is spent on sampling, lock acquisition, and priority updating. On my 1080 Ti, I get less than 20% GPU utilization. Are there any tricks to train faster?

My batch size is 512

sorryformyself commented 3 years ago

Also, the writer.flush() call in the learner process may take a considerable share of the time.

ymd-h commented 3 years ago

@sorryformyself Thank you for reporting.

It is difficult to say without knowing the details, but enlarging the local buffer size might reduce lock waiting.
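
For illustration, here is a hypothetical sketch of that idea (none of these names are tf2rl's actual API): transitions are collected in a plain Python list and merged into the global buffer in a single locked call per chunk, so the lock is taken once per local_size transitions instead of once per transition.

def add_transition(lock, local_buffer, global_rb, transition, local_size=1024):
    # Hypothetical helper: buffer locally, merge into the global buffer in chunks.
    local_buffer.append(transition)
    if len(local_buffer) >= local_size:
        with lock:  # one lock acquisition per chunk, not one per transition
            for t in local_buffer:
                global_rb.add(**t)  # assumes a cpprb-style add(**kwargs) interface
        local_buffer.clear()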

@keiohta Apex calls SummaryWriter.flush() at every iteration, which is usually far too frequent and inefficient.

https://github.com/keiohta/tf2rl/blob/58e05f7a096b80282c75b37d449d9768a672208c/tf2rl/algos/apex.py#L229

You can reduce the number of calls depending on how frequently you check TensorBoard. Also, I think it is a good idea to let the SummaryWriter decide the timing itself: by default, it flushes once 10 events are queued or every 2 minutes, whichever comes first. (Of course, you can configure this.) https://www.tensorflow.org/api_docs/python/tf/summary/create_file_writer
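
For example, a minimal sketch of letting the writer batch events (the log directory, loop, and metric here are placeholders):

import tensorflow as tf

writer = tf.summary.create_file_writer(
    "results/apex",        # placeholder log directory
    max_queue=100,         # buffer up to 100 events before an automatic flush
    flush_millis=120_000,  # and flush at least every two minutes
)

with writer.as_default():
    for step in range(1000):  # dummy training loop
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)  # dummy metric
        if step % 100 == 0:  # explicit flush only occasionally, if at all
            writer.flush()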

sorryformyself commented 3 years ago

@ymd-h Thanks for your patient reply!

I've enlarged the local buffer size and added a background thread to fetch samples before training, like:

def sample(lock, global_rb, batch_size, tf_queue):
    # Background thread: pre-fetch batches from the global replay buffer
    # so the learner does not block on the lock.
    while True:
        with lock:  # released automatically, even if sample() raises
            samples = global_rb.sample(batch_size)
        tf_queue.enqueue(samples)

and in the learner, samples = tf_queue.dequeue().

This gives a small improvement, but it is still far slower than the roughly 19 batches per second (batch size 512) reported in the Ape-X paper :(
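
For reference, here is a minimal self-contained sketch of how such a prefetch thread could be wired up. It swaps in Python's standard queue.Queue for the TensorFlow queue, and DummyBuffer stands in for the shared replay buffer; the batch size of 512 matches the setting above.

import queue
import threading

class DummyBuffer:  # stand-in for the shared (global) replay buffer
    def sample(self, batch_size):
        return list(range(batch_size))

def prefetch(lock, global_rb, batch_size, q):
    # Producer thread: hold the lock only while sampling, then hand the
    # batch to the learner through the queue.
    while True:
        with lock:
            batch = global_rb.sample(batch_size)
        q.put(batch)  # blocks when the queue is full (backpressure)

lock = threading.Lock()
q = queue.Queue(maxsize=8)  # a small prefetch depth is usually enough
threading.Thread(target=prefetch,
                 args=(lock, DummyBuffer(), 512, q), daemon=True).start()

samples = q.get()  # learner side: a batch is already waiting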

keiohta commented 3 years ago

Hi @sorryformyself, sorry for the late reply; I'm on vacation.

Yeah, my current implementation is inefficient, and I have some ideas for improving it.

I think I'll be able to provide the improved code within a week. Thanks!