sorryformyself opened 3 years ago
Also, `writer.flush()` in the learner process can take a significant share of the time.
@sorryformyself Thank you for reporting.
It is difficult to say without knowing the details, but enlarging the local buffer size might reduce the time spent waiting on the lock.
@keiohta
Apex calls `SummaryWriter.flush()` at every iteration, which is usually far too often and inefficient. You can reduce the number of calls depending on how frequently you check TensorBoard.
Also, I think it is a good idea to let the `SummaryWriter` decide the timing itself. By default, a writer created with `tf.summary.create_file_writer` flushes every 2 minutes or once 10 pending events have accumulated. (Of course, both are configurable via `flush_millis` and `max_queue`.)
https://www.tensorflow.org/api_docs/python/tf/summary/create_file_writer
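The same throttling idea can be sketched without TensorFlow: a small wrapper that only forwards every N-th `flush()` call to the real writer. The `ThrottledWriter`/`CountingWriter` names are made up for this illustration; with `tf.summary` you would instead just pass `max_queue`/`flush_millis` to `create_file_writer` and stop calling `flush()` manually.

```python
class ThrottledWriter:
    """Forward flush() to the wrapped writer only every `every` calls.

    `writer` is any object with a flush() method (e.g. a summary writer).
    This wrapper is illustrative, not part of tf2rl or TensorFlow.
    """
    def __init__(self, writer, every=10):
        self.writer = writer
        self.every = every
        self._calls = 0

    def flush(self):
        self._calls += 1
        if self._calls % self.every == 0:
            self.writer.flush()  # real flush happens once per `every` calls


class CountingWriter:
    """Dummy writer that counts real flushes, for demonstration."""
    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1


w = CountingWriter()
tw = ThrottledWriter(w, every=10)
for _ in range(100):
    tw.flush()       # called 100 times by the "training loop"
print(w.flushes)     # only 10 real flushes reach the underlying writer
```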
@ymd-h Thanks for your patient reply!
I've enlarged the local buffer size and added a background thread to fetch samples before training, like:
```python
def sample(lock, global_rb, batch_size, tf_queue):
    while True:
        with lock:
            samples = global_rb.sample(batch_size)
        tf_queue.enqueue(samples)
```
and, in the learner,

```python
samples = tf_queue.dequeue()
```

This gives a small improvement, but it is still too slow compared to the 19 transitions per second reported in the Ape-X paper :(
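For reference, the prefetch pattern above can be sketched with only the standard library, using `queue.Queue` in place of a TF FIFOQueue and a toy replay buffer; `ToyBuffer`, the sizes, and the batch counts are invented for this example, not taken from tf2rl.

```python
import random
import threading
import queue

class ToyBuffer:
    """Stand-in for the global replay buffer (illustrative only)."""
    def __init__(self, n):
        self.data = list(range(n))

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def sampler(lock, global_rb, batch_size, out_q, n_batches):
    """Background thread: hold the lock only while sampling, then enqueue."""
    for _ in range(n_batches):
        with lock:
            batch = global_rb.sample(batch_size)
        out_q.put(batch)  # blocks when the queue is full (bounded prefetch)

lock = threading.Lock()
rb = ToyBuffer(10_000)
q = queue.Queue(maxsize=4)  # bounded, so the sampler cannot run far ahead
t = threading.Thread(target=sampler, args=(lock, rb, 512, q, 8), daemon=True)
t.start()

# Learner loop: dequeue pre-fetched batches instead of sampling inline.
for _ in range(8):
    batch = q.get()
    assert len(batch) == 512
t.join()
```

The bounded queue is the key design choice: it overlaps sampling with training while capping how much stale data can accumulate ahead of the learner.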
Hi @sorryformyself, sorry for the late reply; I'm on vacation.
Yeah, my current implementation is inefficient, and I have some ideas to improve it, as follows:

- `writer.flush()` (this is really inefficient; thanks @ymd-h!)

I think I'll be able to provide the improved code within a week. Thanks!
In the learner process of Ape-X, most of the time is spent on sampling, lock acquisition, and priority updating. On my 1080 Ti I get less than 20% GPU utilization. Are there any tricks to train faster?
My batch size is 512.
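One common way to attack the lock-acquisition cost (my suggestion, not something tf2rl necessarily does) is to amortize it: draw several batches under a single lock acquisition, so the learner pays the acquire/release overhead 1/k as often. A minimal sketch, again with an invented `ToyBuffer` and instrumentation added purely to make the effect visible:

```python
import random
import threading

class ToyBuffer:
    """Stand-in replay buffer (illustrative; not the tf2rl buffer)."""
    def __init__(self, n):
        self.data = list(range(n))
        self.lock_acquisitions = 0  # instrumentation for the example

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def sample_batches(lock, rb, batch_size, k):
    """Draw k batches under a single lock acquisition."""
    with lock:
        rb.lock_acquisitions += 1
        return [rb.sample(batch_size) for _ in range(k)]

lock = threading.Lock()
rb = ToyBuffer(10_000)

# 64 batches fetched in chunks of 8 -> 8 lock acquisitions instead of 64.
batches = []
for _ in range(8):
    batches.extend(sample_batches(lock, rb, 512, 8))

print(len(batches), rb.lock_acquisitions)  # -> 64 8
```

The trade-off is staleness: priorities updated after a chunk was drawn do not affect batches already in that chunk, so `k` should stay small.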