pfnet / pfrl

PFRL: a PyTorch-based deep reinforcement learning library

Actor processes hang in `train_agent_async` when `use_tensorboard=True` #88

Closed · g-votte closed this issue 3 years ago

g-votte commented 3 years ago

When I turned on TensorBoard with the actor-learner mode in train_dqn_gym.py, the program froze after the first evaluation. The reproduction steps and my analysis are summarized below.

Reproduction

I hit the following problem on this commit, the latest master as of Nov. 3rd, 2020.

Steps to reproduce

  1. In examples/gym/train_dqn_gym.py, add use_tensorboard=True as an argument of train_agent_async() (here); a sketch of the call after the change is shown below the list
  2. Run python examples/gym/train_dqn_gym.py --actor-learner
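For reference, the modified call looks roughly like this. This is only a sketch: apart from use_tensorboard=True, the argument names are written from memory of the pfrl API rather than copied from the example, so they may not match the actual call site. The only intended change is the added use_tensorboard=True line.

    # In examples/gym/train_dqn_gym.py (argument list abbreviated; names other
    # than use_tensorboard are illustrative, not copied from the example)
    experiments.train_agent_async(
        agent=agent,
        outdir=args.outdir,
        processes=args.num_envs,      # hypothetical name for the actor count
        make_env=make_env,
        steps=args.steps,
        eval_n_steps=None,
        eval_n_episodes=args.eval_n_runs,
        eval_interval=args.eval_interval,
        use_tensorboard=True,         # the added argument from step 1
    )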

Result

The actor process hangs during the first round of evaluation, after printing the following log.

...
INFO:pfrl.experiments.train_agent_async:evaluation episode 96 length:200 R:-1494.8058766440454
INFO:pfrl.experiments.train_agent_async:evaluation episode 97 length:200 R:-1592.9273165459317
INFO:pfrl.experiments.train_agent_async:evaluation episode 98 length:200 R:-1533.3344787068036
INFO:pfrl.experiments.train_agent_async:evaluation episode 99 length:200 R:-1570.1153000497297

Expected behavior

The actor process keeps running without hanging.

Analysis

The actor process stops here, during summary_writer.add_scalar, where TensorBoard's SummaryWriter appears to deadlock.

I suspect this happens because the _AsyncWriterThread used internally by SummaryWriter is not running in the actor processes. Actor processes are forked from the root process with a copy of the SummaryWriter, but on a POSIX system fork() does not duplicate the parent's threads, including the _AsyncWriterThread that drains the writer's queue. Consequently, the queue is never consumed; once it reaches full capacity, add_scalar blocks when trying to enqueue a new scalar, and the actor gets stuck there.
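The following standalone sketch (not the pfrl code path) illustrates the suspected mechanism, assuming torch.utils.tensorboard.SummaryWriter and the POSIX fork start method: the forked child inherits the writer object and its bounded event queue, but not the background thread that drains it, so add_scalar eventually blocks.

    import multiprocessing as mp
    from torch.utils.tensorboard import SummaryWriter

    def actor(writer):
        # The forked child holds a copy of the writer, but the
        # _AsyncWriterThread that consumes its internal queue only
        # exists in the parent process.
        for step in range(1000):
            # With the default queue size, this blocks after a handful
            # of calls because nothing drains the child's copy of the queue.
            writer.add_scalar("actor/score", float(step), step)
            print("added", step)

    if __name__ == "__main__":
        writer = SummaryWriter()          # writer thread starts in the parent
        ctx = mp.get_context("fork")      # same fork semantics as the actors
        p = ctx.Process(target=actor, args=(writer,))
        p.start()
        p.join()                          # never returns: the child is stuck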