google-research / circuit_training


Program blocked #42

Open Rejuy opened 1 year ago

Rejuy commented 1 year ago

Hi there! I ran into a problem while running the project. I followed the README.md, and when execution reached this line, it blocked and never returned. How could this happen? I have no idea. Could you give me some advice? Thanks a lot!

# learner.py
loss_info = self._generic_learner.run(self._steps_per_iter,
                                              self._train_iterator)
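
Before digging into the learner itself, it may be worth ruling out the data pipeline: in this setup the training iterator is typically backed by a Reverb replay buffer, and `run` will block indefinitely if no collect job is feeding it. Below is a minimal probe sketch; the `probe_iterator` helper and the timeout value are hypothetical illustrations, not part of the project:

```python
import logging
import threading


def probe_iterator(iterator, timeout_sec=60):
  """Pull one batch on a side thread and warn if it blocks too long."""
  done = threading.Event()

  def _pull():
    next(iterator)   # blocks if the replay buffer has no data to serve
    done.set()

  threading.Thread(target=_pull, daemon=True).start()
  if not done.wait(timeout_sec):
    logging.warning(
        'No batch after %ds: the replay buffer may be empty or the collect '
        'jobs may not be producing experience.', timeout_sec)
  else:
    logging.info('Iterator yielded a batch; the hang is probably elsewhere.')
```

If the probe times out, the problem is upstream of the learner (collect jobs, Reverb server); if a batch arrives promptly, the hang is more likely inside the training step itself.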
Rejuy commented 1 year ago

I stepped into the function and added some logging. Surprisingly, I found that in the learner.py module of tf_agents, the problem actually shows up here:

  def run(self, iterations=1, iterator=None, parallel_iterations=10):
    """ ...
    """
    ...
    with self.train_summary_writer.as_default(), \
         common.soft_device_placement(), \
         tf.compat.v2.summary.record_if(_summary_record_if), \
         self.strategy.scope():
      iterator = iterator or self._experience_iterator
      loss_info = self._train(tf.constant(iterations),
                              iterator,
                              parallel_iterations)
      logging.info("return back to run")
      train_step_val = self.train_step.numpy()
      for trigger in self.triggers:
        trigger(train_step_val)

      return loss_info

  @common.function(autograph=True)
  def _train(self, iterations, iterator, parallel_iterations):
    # ...
    logging.info("_train start")
    loss_info = self.single_train_step(iterator)
    for _ in tf.range(iterations - 1):
      tf.autograph.experimental.set_loop_options(
          parallel_iterations=parallel_iterations)
      loss_info = self.single_train_step(iterator)

    def _reduce_loss(loss):
        # ...

    # ...
    reduced_loss_info = tf.nest.map_structure(_reduce_loss, loss_info)
    logging.info("_train end")
    return reduced_loss_info

All the logs inside _train can be found, which suggests _train completed. However, execution never returned to the loss_info assignment in run:

      loss_info = self._train(tf.constant(iterations),
                              iterator,
                              parallel_iterations)
      logging.info("return back to run")

This means the log line above never gets printed. It's very weird. How could this happen?
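
Note that `_train` is wrapped in `@common.function`, so the Python-level `logging.info` calls inside it only run when the function is traced, not on every execution; seeing those logs does not necessarily mean `_train` finished running. One way to narrow this down (a debugging sketch, not a confirmed fix) is to force eager execution so that logging and hangs map to actual Python lines:

```python
import tensorflow as tf

# Forces tf.function-decorated code (including tf_agents' common.function)
# to run eagerly, so Python logging and breakpoints fire on every step and
# a hang can be localized to a specific line. Expect a large slowdown.
tf.config.run_functions_eagerly(True)
```

With this set before building the learner, calling `run` again should show exactly where execution stalls.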

ayamayaa commented 1 year ago

I came across the same issue. Did you find a solution by any chance? Thanks!

JIEEEN commented 10 months ago

I got the same issue. How could this happen?

xiaosimaqian commented 2 months ago

I got the same issue. Was it ever resolved? Thanks!

xiaosimaqian commented 2 months ago

@Rejuy May I ask if there has been any progress on this issue? Thanks a lot.