Training threads don't start on Windows

donamin commented 7 years ago

Hi

I started training a few minutes ago and this is what I got in the command prompt:

E:\agents>python -m agents.scripts.train --logdir=E:\model --config=pendulum
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170918T084053-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.

It's been like this for about 10 minutes and TensorBoard doesn't show anything. In the log directory, there is only one file, called 'config.yaml'. Is that OK? It would be nice to be able to see whether the agent is progressing or has hung.

Thanks Amin

donamin commented 7 years ago

I changed update_every value from 25 to 30 to resolve this warning: Number of agents should divide episodes per update. But still it doesn't seem to be working.
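For reference, the warning is consistent with a simple divisibility check between two config values. The sketch below assumes the check amounts to `update_every % num_agents`; the helper name `agents_divide_episodes` is made up for illustration and is not part of the repository.

```python
def agents_divide_episodes(num_agents, update_every):
    """Return True if the agents split the episodes per update evenly.

    Assumption: the warning fires when update_every is not a multiple of
    num_agents. This helper is illustrative, not the library's code.
    """
    return update_every % num_agents == 0


num_agents = 10
for update_every in (25, 30):
    ok = agents_divide_episodes(num_agents, update_every)
    print(update_every, 'divides evenly' if ok else 'would trigger the warning')
```

With 10 agents, a value of 25 leaves a remainder while 30 does not, which matches the warning disappearing after the change.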

Weird thing is that sometimes when I run the code, I get the following exception:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

donamin commented 7 years ago

Update: When I change env_processes to False, it seems to be working! But I guess that disables all the parallelism this framework provides, right?

danijar commented 7 years ago

It could be normal that TensorBoard doesn't show anything for a while. The frequency for writing logs is defined inside _define_loop() in train.py. It is set to twice per epoch, where one training epoch is config.update_every * config.max_length steps and one evaluation epoch is config.eval_episodes * config.max_length steps. It could be that either your environment is very slow or that an epoch consists of a large number of steps for you.
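As a rough illustration of that arithmetic, the sketch below computes how many environment steps pass between two log writes. The value `max_length = 200` is an assumption, since the thread never states it, and the helper names are hypothetical.

```python
def steps_per_epoch(episodes, max_length):
    """Upper bound on environment steps in one epoch."""
    return episodes * max_length


def steps_between_logs(episodes, max_length, logs_per_epoch=2):
    """Steps that pass between two summary writes (twice per epoch)."""
    return steps_per_epoch(episodes, max_length) // logs_per_epoch


max_length = 200      # Assumed episode length limit; not stated in the thread.
update_every = 30     # Value reported in this thread.
eval_episodes = 25    # Value reported in this thread.

print('training steps between logs:',
      steps_between_logs(update_every, max_length))
print('evaluation steps between logs:',
      steps_between_logs(eval_episodes, max_length))
```

Under these assumptions a slow environment would need a few thousand steps before the first summary appears.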

What environment are you using and how long are episodes typically? Can you post your full config?

donamin commented 7 years ago

I worked on that and it seems there's some other problem with the code: Now it's showing this error:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

If I change env_processes to False, it works! Do you know what's the problem?

danijar commented 7 years ago

Please wrap code blocks in three backticks. Your configuration must be picklable and it looks like yours is not. Try to define it without using lambdas. As alternatives, define external functions, nested functions, or use functools.partial(). I need to see your configuration to help further.
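A minimal sketch of the difference, using only the standard library: a lambda created inside a function cannot be pickled, while functools.partial() around a module-level function can. The names `create_environment`, `can_pickle`, and `make_local_lambda` are hypothetical stand-ins and do not come from the repository.

```python
import functools
import pickle


def create_environment(name):
    """Hypothetical stand-in for a function that builds an environment."""
    return {'env': name}


def can_pickle(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_lambda():
    # A lambda defined inside a function has no importable name,
    # so pickle cannot store a reference to it.
    return lambda: create_environment('Pendulum-v0')


local_lambda = make_local_lambda()
partial_constructor = functools.partial(create_environment, 'Pendulum-v0')

print('local lambda picklable:', can_pickle(local_lambda))
print('functools.partial picklable:', can_pickle(partial_constructor))
```

Both callables build the same object when called; only the partial can be handed to a child process that has to receive its arguments through pickle.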

donamin commented 7 years ago

OK I got an update:

In train.py, I changed this line:

batch_env = utility.define_batch_env(lambda: _create_environment(config), config.num_agents, env_processes)

into this:

batch_env = utility.define_batch_env(_create_environment(config), config.num_agents, env_processes)

Now it doesn't give me the previous error, but it seems to freeze after showing this log:

INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170922-165119-pendulum.
[2017-09-22 16:51:19,149] Making new env: Pendulum-v0

The CPU usage of my Python process is 0%, so it doesn't seem to be doing anything. Any ideas?

This is my config:

def default():
  """Default configuration for PPO."""
  # General
  algorithm = ppo.PPOAlgorithm
  num_agents = 10
  eval_episodes = 25
  use_gpu = False
  # Network
  network = networks.ForwardGaussianPolicy
  weight_summaries = dict(all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*')
  policy_layers = 200, 100
  value_layers = 200, 100
  init_mean_factor = 0.05
  init_logstd = -1
  # Optimization
  update_every = 30
  policy_optimizer = 'AdamOptimizer'
  value_optimizer = 'AdamOptimizer'
  update_epochs_policy = 50
  update_epochs_value = 50
  policy_lr = 1e-4
  value_lr = 3e-4
  # Losses
  discount = 0.985
  kl_target = 1e-2
  kl_cutoff_factor = 2
  kl_cutoff_coef = 1000
  kl_init_penalty = 1
  return locals()

danijar commented 7 years ago

Where is the env defined in your config? You should not create the environments in the main process as you did by removing the lambda.

donamin commented 7 years ago

I thought that we give env as one of the main arguments in command prompt. So how should I create the environments? You mean I should change the default code structure so I can make the BatchPPO work?

danijar commented 7 years ago

No, I meant you should undo the change you made to the batch env line. You define environments in your config by setting env = ... to either the name of a registered Gym environment or to a function that returns an env object.
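A minimal sketch of what such a config could look like, assuming configs are plain functions that return a dictionary of settings. The function name `pendulum`, the trimmed-down `default()`, and the `max_length` value are assumptions for illustration, not the repository's actual definitions.

```python
def default():
    """Trimmed-down stand-in for the default PPO settings."""
    return dict(num_agents=10, eval_episodes=25, update_every=30)


def pendulum():
    """Hypothetical config that adds an environment to the defaults."""
    config = default()
    config.update(
        env='Pendulum-v0',  # Name of a registered Gym environment.
        max_length=200,     # Assumed episode length limit.
    )
    return config


print(pendulum()['env'])
```

The env entry could equally hold a function that returns an env object, as long as that function can be pickled when environments run in separate processes.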

donamin commented 7 years ago

Oh OK, I found out what I did wrong by removing the lambda keyword. But how can I solve this using external or nested functions? I did a lot of searching but couldn't figure it out since I'm kind of new to Python. Can you help me with this? How is it that it works on your computer and not on mine? Not being able to pickle lambda functions seems to be a Python feature, and I already tried Python 3.5 and 3.6.

danijar commented 7 years ago

I've seen it working on many people's computers :)

Please check if YAML is installed:

python3 -c "import ruamel.yaml; print('success')"

And check if the Pendulum environment works:

python3 -c "import gym; e=gym.make('Pendulum-v0'); e.reset(); e.render(); input('success')"

If both work, please start from a fresh clone of this repository and report your error message again.

donamin commented 7 years ago

Thanks for your reply.

I tried both tests with success.

I cloned the repository again and the code still doesn't work. It no longer shows the lambda error, but it hangs when it reaches this line of code in wrappers.py: self._process.start()

When I use the debugger, stepping into the start function eventually takes me to this line in context.py (the code hangs when it reaches this line): from .popen_spawn_win32 import Popen

BTW, I'm using Windows 10. Maybe it has something to do with the OS?

danijar commented 7 years ago

Yea, that might be the problem. Processing is quite different between Windows and Linux/Mac and we mainly tested on the latter. I'm afraid I can't be of much help since I don't use Windows. Do you have an idea how to debug this? I'd be happy to test and merge a fix if you come up with one.
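One concrete difference is the process start method: Windows only offers spawn, which launches a fresh interpreter and pickles the Process object for it, whereas Linux has traditionally defaulted to fork, which copies the parent process. A quick way to check what a given machine uses, with standard library calls only:

```python
import multiprocessing

# 'spawn' starts a fresh interpreter and pickles the Process object for it,
# while 'fork' copies the parent process and needs no pickling.
print('default start method:', multiprocessing.get_start_method())
print('available start methods:', multiprocessing.get_all_start_methods())
```

Code that passes unpicklable objects to a child process can therefore work under fork and fail under spawn.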

donamin commented 7 years ago

OK, thanks for your reply. I have no idea right now, but I will work on it because it's kind of important for me to make it work on Windows. I'll let you know if it's solved. Thanks :)

danijar commented 7 years ago

@donamin Were you able to narrow down this issue?

donamin commented 7 years ago

@danijar No, I couldn't solve it, so I had to switch to Linux. Sorry.

danijar commented 7 years ago

Thanks for getting back. I'll keep this issue open for now. We might support Windows in the future since as far as I can see the threading is the only platform-specific bit. But unfortunately, there are no concrete plans for this at the moment.

erwincoumans commented 6 years ago

It seems you cannot use the _worker class method as the target of multiprocessing.Process on Windows. If you use a global function, def globalworker(constructor, conn):, it does not hang. But then it cannot use getattr. Is there a way to rewrite _worker to be a globalworker?

self._process = multiprocessing.Process(
    target=globalworker, args=(constructor, conn))

danijar commented 5 years ago

@erwincoumans Yes, this seems trivial since self._worker() does not access any object state. You'd just have to replace the occurrences of self with ExternalProcess. I'd be happy to accept a patch if this indeed fixes the behavior on Windows. I don't have a way to test on Windows myself.
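A minimal sketch of the module-level worker idea, not the repository's code: the process target is a plain function, so a spawned child only has to import it by name. The names `global_worker`, `call_in_process`, and `Counter` are hypothetical, and the message format (a name plus arguments, with None as a stop signal) is an assumption made for this example.

```python
import functools
import multiprocessing


class Counter:
    """Hypothetical stand-in for an environment object."""

    def __init__(self, start):
        self.value = start

    def step(self, amount):
        self.value += amount
        return self.value


def global_worker(constructor, conn):
    """Module-level process target: build the object, then serve calls."""
    obj = constructor()
    while True:
        message = conn.recv()
        if message is None:  # Sentinel that asks the worker to stop.
            break
        name, args = message
        conn.send(getattr(obj, name)(*args))
    conn.close()


def call_in_process(constructor, name, *args):
    """Run one method call in a child process and return its result."""
    parent_conn, child_conn = multiprocessing.Pipe()
    process = multiprocessing.Process(
        target=global_worker, args=(constructor, child_conn))
    process.start()
    parent_conn.send((name, args))
    result = parent_conn.recv()
    parent_conn.send(None)
    process.join()
    return result


if __name__ == '__main__':
    # The guard matters on Windows, where children re-import this module.
    print(call_in_process(functools.partial(Counter, 5), 'step', 3))
```

The constructor is passed as a functools.partial of a module-level class so that it can be pickled, and the built-in getattr remains available inside the module-level function.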

google-research / batch-ppo

Training threads don't start on Windows #3