iffiX / machin

Reinforcement learning library (framework) designed for PyTorch; implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...
MIT License

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync' when running tutorials #17

Closed: MarWaltz closed this issue 3 years ago

MarWaltz commented 3 years ago

Hello, when I am trying to run a tutorial script, e.g. the your_first_program example, I always encounter this AttributeError during the imports:

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

However, I fulfill the listed requirements. Is there anything I am missing, or how can I solve this?

iffiX commented 3 years ago

Which version of torch and which OS platform are you using?

MarWaltz commented 3 years ago

My torch version is 1.8.1+cu111 and my OS is Windows 10, but I have already tried several different torch versions.

iffiX commented 3 years ago

Yeah, torch on Windows does not support rpc_sync, nor any distributed model that uses this function (IMPALA, A3C, etc.). So far I don't have a Windows platform to test on, so there might be some import errors. Could you please show the detailed error stack in Python?
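
A quick way to confirm this on a given machine is to check whether the installed torch build exposes the attribute at all. The following is only a minimal sketch: the helper name `rpc_attr_available` is hypothetical and not part of machin or torch, and it assumes nothing beyond the standard library plus an optional torch install.

```python
import importlib
import importlib.util


def rpc_attr_available(attr: str = "rpc_sync") -> bool:
    """Return True if torch.distributed.rpc imports and exposes `attr`.

    Hypothetical helper for illustration; not part of machin or torch.
    """
    # find_spec returns None when torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        rpc = importlib.import_module("torch.distributed.rpc")
    except Exception:
        # The import itself may fail on builds without distributed support.
        return False
    # On builds without RPC support the module imports but lacks the function.
    return hasattr(rpc, attr)
```

Calling `rpc_attr_available("rpc_sync")` before importing anything that depends on RPC would show whether the AttributeError is expected on that platform.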

MarWaltz commented 3 years ago

Of course, see below:

Traceback (most recent call last):
  File "c:.py", line 1, in <module>
    from machin.frame.algorithms import DQN
  File "C:...\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\__init__.py", line 1, in <module>
    from . import env, frame, model, parallel, utils
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\__init__.py", line 1, in <module>
    from . import utils, wrappers
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\wrappers\__init__.py", line 1, in <module>
    from . import base, openai_gym
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\wrappers\openai_gym.py", line 8, in <module>
    from machin.parallel.exception import ExceptionWithTraceback
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\__init__.py", line 2, in <module>
    from . import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
    from .world import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 535, in <module>
    class RpcGroup:
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 550, in RpcGroup
    @_copy_doc(rpc.rpc_sync)
AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

iffiX commented 3 years ago

Oh, that error is easy to fix. For now, as a temporary fix, you need to make the following changes in the file https://github.com/iffiX/machin/blob/master/machin/parallel/__init__.py (C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\__init__.py on your local system):

  1. Remove from . import distributed
  2. Remove "distributed" from __all__

The wrapper you are using does not depend on rpc functions.

Please notify me if any other import errors persist.

MarWaltz commented 3 years ago

I did make these changes, but unfortunately I still run into the following:

Traceback (most recent call last):
  File "c:\.py", line 1, in <module>
    from machin.frame.algorithms import DQN
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\__init__.py", line 1, in <module>
    from . import env, frame, model, parallel, utils
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\__init__.py", line 1, in <module>
    from . import utils, wrappers
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\wrappers\__init__.py", line 1, in <module>
    from . import base, openai_gym
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\env\wrappers\openai_gym.py", line 8, in <module>
    from machin.parallel.exception import ExceptionWithTraceback
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\__init__.py", line 2, in <module>
    from . import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\server\__init__.py", line 1, in <module>
    from . import ordered_server
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\server\ordered_server.py", line 5, in <module>
    from ..distributed import RpcGroup
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
    from .world import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 535, in <module>
    class RpcGroup:
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 550, in RpcGroup
    @_copy_doc(rpc.rpc_sync)
AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

iffiX commented 3 years ago

Oh, I forgot "server"; you also need to remove that. Sorry for the inconvenience.

MarWaltz commented 3 years ago

No worries. But still:

Traceback (most recent call last):
  File "c:\.py", line 1, in <module>
    from machin.frame.algorithms import DQN
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\__init__.py", line 1, in <module>
    from . import env, frame, model, parallel, utils
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\__init__.py", line 1, in <module>
    from . import algorithms, buffers, noise, transition
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\algorithms\__init__.py", line 3, in <module>
    from .dqn import DQN
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\algorithms\dqn.py", line 8, in <module>
    from machin.frame.buffers.buffer import Transition, Buffer
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\buffers\__init__.py", line 2, in <module>
    from .buffer_d import DistributedBuffer
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\buffers\buffer_d.py", line 5, in <module>
    from machin.parallel.distributed import RpcGroup
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
    from .world import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 535, in <module>
    class RpcGroup:
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 550, in RpcGroup
    @_copy_doc(rpc.rpc_sync)
AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

iffiX commented 3 years ago

OK, for these errors you need to change ImportError to Exception in these two files:

https://github.com/iffiX/machin/blob/master/machin/frame/algorithms/__init__.py
https://github.com/iffiX/machin/blob/master/machin/frame/buffers/__init__.py

This is necessary because the except ImportError clause there does not catch AttributeError.
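
To illustrate why the exception type matters, here is a small self-contained sketch. It does not use machin or torch: `rpc_stub` is a hypothetical stand-in for a `torch.distributed.rpc` module that lacks `rpc_sync`, and `guarded_import` is a made-up name for the try/except pattern under discussion.

```python
import types

# Hypothetical stand-in for torch.distributed.rpc on a build without RPC support.
rpc_stub = types.SimpleNamespace()


def load_distributed_part():
    # Touching a missing attribute raises AttributeError, not ImportError,
    # which mirrors what happens while the distributed module is being imported.
    return rpc_stub.rpc_sync


def guarded_import(exc_type):
    """Run the load inside try/except, catching only `exc_type`."""
    try:
        load_distributed_part()
        return "imported"
    except exc_type:
        return "fallback"


# Catching the broad Exception reaches the fallback branch.
print(guarded_import(Exception))

# Catching only ImportError lets the AttributeError propagate.
try:
    guarded_import(ImportError)
except AttributeError as err:
    print("not caught by ImportError:", err)
```

AttributeError and ImportError are unrelated subclasses of Exception, so only the broader clause reaches the fallback branch.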

MarWaltz commented 3 years ago

Okay thanks, I will have a look into it and come back to you tomorrow.

iffiX commented 3 years ago

No problem, I will correct these problems in my code now and try to find a Windows testing environment.

MarWaltz commented 3 years ago

Hello again, see below:

Traceback (most recent call last):
  File "c:..\Desktop\Forschung\RL\Implementations\PyTorch Templates\machin\CartPole-DQN.py", line 1, in <module>
    from machin.frame.algorithms import DQN
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\__init__.py", line 1, in <module>
    from . import env, frame, model, parallel, utils
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\__init__.py", line 1, in <module>
    from . import algorithms, buffers, helpers, noise, transition
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\algorithms\__init__.py", line 14, in <module>
    from .a3c import A3C
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\frame\algorithms\a3c.py", line 2, in <module>
    from machin.parallel.server import PushPullGradServer
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\server\__init__.py", line 1, in <module>
    from . import ordered_server
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\server\ordered_server.py", line 5, in <module>
    from ..distributed import RpcGroup, debug_with_process
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
    from .world import (
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 585, in <module>
    class RpcGroup:
  File "C:..\AppData\Local\Programs\Python\Python38\lib\site-packages\machin\parallel\distributed\world.py", line 600, in RpcGroup
    @_copy_doc(rpc.rpc_sync)
AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

iffiX commented 3 years ago

OK, now move from .a3c import A3C into the try/except block at https://github.com/iffiX/machin/blob/baa093d85cfc578815e0adc85084f14abdbbd87d/machin/frame/algorithms/__init__.py#L23, like this:

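import warnings  # needed for warnings.warn below

# Algorithms that rely on torch.distributed are imported together, so a
# failure in any of them falls back to None instead of breaking the package.
try:
    from .a3c import A3C
    from .apex import DQNApex, DDPGApex
    from .impala import IMPALA
    from .ars import ARS
except Exception:
    warnings.warn(
        "Failed to import algorithms relying on torch.distributed. Set them to None."
    )
    A3C = None
    DQNApex = None
    DDPGApex = None
    IMPALA = None
    ARS = None

MarWaltz commented 3 years ago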

Great job, this example works fine now! I will close this issue and open a new one if any further problems should occur.

Thanks again.

iffiX commented 3 years ago

OK, in the meantime I will add a quick fix for this once I get CircleCI working. :)

iffiX commented 3 years ago

After searching for a while, I cannot find a platform that offers reasonable time for my automated testing, and since a hybrid Jenkins-Windows-VM setup is too difficult to maintain, I will not consider Windows CI in the near future.

Instead, I will do one-time manual testing of future versions on request.

oneoneonecy commented 1 year ago

Can you help me with the error below?

@rpc.functions.async_execution
AttributeError: module 'torch.distributed.rpc' has no attribute 'functions'