ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] nightly test failure investigation #20087

Closed · fishbone closed this issue 3 years ago

fishbone commented 3 years ago

Test failures in the recent run (failure rate = failures / runs):

| Suite | Test | Failure rate | Failures | Runs |  |
|---|---|---|---|---|---|
| long_running_tests | apex | 0.25 | 1 | 4 | true |
| long_running_tests | many_drivers | 0.5 | 2 | 4 | true |
| long_running_tests | node_failures | 0.5 | 2 | 4 | true |
| long_running_tests | pbt | 0.6667 | 2 | 3 | true |
| nightly_tests | dask_on_ray_large_scale_test_no_spilling | 0.6667 | 4 | 6 | true |
| nightly_tests | dask_on_ray_large_scale_test_spilling | 0.7143 | 5 | 7 | true |
fishbone commented 3 years ago

https://github.com/ray-project/ray/pull/20076 should have fixed the following tests:

- many_drivers
- node_failures
- dask_on_ray_large_scale_test_no_spilling
- dask_on_ray_large_scale_test_spilling

fishbone commented 3 years ago

The pbt log looks weird:

2021-11-02 17:00:00,859 INFO worker.py:839 -- Connecting to existing Ray cluster at address: 172.31.66.49:6379
/home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
2021-11-02 17:00:01,761 WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.r
....
(PG pid=1033)     spaces=spaces,
(PG pid=1033)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 489, in _make_worker
(PG pid=1033)     spaces=spaces,
(PG pid=1033)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 568, in __init__
(PG pid=1033)     devices = get_tf_gpu_devices()
(PG pid=1033)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/tf_utils.py", line 76, in get_gpu_devices
(PG pid=1033)     devices = tf.config.experimental.list_physical_devices()
(PG pid=1033) AttributeError: 'NoneType' object has no attribute 'config'
Traceback (most recent call last):
  File "workloads/pbt.py", line 58, in <module>
    verbose=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 735, in run_experiments
    callbacks=callbacks).trials
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 627, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PG_CartPole-v0_fe39e_00000, PG_CartPole-v0_fe39e_00001, PG_CartPole-v0_fe39e_00002, PG_CartPole-v0_fe39e_00003, PG_CartPole-v0_fe39e_00004, PG_CartPole-v0_fe39e_00005, PG_CartPole-v0_fe39e_00006, PG_CartPole-v0_fe39e_00007])
2021-11-02 17:00:06,975 ERROR worker.py:1241 -- listen_error_messages_raylet: Connection closed by server.
2021-11-02 17:00:06,982 ERROR worker.py:473 -- print_logs: Connection closed by server.
2021-11-02 17:00:06,983 ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
[2021-11-02 17:10:06,295 C 715 956] gcs_server_address_updater.cc:67: Failed to receive the GCS address from the raylet for 600 times. Killing itself.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    std::_Function_handler<>::_M_invoke()
    ray::rpc::ClientCallImpl<>::OnReplyReceived()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    std::thread::_State_impl<>::_M_run()
    execute_native_thread_routine
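
The pbt failure itself is the `AttributeError: 'NoneType' object has no attribute 'config'` above: `get_gpu_devices()` dereferences `tf`, which is `None` because TensorFlow is not installed in the test image (the gcs_server_address_updater crash ten minutes later looks like a follow-on symptom once the driver has died). A minimal sketch of a defensive check, using RLlib's real `try_import_tf` helper; the wrapper function name here is hypothetical, not the actual `ray.rllib.utils.tf_utils` code:

```python
# Sketch: tolerate a missing TensorFlow install instead of raising
# AttributeError on `tf.config`. try_import_tf() returns None for `tf`
# when TensorFlow cannot be imported.
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


def list_gpus_if_tf_available():
    """Return TF-visible GPU devices, or [] when TensorFlow is absent."""
    if tf is None:
        return []
    return tf.config.experimental.list_physical_devices("GPU")


print(list_gpus_if_tf_available())
```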
fishbone commented 3 years ago

Apex log

2021-11-02 17:24:53,090 INFO worker.py:839 -- Connecting to existing Ray cluster at address: 172.31.127.231:6379
/home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
2021-11-02 17:24:53,925 WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.rllib.utils.torch_utils.[...]` instead. This will raise an error in the future!
(bundle_reservation_check_func pid=1255) /home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
(bundle_reservation_check_func pid=1255)   for external in metadata.entry_points().get(self.group, []):
(bundle_reservation_check_func pid=1255) 2021-11-02 17:24:56,169        WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.rllib.utils.torch_utils.[...]` instead. This will raise an error in the future!
(APEX pid=1255) 2021-11-02 17:24:56,573 INFO dqn.py:142 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(APEX pid=1255) 2021-11-02 17:24:56,573 INFO trainer.py:708 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=1286) /home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
(pid=1286)   for external in metadata.entry_points().get(self.group, []):
(pid=1287) /home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
(pid=1287)   for external in metadata.entry_points().get(self.group, []):
(pid=1288) /home/ray/anaconda3/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
(pid=1288)   for external in metadata.entry_points().get(self.group, []):
(pid=1286) 2021-11-02 17:24:58,060      WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.rllib.utils.torch_utils.[...]` instead. This will raise an error in the future!
(pid=1287) 2021-11-02 17:24:58,058      WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.rllib.utils.torch_utils.[...]` instead. This will raise an error in the future!
(pid=1288) 2021-11-02 17:24:58,039      WARNING deprecation.py:46 -- DeprecationWarning: `ray.rllib.utils.torch_ops.[...]` has been deprecated. Use `ray.rllib.utils.torch_utils.[...]` instead. This will raise an error in the future!
2021-11-02 17:24:58,384 ERROR trial_runner.py:947 -- Trial APEX_Pong-v0_77902_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 913, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 784, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1645, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::APEX.__init__() (pid=1255, ip=172.31.127.231)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 133, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 643, in __init__
    super().__init__(config, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 107, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 143, in setup
    super().setup(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 718, in setup
    self._init(self.config, self.env_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 172, in _init
    num_workers=self.config["num_workers"])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1523, in _make_workers
    logdir=self.logdir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 89, in __init__
    lambda p, pid: (pid, p.observation_space, p.action_space)))
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 235, in make
    return registry.make(id, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 129, in make
    env = spec.make(**kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 90, in make
    env = cls(**_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 123, in __init__
    self.seed()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 172, in seed
    f"We're Unable to find the game \"{self._game}\". Note: Gym no longer distributes ROMs. "
gym.error.Error: We're Unable to find the game "Pong". Note: Gym no longer distributes ROMs. If you own a license to use the necessary ROMs for research purposes you can download them via `pip install gym[accept-rom-license]`. Otherwise, you should try importing "Pong" via the command `ale-import-roms`. If you believe this is a mistake perhaps your copy of "Pong" is unsupported. To check if this is the case try providing the environment variable `PYTHONWARNINGS=default::ImportWarning:ale_py.roms`. For more information see: https://github.com/mgbellemare/Arcade-Learning-Environment#rom-management

During handling of the above exception, another exception occurred:

ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
    self.env = env_creator(copy.deepcopy(self.env_context))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/env/utils.py", line 54, in gym_env_creator
    raise EnvError(ERR_MSG_INVALID_ENV_DESCRIPTOR.format(env_descriptor))
ray.rllib.utils.error.EnvError: The env string you provided ('Pong-v0') is:
a) Not a supported/installed environment.
b) Not a tune-registered environment creator.
c) Not a valid env class string.

Try one of the following:
a) For Atari support: `pip install gym[atari] atari_py`.
   For VizDoom support: Install VizDoom
   (https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
   `pip install vizdoomgym`.
   For PyBullet support: `pip install pybullet`.
b) To register your custom env, do `from ray import tune;
   tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
   Then in your config, do `config['env'] = [name]`.
c) Make sure you provide a fully qualified classpath, e.g.:
   `ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`
(APEX pid=1255) 2021-11-02 17:24:58,352 ERROR worker.py:426 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::APEX.__init__() (pid=1255, ip=172.31.127.231)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 133, in __init__
(APEX pid=1255)     Trainer.__init__(self, config, env, logger_creator)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 643, in __init__
(APEX pid=1255)     super().__init__(config, logger_creator)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 107, in __init__
(APEX pid=1255)     self.setup(copy.deepcopy(self.config))
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 143, in setup
(APEX pid=1255)     super().setup(config)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 718, in setup
(APEX pid=1255)     self._init(self.config, self.env_creator)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 172, in _init
(APEX pid=1255)     num_workers=self.config["num_workers"])
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1523, in _make_workers
(APEX pid=1255)     logdir=self.logdir)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 89, in __init__
(APEX pid=1255)     lambda p, pid: (pid, p.observation_space, p.action_space)))
(APEX pid=1255) ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 235, in make
(APEX pid=1255)     return registry.make(id, **kwargs)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 129, in make
(APEX pid=1255)     env = spec.make(**kwargs)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 90, in make
(APEX pid=1255)     env = cls(**_kwargs)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 123, in __init__
(APEX pid=1255)     self.seed()
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 172, in seed
(APEX pid=1255)     f"We're Unable to find the game \"{self._game}\". Note: Gym no longer distributes ROMs. "
(APEX pid=1255) gym.error.Error: We're Unable to find the game "Pong". Note: Gym no longer distributes ROMs. If you own a license to use the necessary ROMs for research purposes you can download them via `pip install gym[accept-rom-license]`. Otherwise, you should try importing "Pong" via the command `ale-import-roms`. If you believe this is a mistake perhaps your copy of "Pong" is unsupported. To check if this is the case try providing the environment variable `PYTHONWARNINGS=default::ImportWarning:ale_py.roms`. For more information see: https://github.com/mgbellemare/Arcade-Learning-Environment#rom-management
(APEX pid=1255) 
(APEX pid=1255) During handling of the above exception, another exception occurred:
(APEX pid=1255) 
(APEX pid=1255) ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
(APEX pid=1255)     self.env = env_creator(copy.deepcopy(self.env_context))
(APEX pid=1255)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/env/utils.py", line 54, in gym_env_creator
(APEX pid=1255)     raise EnvError(ERR_MSG_INVALID_ENV_DESCRIPTOR.format(env_descriptor))
(APEX pid=1255) ray.rllib.utils.error.EnvError: The env string you provided ('Pong-v0') is:
(APEX pid=1255) a) Not a supported/installed environment.
(APEX pid=1255) b) Not a tune-registered environment creator.
(APEX pid=1255) c) Not a valid env class string.
(APEX pid=1255) 
(APEX pid=1255) Try one of the following:
(APEX pid=1255) a) For Atari support: `pip install gym[atari] atari_py`.
(APEX pid=1255)    For VizDoom support: Install VizDoom
(APEX pid=1255)    (https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
(APEX pid=1255)    `pip install vizdoomgym`.
(APEX pid=1255)    For PyBullet support: `pip install pybullet`.
(APEX pid=1255) b) To register your custom env, do `from ray import tune;
(APEX pid=1255)    tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
(APEX pid=1255)    Then in your config, do `config['env'] = [name]`.
(APEX pid=1255) c) Make sure you provide a fully qualified classpath, e.g.:
(APEX pid=1255)    `ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`
(RolloutWorker pid=1286) A.L.E: Arcade Learning Environment (version +978d2ce)
(RolloutWorker pid=1286) [Powered by Stella]
(RolloutWorker pid=1286) 2021-11-02 17:24:58,343        ERROR worker.py:426 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 235, in make
(RolloutWorker pid=1286)     return registry.make(id, **kwargs)
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 129, in make
(RolloutWorker pid=1286)     env = spec.make(**kwargs)
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/registration.py", line 90, in make
(RolloutWorker pid=1286)     env = cls(**_kwargs)
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 123, in __init__
(RolloutWorker pid=1286)     self.seed()
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/gym/envs/atari/environment.py", line 172, in seed
(RolloutWorker pid=1286)     f"We're Unable to find the game \"{self._game}\". Note: Gym no longer distributes ROMs. "
(RolloutWorker pid=1286) gym.error.Error: We're Unable to find the game "Pong". Note: Gym no longer distributes ROMs. If you own a license to use the necessary ROMs for research purposes you can download them via `pip install gym[accept-rom-license]`. Otherwise, you should try importing "Pong" via the command `ale-import-roms`. If you believe this is a mistake perhaps your copy of "Pong" is unsupported. To check if this is the case try providing the environment variable `PYTHONWARNINGS=default::ImportWarning:ale_py.roms`. For more information see: https://github.com/mgbellemare/Arcade-Learning-Environment#rom-management
(RolloutWorker pid=1286) 
(RolloutWorker pid=1286) During handling of the above exception, another exception occurred:
(RolloutWorker pid=1286) 
(RolloutWorker pid=1286) ray::RolloutWorker.__init__() (pid=1286, ip=172.31.127.231)
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
(RolloutWorker pid=1286)     self.env = env_creator(copy.deepcopy(self.env_context))
(RolloutWorker pid=1286)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/env/utils.py", line 54, in gym_env_creator
(RolloutWorker pid=1286)     raise EnvError(ERR_MSG_INVALID_ENV_DESCRIPTOR.format(env_descriptor))
(RolloutWorker pid=1286) ray.rllib.utils.error.EnvError: The env string you provided ('Pong-v0') is:
(RolloutWorker pid=1286) a) Not a supported/installed environment.
(RolloutWorker pid=1286) b) Not a tune-registered environment creator.
(RolloutWorker pid=1286) c) Not a valid env class string.
(RolloutWorker pid=1286) 
(RolloutWorker pid=1286) Try one of the following:
(RolloutWorker pid=1286) a) For Atari support: `pip install gym[atari] atari_py`.
(RolloutWorker pid=1286)    For VizDoom support: Install VizDoom
(RolloutWorker pid=1286)    (https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
(RolloutWorker pid=1286)    `pip install vizdoomgym`.
(RolloutWorker pid=1286)    For PyBullet support: `pip install pybullet`.
(RolloutWorker pid=1286) b) To register your custom env, do `from ray import tune;
(RolloutWorker pid=1286)    tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
(RolloutWorker pid=1286)    Then in your config, do `config['env'] = [name]`.
(RolloutWorker pid=1286) c) Make sure you provide a fully qualified classpath, e.g.:
(RolloutWorker pid=1286)    `ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`
(RolloutWorker pid=1287) and (RolloutWorker pid=1288) print the same A.L.E. banner and the same gym ROM / EnvError traceback as pid=1286 above (timestamps 17:24:58,379 and 17:24:58,356 respectively).
== Status ==
Current time: 2021-11-02 17:24:54 (running for 00:00:00.16)
Memory usage on this node: 1.9/30.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/63.21 GiB heap, 0.0/27.47 GiB objects
Result logdir: /home/ray/ray_results/apex
Number of trials: 1/1 (1 PENDING)
+--------------------------+----------+-------+
| Trial name               | status   | loc   |
|--------------------------+----------+-------|
| APEX_Pong-v0_77902_00000 | PENDING  |       |
+--------------------------+----------+-------+

Result for APEX_Pong-v0_77902_00000:
  trial_id: '77902_00000'

== Status ==
Current time: 2021-11-02 17:24:58 (running for 00:00:03.85)
Memory usage on this node: 2.4/30.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/63.21 GiB heap, 0.0/27.47 GiB objects
Result logdir: /home/ray/ray_results/apex
Number of trials: 1/1 (1 ERROR)
+--------------------------+----------+-------+
| Trial name               | status   | loc   |
|--------------------------+----------+-------|
| APEX_Pong-v0_77902_00000 | ERROR    |       |
+--------------------------+----------+-------+
Number of errored trials: 1
+--------------------------+--------------+-------------------------------------------------------------------------------------+
| Trial name               |   # failures | error file                                                                          |
|--------------------------+--------------+-------------------------------------------------------------------------------------|
| APEX_Pong-v0_77902_00000 |            1 | /home/ray/ray_results/apex/APEX_Pong-v0_77902_00000_0_2021-11-02_17-24-54/error.txt |
+--------------------------+--------------+-------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "workloads/apex.py", line 51, in <module>
    callbacks=[ProgressCallback()])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 735, in run_experiments
    callbacks=callbacks).trials
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 627, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [APEX_Pong-v0_77902_00000])
(APEX pid=1255) [2021-11-02 17:24:58,583 E 1255 1255] core_worker.cc:191: The core worker process is not initialized yet or already shutdown.
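
The apex root cause is the missing Atari ROM: newer gym/ale-py builds no longer bundle ROMs, so `gym.make("Pong-v0")` fails inside every RolloutWorker. A minimal sketch to verify ROM availability in the test image in isolation (assumes gym with Atari support and ale-py are installed; this is not part of the release-test workload itself):

```python
# Sketch: reproduce the RolloutWorker failure outside the release test by
# constructing the Pong environment directly.
import gym

try:
    env = gym.make("Pong-v0")
    print("Pong ROM available; action space:", env.action_space)
except gym.error.Error as exc:
    # Same error as in the log: Gym no longer distributes ROMs, so they must be
    # installed separately, e.g. `pip install "gym[accept-rom-license]"` or
    # imported via `ale-import-roms`, as the error message suggests.
    print("Pong ROM missing:", exc)
```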
rkooo567 commented 3 years ago

For Atari, it seems like there's an application-level change?

(RolloutWorker pid=1286) Try one of the following:
(RolloutWorker pid=1286) a) For Atari support: `pip install gym[atari] atari_py`.
(RolloutWorker pid=1286)    For VizDoom support: Install VizDoom
(RolloutWorker pid=1286)    (https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
(RolloutWorker pid=1286)    `pip install vizdoomgym`.
(RolloutWorker pid=1286)    For PyBullet support: `pip install pybullet`.
(RolloutWorker pid=1286) b) To register your custom env, do `from ray import tune;
(RolloutWorker pid=1286)    tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
(RolloutWorker pid=1286)    Then in your config, do `config['env'] = [name]`.
(RolloutWorker pid=1286) c) Make sure you provide a fully qualified classpath, e.g.:
(RolloutWorker pid=1286)    `ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`

Maybe we should just ask the ML team to look into it? (All core tests seem to pass now with your change.)

fishbone commented 3 years ago

@richardliaw could you help find the right person to take a look at this?

richardliaw commented 3 years ago

@gjoliver take a look?

gjoliver commented 3 years ago

Seems like problems with the environment setup: missing TensorFlow and Atari ROMs. Let me ping some folks and figure out what environments they use.
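
A quick way to confirm both suspected gaps in the cluster environment (a sketch; the module names are the standard PyPI ones, nothing specific to the nightly-test images):

```python
# Sketch: check whether TensorFlow and ale-py (Atari ROM support) are even
# importable in the environment the nightly tests run in.
import importlib.util

for mod in ("tensorflow", "ale_py"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'installed' if found else 'MISSING'}")
```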

fishbone commented 3 years ago

@gjoliver in case you need more info, here are the sessions:

https://beta.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_PkQvVV22pVuxWbnn9P1qEgBP?command-history-section=command_history

https://beta.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_hL48MS4GhQ3v7394DwCqy8fE?command-history-section=command_history

gjoliver commented 3 years ago

@iycheng sent you a PR