ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[CI] `linux://rllib:test_apex_ddpg` is failing/flaky on master. #29339

Closed · rickyyx closed this 1 year ago

rickyyx commented 1 year ago

.... Generated from flaky test tracker. Please do not edit the signature in this section. DataCaseName-linux://rllib:test_apex_ddpg-END ....

rickyyx commented 1 year ago

Marking it as a release blocker for now since it is showing up in the release branch CI as well: https://buildkite.com/ray-project/oss-ci-build-pr/builds/2138#0183d2a7-1715-49d7-afb9-1c1bae966d09

gjoliver commented 1 year ago

I looked at this test, and I don't know what we can do, tbh. When it passes, it passes. When it times out, pytest just hangs at "collecting tests ..." without any error logs.
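(For anyone trying to reproduce this locally, here is a minimal sketch that wraps the test run in a hard wall-clock limit so a hang at collection fails loudly instead of blocking. The test path and the timeout budget are assumptions, not taken from the CI config.)

```python
# Hedged local-repro sketch: assumed test path and timeout, adjust as needed.
# A hang at "collecting tests ..." surfaces as TimeoutExpired instead of
# blocking the run indefinitely.
import subprocess

try:
    subprocess.run(
        ["pytest", "-v", "rllib/algorithms/apex_ddpg/tests/test_apex_ddpg.py"],
        timeout=900,  # seconds; hypothetical budget
        check=True,
    )
except subprocess.TimeoutExpired:
    print("pytest hung, likely stuck at test collection")
```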

gjoliver commented 1 year ago

I am willing to declare this non-release-blocking.

rickyyx commented 1 year ago

@gjoliver Looks like there is some failure output for this test in this build; maybe it will help with root-causing.

gjoliver commented 1 year ago

I don't see anything 😢. Since this has been flaky for at least a quarter now, I am gonna say this is not release-blocking for now. Thanks, man.

rickyyx commented 1 year ago

Oh, did you see this?

Test failure:

```
=================================== FAILURES ===================================
____ TestApexDDPG.test_apex_ddpg_compilation_and_per_worker_epsilon_values _____

self =

    def test_apex_ddpg_compilation_and_per_worker_epsilon_values(self):
        """Test whether APEX-DDPG can be built on all frameworks."""
        config = (
            apex_ddpg.ApexDDPGConfig()
            .rollouts(num_rollout_workers=2)
            .reporting(min_sample_timesteps_per_iteration=100)
            .training(
                num_steps_sampled_before_learning_starts=0,
                optimizer={"num_replay_buffer_shards": 1},
            )
            .environment(env="Pendulum-v1")
        )

        num_iterations = 1

        for _ in framework_iterator(config, with_eager_tracing=True):
            trainer = config.build()

            # Test per-worker scale distribution.
            infos = trainer.workers.foreach_policy(
                lambda p, _: p.get_exploration_state()
            )
            scale = [i["cur_scale"] for i in infos]
            expected = [
                0.4 ** (1 + (i + 1) / float(config.num_workers - 1) * 7)
                for i in range(config.num_workers)
            ]
            check(scale, [0.0] + expected)

            for _ in range(num_iterations):
>               results = trainer.train()

rllib/algorithms/apex_ddpg/tests/test_apex_ddpg.py:51:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/ray/python/ray/tune/trainable/trainable.py:352: in train
    result = self.step()
/ray/python/ray/rllib/algorithms/algorithm.py:772: in step
    results, train_iter_ctx = self._run_one_training_iteration()
/ray/python/ray/rllib/algorithms/algorithm.py:2944: in _run_one_training_iteration
    results = self.training_step()
/ray/python/ray/rllib/algorithms/apex_ddpg/apex_ddpg.py:192: in training_step
    return ApexDQN.training_step(self)
/ray/python/ray/rllib/algorithms/apex_dqn/apex_dqn.py:457: in training_step
    return copy.deepcopy(self.learner_thread.learner_info)
/opt/miniconda/lib/python3.7/copy.py:150: in deepcopy
    y = copier(x, memo)
/opt/miniconda/lib/python3.7/copy.py:241: in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
/opt/miniconda/lib/python3.7/copy.py:184: in deepcopy
    memo[d] = y
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

signum = 15
frame =

    def sigterm_handler(signum, frame):
>       sys.exit(signum)
E       SystemExit: 15

/ray/python/ray/_private/worker.py:1628: SystemExit
```
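For context, the bottom of the traceback shows the mechanism at play: a SIGTERM handler that calls `sys.exit(signum)` turns the signal into `SystemExit: 15`, which then surfaces in whatever frame was executing when the signal arrived (here, the deepcopy of `learner_info`). Below is a minimal, self-contained sketch of that pattern; it is an illustration of the mechanism, not Ray's actual worker code.

```python
# Minimal sketch: a SIGTERM handler that calls sys.exit(signum), as in the
# traceback above. The SystemExit(15) propagates from whatever pure-Python
# frame happens to be executing when the signal is delivered.
# Illustration only, not Ray's actual worker code.
import copy
import os
import signal
import sys
import threading
import time


def sigterm_handler(signum, frame):
    sys.exit(signum)  # same pattern as shown at /ray/python/ray/_private/worker.py:1628


signal.signal(signal.SIGTERM, sigterm_handler)


def send_sigterm_soon():
    time.sleep(0.1)
    os.kill(os.getpid(), signal.SIGTERM)  # stand-in for whatever external sender kills the job


threading.Thread(target=send_sigterm_soon).start()

learner_info = {i: list(range(1000)) for i in range(1000)}  # stand-in payload
while True:
    copy.deepcopy(learner_info)  # SystemExit: 15 is raised inside this call
```

In other words, the `SystemExit: 15` in the `copy.py` frames most likely just marks where the worker happened to be when something external terminated it, rather than pointing at a bug in the deepcopy itself.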
gjoliver commented 1 year ago

Huh, thanks, I didn't notice it. Taking a look.

gjoliver commented 1 year ago

So the job received a SIGTERM, but from where ...
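One way to answer that question in a toy, single-threaded process is to block SIGTERM and wait for it with `signal.sigwaitinfo()`, which reports the sender's PID and UID. This is a Linux-only debugging sketch, not something wired into the test or Ray itself.

```python
# Debugging sketch (Linux-only, single-threaded toy process): find out who
# sends the SIGTERM by blocking it and reading the sender info directly.
import signal

signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
info = signal.sigwaitinfo({signal.SIGTERM})  # blocks until a SIGTERM arrives
print(f"SIGTERM received from pid={info.si_pid} uid={info.si_uid}")
```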

rickyyx commented 1 year ago

No longer relevant?