ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

OSS CI `test_object_assign_owner_client_mode` failure #39540

Closed: GeneDer closed this issue 6 months ago

GeneDer commented 1 year ago

2/6 :windows: Build & Test https://buildkite.com/ray-project/oss-ci-build-branch/builds/6072#018a7fbf-97ed-497a-9487-a9fd303144de/4526-7780

GeneDer commented 1 year ago

Related log: https://buildkite.com/ray-project/oss-ci-build-branch/builds/6072#018a7fbf-97ed-497a-9487-a9fd303144de/4272-4379


[Errno 2] No such file or directory: '::test_owner_assign_inner_object.txt'
--
  | FAILED
  |  
  | ================================== FAILURES ===================================
  | _______________________ test_owner_assign_inner_object ________________________
  |  
  | shutdown_only = None
  |  
  |     def test_owner_assign_inner_object(shutdown_only):
  |  
  |         ray.init()
  |  
  |         @ray.remote
  |         class Owner:
  |             def warmup(self):
  |                 pass
  |  
  |         @ray.remote
  |         def get_borrowed_object():
  |             ref = ray.put(("test_borrowed"))
  |             return [ref]
  |  
  |         owner = Owner.remote()
  |         ray.get(owner.warmup.remote())
  |  
  |         class OutObject:
  |             def __init__(self, owned_inner_ref, borrowed_inner_ref):
  |                 self.owned_inner_ref = owned_inner_ref
  |                 self.borrowed_inner_ref = borrowed_inner_ref
  |  
  |         owned_inner_ref = ray.put("test_owned")
  |  
  |         borrowed_inner_ref = ray.get(get_borrowed_object.remote())[0]
  |         out_ref = ray.put(OutObject(owned_inner_ref, borrowed_inner_ref), _owner=owner)
  |  
  |         # wait enough time to delete data when the reference count is lower
  |         # than expected
  |         del owned_inner_ref, borrowed_inner_ref
  |         time.sleep(10)
  |  
  |         assert ray.get(ray.get(out_ref).owned_inner_ref) == "test_owned"
  | >       assert ray.get(ray.get(out_ref).borrowed_inner_ref) == "test_borrowed"
  |  
  | \\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_af6igswj\runfiles\com_github_ray_project_ray\python\ray\tests\test_object_assign_owner.py:191:
  | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | c:\install\ray\python\ray\_private\auto_init_hook.py:24: in auto_init_wrapper
  |     return fn(*args, **kwargs)
  | c:\install\ray\python\ray\_private\auto_init_hook.py:24: in auto_init_wrapper
  |     return fn(*args, **kwargs)
  | c:\install\ray\python\ray\_private\client_mode_hook.py:102: in wrapper
  |     return getattr(ray, func.__name__)(*args, **kwargs)
  | c:\install\ray\python\ray\util\client\api.py:42: in get
  |     return self.worker.get(vals, timeout=timeout)
  | c:\install\ray\python\ray\util\client\worker.py:434: in get
  |     res = self._get(to_get, op_timeout)
  | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  |  
  | self = <ray.util.client.worker.Worker object at 0x0000021A6DBA1D00>
  | ref = [ClientObjectRef(000c6522fa08b41fffffffffffffffffffffffff0100000002e1f505)]
  | timeout = 2.0
  |  
  |     def _get(self, ref: List[ClientObjectRef], timeout: float):
  |         req = ray_client_pb2.GetRequest(ids=[r.id for r in ref], timeout=timeout)
  |         data = bytearray()
  |         try:
  |             resp = self._get_object_iterator(req, metadata=self.metadata)
  |             for chunk in resp:
  |                 if not chunk.valid:
  |                     try:
  |                         err = cloudpickle.loads(chunk.error)
  |                     except (pickle.UnpicklingError, TypeError):
  |                         logger.exception("Failed to deserialize {}".format(chunk.error))
  |                         raise
  | >                   raise err
  | E                   ValueError: ClientObjectRef b'\x00\x0ce"\xfa\x08\xb4\x1f\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\x00\x00\x00\x02\xe1\xf5\x05' is not found for client 341456133f5d44e9bfe6254632095c0d
  |  
  | c:\install\ray\python\ray\util\client\worker.py:462: ValueError
  | ============================== warnings summary ===============================
  | ::test_owner_assign_bug
  | C:\Miniconda3\lib\site-packages\_pytest\threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread ray_print_logs
  |  
  | Traceback (most recent call last):
  |   File "C:\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
  |     self.run()
  |   File "C:\Miniconda3\lib\threading.py", line 870, in run
  |     self._target(*self._args, **self._kwargs)
  |   File "c:\install\ray\python\ray\_private\worker.py", line 819, in print_logs
  |     global_worker_stdstream_dispatcher.emit(data)
  |   File "c:\install\ray\python\ray\_private\ray_logging.py", line 181, in emit
  |     handle(data)
  |   File "c:\install\ray\python\ray\_private\worker.py", line 1796, in print_to_stdstream
  |     print_worker_logs(batch, sink)
  |   File "c:\install\ray\python\ray\_private\worker.py", line 1954, in print_worker_logs
  |     for line in lines:
  |   File "c:\install\ray\python\ray\_private\worker.py", line 1832, in filter_autoscaler_events
  |     if is_autoscaler_v2():
  |   File "c:\install\ray\python\ray\autoscaler\v2\utils.py", line 547, in is_autoscaler_v2
  |     raise Exception(
  | Exception: GCS address could not be resolved (e.g. ray.init() not called)
  |  
  | warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
  |  
  | -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
  | =========================== short test summary info ===========================
  | FAILED ::test_owner_assign_inner_object - ValueError: ClientObjectRef b'\x00\...
  | ============= 1 failed, 1 passed, 6 skipped, 1 warning in 29.71s ==============
  | (16:45:48) FAIL: //python/ray/tests:test_object_assign_owner_client_mode (see C:/tmp/4lhdprva/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/tests/test_object_assign_owner_client_mode/test_attempts/attempt_1.log)

GeneDer commented 1 year ago

CC: @xieus @rkooo567 in case Jiajun is OOO, can you find an owner for this?

anyscalesam commented 6 months ago

@jjyao should this failing CI test be treated as a weekly-release-blocker? cc @can-anyscale

can-anyscale commented 6 months ago

This is a Windows test, so it is probably not a weekly release blocker.

mattip commented 6 months ago

The test is marked `skip_flaky_core_test_premerge`. After unskipping it, I could not reproduce the failure over many runs on HEAD, even after increasing the delay to `time.sleep(20)` to give the object more time to be (potentially) wrongly collected. The test was added in #30415 along with a fix for issue #30341. It was flaky on Windows, but I can't get it to fail. Maybe the timing of the call to `reference_counter_->AddNestedObjectIds` has changed for the better, so the objects are no longer being collected prematurely.
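
For readers unfamiliar with the failure mode discussed above, here is a minimal, stdlib-only sketch of why registering nested object IDs matters. It is a toy model, not Ray's reference counter: `ToyReferenceCounter`, `demo`, and the `register_nested` flag are hypothetical names invented for illustration. The only assumption taken from the thread is that an inner object can be collected once its local reference is deleted, unless the outer object that contains it has been recorded as holding a nested reference.

```python
from collections import defaultdict


class ToyReferenceCounter:
    """Toy model of nested reference counting (hypothetical, not Ray's code)."""

    def __init__(self):
        self.store = {}                      # object id -> value
        self.local_refs = defaultdict(int)   # object id -> live local handles
        self.contains = defaultdict(set)     # outer id -> inner ids it holds
        self.nested_in = defaultdict(set)    # inner id -> outer ids holding it

    def put(self, obj_id, value, nested_ids=(), register_nested=True):
        self.store[obj_id] = value
        self.local_refs[obj_id] += 1
        if register_nested:
            # Stand-in for recording nested IDs when the outer object is stored.
            for inner in nested_ids:
                self.contains[obj_id].add(inner)
                self.nested_in[inner].add(obj_id)

    def release(self, obj_id):
        """Drop one local handle, like `del ref` in the test."""
        self.local_refs[obj_id] -= 1
        self._maybe_collect(obj_id)

    def _maybe_collect(self, obj_id):
        if obj_id not in self.store:
            return
        if self.local_refs[obj_id] > 0 or self.nested_in[obj_id]:
            return  # still reachable directly or through an outer object
        del self.store[obj_id]
        # Collecting an outer object releases the inner objects it held.
        for inner in self.contains.pop(obj_id, set()):
            self.nested_in[inner].discard(obj_id)
            self._maybe_collect(inner)

    def get(self, obj_id):
        if obj_id not in self.store:
            raise ValueError(f"object {obj_id!r} is not found")
        return self.store[obj_id]


def demo(register_nested):
    """Mirror the shape of the failing test: put, nest, delete, then read back."""
    rc = ToyReferenceCounter()
    rc.put("inner", "test_borrowed")
    rc.put(
        "outer",
        {"borrowed_inner_ref": "inner"},
        nested_ids=["inner"],
        register_nested=register_nested,
    )
    rc.release("inner")  # analogous to `del borrowed_inner_ref`
    inner_id = rc.get("outer")["borrowed_inner_ref"]
    return rc.get(inner_id)
```

In this model, `demo(register_nested=True)` returns the inner value, while `demo(register_nested=False)` raises a `ValueError` because the inner object was collected as soon as its local handle was released. In a real distributed system the registration and the release can race, which is one plausible reason a failure like this would be timing-dependent and hard to reproduce.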

anyscalesam commented 6 months ago

@can-anyscale opening, re-opening, and closing Windows tickets is also enabled in the ci-bot for both CI and release tests, right? Can we automate creating these GitHub tickets as P2s, so that if we eventually want to look at CI test health for Windows holistically, we'll have that data? cc @jjyao