ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Placement Group] bundle_reservation_check_func breaks loading code from local #15840

Open fyrestone opened 3 years ago

fyrestone commented 3 years ago

What is the problem?

2021-05-17 17:29:07,439 WARNING worker.py:1102 -- Traceback (most recent call last):
  File "/xxx/ray/python/ray/_private/function_manager.py", line 274, in _load_function_from_local
    object = getattr(object, part)
AttributeError: 'function' object has no attribute '<locals>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 594, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 399, in ray._raylet.execute_task
  File "/xxx/ray/python/ray/_private/function_manager.py", line 239, in get_execution_info
    self._load_function_from_local(job_id, function_descriptor)
  File "/xxx/ray/python/ray/_private/function_manager.py", line 284, in _load_function_from_local
    raise RuntimeError(f"Function {function_descriptor} failed "
RuntimeError: Function {type=PythonFunctionDescriptor, module_name=ray.util.placement_group, class_name=, function_name=_export_bundle_reservation_check_method_if_needed.<locals>.bundle_reservation_check_func, function_hash=bf25b020ba2fb69467437247487873d7f660dea6765ec07d9a0ff059} failed to be loaded from local code.
sys.path: ['', '/xxx/ray/python/ray/tests', '/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pycharm', '/xxx/ray/python/ray/thirdparty_files', '/xxx/ray/python/ray/pickle5_files', '/xxx/ray/python/ray/workers', '/xxx/ray/python', '/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pycharm', '/Users/yyy/.pyenv/versions/3.8.2/lib/python38.zip', '/Users/yyy/.pyenv/versions/3.8.2/lib/python3.8', '/Users/yyy/.pyenv/versions/3.8.2/lib/python3.8/lib-dynload', '/Users/yyy/.pyenv/versions/3.8.2/lib/python3.8/site-packages', '/xxx/work'], Error message: 'function' object has no attribute '<locals>'
An unexpected internal error occurred while the worker was executing a task.
2021-05-17 17:29:07,439 WARNING worker.py:1102 -- A worker died or was killed while executing task a67dc375e60ddd1affffffffffffffffffffffff01000000.
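
The first exception in the traceback can be reproduced without Ray. The sketch below assumes, based on the `getattr(object, part)` line in the traceback, that the local loader walks the dotted function name one attribute at a time. The names `load_by_dotted_name`, `outer`, `inner`, and `fake_module` are hypothetical and used only for illustration; this is not Ray's actual code.

```python
import types


def outer():
    # A nested function, analogous to bundle_reservation_check_func being
    # defined inside _export_bundle_reservation_check_method_if_needed.
    def inner():
        return "reserved"
    return inner


def load_by_dotted_name(module, dotted_name):
    # Hypothetical stand-in for a local loader: resolve each part with getattr.
    obj = module
    for part in dotted_name.split("."):
        obj = getattr(obj, part)
    return obj


# A throwaway module object that exposes only the top-level function.
fake_module = types.ModuleType("fake_module")
fake_module.outer = outer

fn = outer()
# Nested functions carry "<locals>" in their qualified name,
# e.g. "outer.<locals>.inner" when outer is defined at module top level.
qualname = fn.__qualname__

error_message = None
try:
    load_by_dotted_name(fake_module, qualname)
except AttributeError as exc:
    # "<locals>" is not a real attribute, so the walk cannot reach inner.
    error_message = str(exc)

print(qualname)
print(error_message)
```

A function defined inside another function exists only while the enclosing call runs, so no attribute path from the module can reach it. That matches the `<locals>` segment in the `function_name` reported in the RuntimeError.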

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.3.0

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

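import ray


def test_placement_group_local_code(ray_start_cluster):
    # ray_start_cluster is the cluster fixture from Ray's own test suite.
    cluster = ray_start_cluster
    num_nodes = 2
    for _ in range(num_nodes):
        cluster.add_node(num_cpus=4)
    # code_search_path is set so that functions are loaded from local code.
    ray.init(
        address=cluster.address,
        job_config=ray.job_config.JobConfig(code_search_path=[""]))
    placement_group = ray.util.placement_group(
        name="name",
        strategy="PACK",
        bundles=[
            {
                "CPU": 2,
                "GPU": 0  # Test 0 resource spec doesn't break tests.
            },
            {
                "CPU": 2
            }
        ])
    # Produces the traceback above: the worker cannot load the nested
    # bundle_reservation_check_func from local code.
    ray.get(placement_group.ready())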

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

fyrestone commented 3 years ago

Maybe we need a mixed mode that can load functions both from local code and via deserialization.
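
One way to read this suggestion is sketched below: try the local lookup first, and use an exported definition only when the lookup fails. This is a minimal sketch under stated assumptions, not Ray's implementation. `load_from_local`, `load_mixed`, and `fetch_exported` are hypothetical names, and `fetch_exported` stands in for whatever would fetch and deserialize an exported function.

```python
import types


def load_from_local(module, dotted_name):
    # Hypothetical local loader: resolve the dotted name part by part.
    obj = module
    for part in dotted_name.split("."):
        obj = getattr(obj, part)
    return obj


def load_mixed(module, dotted_name, fetch_exported):
    # Prefer local code; fall back to the exported definition on failure.
    try:
        return load_from_local(module, dotted_name)
    except AttributeError:
        fallback = fetch_exported(dotted_name)
        if fallback is None:
            raise  # Nothing was exported either, so surface the original error.
        return fallback


def top_level():
    return "local"


def make_nested():
    def nested():
        return "deserialized"
    return nested


fake_module = types.ModuleType("fake_module")
fake_module.top_level = top_level
fake_module.make_nested = make_nested

nested_fn = make_nested()
# A plain dict stands in for a store of exported (serialized) functions.
exported = {nested_fn.__qualname__: nested_fn}

local_result = load_mixed(fake_module, "top_level", exported.get)
nested_result = load_mixed(fake_module, nested_fn.__qualname__, exported.get)

print(local_result())
print(nested_result())
```

In this sketch, functions reachable by name still come from local code, while nested functions such as `bundle_reservation_check_func` would use the fallback path.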