huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

Run distributed task with existing Dask cluster #225

Closed Stasolet closed 2 years ago

Stasolet commented 2 years ago

Hi, I want to run the CARS example on an existing Dask cluster. I added the following to the CARS example config file:

general:
    parallel_search: True
    parallel_fully_train: True
    devices_per_trainer: 2
    backend: pytorch  # pytorch
    cluster:
        listen_port: 28500

And I got this output. I replaced some duplicated lines with "...".

2022-03-16 14:24:38.875 INFO ------------------------------------------------
2022-03-16 14:25:10.967 INFO ------------------------------------------------
2022-03-16 14:25:11.162 INFO   Step: nas
2022-03-16 14:25:11.162 INFO ------------------------------------------------
2022-03-16 14:25:17.972 INFO master ip and port: 127.0.0.1:28500
2022-03-16 14:25:18.230 INFO Initializing cluster. Please wait.
2022-03-16 14:25:18.483 INFO Reusing previous cluster:127.0.0.1:28500
/opt/conda/lib/python3.7/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39509 instead
  f"Port {expected} is already in use.\n"
distributed.diskutils - INFO - Found stale lock file and directory '/workspace/proj/vega_test/vega/dask-worker-space/worker-jebbaedu', purging
/opt/conda/lib/python3.7/contextlib.py:119: UserWarning: Creating scratch directories is taking a surprisingly long time. This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
...
/opt/conda/lib/python3.7/contextlib.py:119: UserWarning: Creating scratch directories is taking a surprisingly long time. This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
distributed.diskutils - INFO - Found stale lock file and directory '/workspace/proj/vega_test/vega/dask-worker-space/worker-misy1p53', purging
...
distributed.diskutils - INFO - Found stale lock file and directory '/workspace/proj/vega_test/vega/dask-worker-space/worker-hkfdgsei', purging
/opt/conda/lib/python3.7/contextlib.py:119: UserWarning: Creating scratch directories is taking a surprisingly long time. This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
2022-03-16 14:32:04.524 INFO Accessed Workers: 8
2022-03-16 14:32:05.153 INFO worker list: ['tcp://127.0.0.1:33309', 'tcp://127.0.0.1:37607', 'tcp://127.0.0.1:37905', 'tcp://127.0.0.1:38899', 'tcp://127.0.0.1:39767', 'tcp://127.0.0.1:42459', 'tcp://127.0.0.1:43207', 'tcp://127.0.0.1:45843']
2022-03-16 14:32:05.156 INFO Dask Server Start Success!
/opt/conda/lib/python3.7/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 32947 instead
  f"Port {expected} is already in use.\n"
/opt/conda/lib/python3.7/contextlib.py:119: UserWarning: Creating scratch directories is taking a surprisingly long time. This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
...
/opt/conda/lib/python3.7/contextlib.py:119: UserWarning: Creating scratch directories is taking a surprisingly long time. This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
ERROR:vega.core.pipeline.pipeline:Failed to run pipeline, message: '34251' is not in list
ERROR:vega.core.pipeline.pipeline:Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 84, in run
    pipestep = PipeStep(name=step_name)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/pipeline/search_pipe_step.py", line 45, in __init__
    self.master = create_master(update_func=self.generator.update)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/master_ops.py", line 44, in create_master
    __master_instance__ = Master(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 68, in __init__
    self._start_cluster()
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 101, in _start_cluster
    self.client.register_worker_plugin(plugin)
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 4688, in register_worker_plugin
    self._register_worker_plugin, plugin=plugin, name=name, nanny=nanny
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 310, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 376, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 349, in f
    result = yield future
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 4599, in _register_worker_plugin
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/worker_env.py", line 140, in setup
    self._set_visible_devices(worker)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/worker_env.py", line 77, in _set_visible_devices
    _index = self._get_device_index(worker)
  File "/opt/conda/lib/python3.7/site-packages/vega/core/scheduler/worker_env.py", line 129, in _get_device_index
    _index = ports_list[ip].index(port)
ValueError: '34251' is not in list
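
For context, the last frame of the traceback is a plain `list.index` lookup, which raises `ValueError` when the worker's port is missing from the list. The sketch below reproduces that behaviour with the worker addresses from the "worker list" log line above, none of which uses port 34251. How Vega actually builds `ports_list` is not shown in this thread, so `group_ports_by_ip` and `device_index` are hypothetical helpers written only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Worker addresses copied from the "worker list" log line above.
WORKERS = [
    "tcp://127.0.0.1:33309", "tcp://127.0.0.1:37607",
    "tcp://127.0.0.1:37905", "tcp://127.0.0.1:38899",
    "tcp://127.0.0.1:39767", "tcp://127.0.0.1:42459",
    "tcp://127.0.0.1:43207", "tcp://127.0.0.1:45843",
]


def group_ports_by_ip(addresses):
    """Map each IP to the list of its worker ports, stored as strings.

    Assumption: this only mimics the shape implied by
    `ports_list[ip].index(port)` in the traceback.
    """
    ports = defaultdict(list)
    for address in addresses:
        parsed = urlparse(address)
        ports[parsed.hostname].append(str(parsed.port))
    return dict(ports)


def device_index(ports_list, ip, port):
    """Return the position of `port` among the ports of `ip`, or None.

    Unlike the bare `.index(port)` call, an unknown IP or port does not
    raise; the caller decides how to handle a missing worker.
    """
    try:
        return ports_list[ip].index(port)
    except (KeyError, ValueError):
        return None


ports_list = group_ports_by_ip(WORKERS)
print(device_index(ports_list, "127.0.0.1", "33309"))
print(device_index(ports_list, "127.0.0.1", "34251"))
```

The second lookup finds nothing, which matches the symptom in the log: the worker reporting port 34251 is not one of the eight workers listed earlier.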

lib version

And if I try to run the examples from modnas, I get:

2022-03-18 20:17:06.402 ERROR Failed to run pipeline, message: id 'optim.file' not found in registry
2022-03-18 20:17:07.391 ERROR Traceback (most recent call last):
  File "/workspace/proj/vega_test/vega/vega/core/pipeline/pipeline.py", line 86, in run
    pipestep.do()
  File "/workspace/proj/vega_test/vega/vega/core/pipeline/search_pipe_step.py", line 61, in do
    self._dispatch_trainer(res)
  File "/workspace/proj/vega_test/vega/vega/core/pipeline/search_pipe_step.py", line 79, in _dispatch_trainer
    self.master.run(trainer, evaluator)
  File "/workspace/proj/vega_test/vega/vega/core/scheduler/master.py", line 165, in run
    "num_workers": len(workers)})
  File "/workspace/proj/vega_test/vega/vega/core/scheduler/distribution.py", line 185, in distribute
    future = client.submit(func, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 1752, in submit
    actors=actor,
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 2896, in _graph_to_futures
    "code": self._get_computation_code(),
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 2824, in _get_computation_code
    return inspect.getsource(fr)
  File "/opt/conda/lib/python3.7/inspect.py", line 973, in getsource
    lines, lnum = getsourcelines(object)
  File "/opt/conda/lib/python3.7/inspect.py", line 955, in getsourcelines
    lines, lnum = findsource(object)
  File "/opt/conda/lib/python3.7/inspect.py", line 780, in findsource
    module = getmodule(object, file)
  File "/opt/conda/lib/python3.7/inspect.py", line 733, in getmodule
    if ismodule(module) and hasattr(module, '__file__'):
  File "/workspace/proj/vega_test/vega/vega/algorithms/nas/modnas/registry/__init__.py", line 121, in __getattr__
    return self.get_builder(attr)
  File "/workspace/proj/vega_test/vega/vega/algorithms/nas/modnas/registry/__init__.py", line 42, in get_builder
    return registry.get(_reg_path, _reg_id)
  File "/workspace/proj/vega_test/vega/vega/algorithms/nas/modnas/registry/registry.py", line 54, in get
    raise ValueError('id \'{}\' not found in registry'.format(reg_id))
ValueError: id 'optim.file' not found in registry

zhangjiajin commented 2 years ago

@Stasolet

Currently, the CARS, DARTS, and MODNAS algorithms do not support parallel search. Parallel search is only available for sample-based algorithms, and every algorithm other than these three is sample-based. Sorry that you had to spend time tracking this down. You are great at finding bugs. We will update the documentation as soon as possible.
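
As a rough illustration of the workaround this implies, the sketch below turns `parallel_search` off in the `general` section when the algorithm is one of the three named above. It is only a sketch: `NON_PARALLEL_ALGORITHMS` and `disable_parallel_search` are hypothetical names, not part of Vega, and the algorithm labels are plain strings chosen for the example rather than Vega's registered class names.

```python
# Hypothetical labels for the three algorithms named in the reply above.
NON_PARALLEL_ALGORITHMS = {"CARS", "DARTS", "MODNAS"}


def disable_parallel_search(general, algorithm):
    """Return a copy of the `general` section that is safe for `algorithm`.

    For the algorithms that do not support parallel search, the
    `parallel_search` flag is forced to False; other keys are untouched.
    """
    updated = dict(general)
    if algorithm.upper() in NON_PARALLEL_ALGORITHMS:
        updated["parallel_search"] = False
    return updated


# The `general` section from the original post, written as a dict.
general = {
    "parallel_search": True,
    "parallel_fully_train": True,
    "devices_per_trainer": 2,
    "backend": "pytorch",
    "cluster": {"listen_port": 28500},
}

cars_general = disable_parallel_search(general, "cars")
print(cars_general["parallel_search"])
```

The helper returns a new dict rather than editing the input, so the same base config can still be reused for sample-based algorithms where parallel search applies.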