huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

Race in Scheduler. #124

Closed akamaus closed 2 years ago

akamaus commented 3 years ago

I have been experimenting with ScriptRunner and repeatedly stumbled upon errors like this one:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 176, in _monitor_thread
    worker_info_list = master._pop_all_finished_worker()
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 265, in _pop_all_finished_worker
    finished_train_worker_info = self._pop_finished_worker()
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 243, in _pop_finished_worker
    self.update_status()
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 212, in update_status
    t_pid, _ = self.md.result_queue_get()
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/distribution.py", line 131, in result_queue_get
    self.update_queues()
  File "/home/vyal/venvs/vega/lib/python3.7/site-packages/vega/core/scheduler/distribution.py", line 104, in update_queues
    for f in self.future_set:
RuntimeError: Set changed size during iteration

I believe there is a race in the scheduler implementation; my config triggers it because the time to evaluate a candidate sample is very small.
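The error class itself is easy to reproduce outside vega: resizing a set while iterating it raises the same RuntimeError. A minimal single-threaded sketch (not vega code; in the scheduler the resize presumably comes from another thread):

# Minimal sketch (not vega code): resizing a set mid-iteration raises the
# same error. In the scheduler, distribute() can resize future_set while
# update_queues() is still iterating it.
s = {1, 2, 3}
for x in s:
    s.add(x + 10)  # RuntimeError: Set changed size during iteration

For reference, here is my config: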

general:
    backend: pytorch
    parallel_search: True
    parallel_fully_train: False

pipeline: [hpo]

hpo:
    pipe_step:
        type: SearchPipeStep
    search_algorithm:
        type: RandomSearch 
        objective_keys: 'dist'

    search_space:
        type: SearchSpace
        hyperparameters:
            -   key: trainer.param_x
                type: float
                range: [-10, 10]
            -   key: trainer.param_y
                type: float
                range: [-10, 10]

    trainer:
        type: ScriptRunner
        script: "dummy.py"

and here is dummy.py:

import logging
import math
import numpy as np
from zeus.trainer.trial_agent import TrialAgent

logging.info("load trial")
trial = TrialAgent()

logging.info("create model")

logging.info(f"hps detected: {trial.hps}")
hps = trial.hps

# distance from the optimum at (pi, -pi), plus a little noise
dist = math.sqrt((hps['trainer']['param_x'] - 3.14159) ** 2 + (hps['trainer']['param_y'] + 3.14159) ** 2) + \
       np.random.randn() / hps['trainer'].get('epochs', 1)
accuracy = (20 - dist) / 20

logging.info(f"accuracy: {accuracy}")
result = trial.update(performance={"accuracy": accuracy, "dist": -dist})
logging.info(f"have sent to server, record: {result}")
zhangjiajin commented 3 years ago

@akamaus

This bug exists in distribution.py: future_set and process_queue can be mutated by one function while another is still iterating them. The solution is to guard both with a shared lock:

    def distribute(self, client, pid, func, kwargs):
        # Every mutation of future_set and process_queue happens under _queue_lock.
        with self._queue_lock:
            future = client.submit(func, **kwargs)
            f = (pid, future)
            self.future_set.add(f)
            self.process_queue.put(pid)
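For context, the consumer side (update_queues, which the traceback shows iterating self.future_set) must take the same lock, or iterate over a snapshot, so that distribute cannot resize future_set mid-iteration. A self-contained sketch of the pattern (illustrative only, ToyScheduler is not vega's actual code):

import threading
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

class ToyScheduler:
    # Illustrative sketch of the locking pattern, not vega's code.
    def __init__(self):
        self._queue_lock = threading.Lock()
        self.future_set = set()

    def distribute(self, executor, pid, func, **kwargs):
        # Producer side: mutate future_set only while holding the lock.
        with self._queue_lock:
            self.future_set.add((pid, executor.submit(func, **kwargs)))

    def pop_finished(self):
        # Consumer side: same lock, plus a list() snapshot for the iteration.
        with self._queue_lock:
            done = [(pid, f) for pid, f in list(self.future_set) if f.done()]
            self.future_set.difference_update(done)
            return [(pid, f.result()) for pid, f in done]

sched = ToyScheduler()
with ThreadPoolExecutor() as ex:
    sched.distribute(ex, 0, square, x=7)
print(sched.pop_finished())  # [(0, 49)]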

This bug has been fixed in the latest version. Please try it out.