microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

view experiment failed #3543

Closed JianJuly closed 3 years ago

JianJuly commented 3 years ago

Environment:

Log message:

What issue did you meet, and what was expected?: When I run 'nnictl view n6XH4aFU' to view a stopped experiment, the command fails with "Restful server start failed!".

How to reproduce it?: nnictl view n6XH4aFU

Additional information:

INFO:  view experiment n6XH4aFU...
INFO:  Starting restful server...
ERROR: Restful server start failed!
INFO:   Stdout:
-----------------------------------------------------------------------
                Experiment start time 2021-04-18 21:32:09
-----------------------------------------------------------------------
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 09:50:47
-----------------------------------------------------------------------

INFO:   Stderr:
-----------------------------------------------------------------------
                Experiment start time 2021-04-18 21:32:09
-----------------------------------------------------------------------
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/json_tricks/nonp.py:225: JsonTricksDeprecation: `json_tricks.load(s)` stripped some comments, but `ignore_comments` was not passed; in the next major release, the behaviour when `ignore_comments` is not passed will change; it is recommended to explicitly pass `ignore_comments=True` if you want to strip comments; see https://github.com/mverleg/pyjson_tricks/issues/74
  JsonTricksDeprecation)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:242: RuntimeWarning: divide by zero encountered in true_divide
  return c - a / np.log(x)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/scipy/optimize/minpack.py:829: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:242: RuntimeWarning: divide by zero encountered in double_scalars
  return c - a / np.log(x)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:171: RuntimeWarning: invalid value encountered in power
  return c - (a*x+b)**-alpha
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:196: RuntimeWarning: invalid value encountered in power
  return alpha - (alpha - beta) / (1. + (kappa * x)**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:124: RuntimeWarning: invalid value encountered in double_scalars
  return (theta * x**eta) / (kappa**eta + x**eta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:267: RuntimeWarning: invalid value encountered in power
  return alpha - (alpha - beta) * np.exp(-(kappa * x)**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:291: RuntimeWarning: overflow encountered in exp
  return a - (a - beta) * np.exp(-k*x**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/model_factory.py:297: RuntimeWarning: invalid value encountered in true_divide
  alpha = np.minimum(1, self.target_distribution(new_values) / self.target_distribution(self.weight_samples))
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:220: RuntimeWarning: overflow encountered in exp
  return c - np.exp(-a*(x**alpha)+b)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:171: RuntimeWarning: invalid value encountered in double_scalars
  return c - (a*x+b)**-alpha
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: overflow encountered in true_divide
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: overflow encountered in exp
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: divide by zero encountered in power
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/model_factory.py:297: RuntimeWarning: divide by zero encountered in true_divide
  alpha = np.minimum(1, self.target_distribution(new_values) / self.target_distribution(self.weight_samples))
Error: Dispatcher stream error, tuner may have crashed.
    at EventEmitter.dispatcher.onError (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nnimanager.js:550:32)
    at EventEmitter.emit (events.js:198:13)
    at Socket.IpcInterface.outgoingStream.on (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/ipcInterface.js:42:72)
    at Socket.emit (events.js:198:13)
    at emitErrorNT (internal/streams/destroy.js:91:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
    at process._tickCallback (internal/process/next_tick.js:63:19)
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 09:50:47
-----------------------------------------------------------------------
Failed to create log dir: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert(fs.existsSync(dbDir))

    at SqlDB.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/sqlDatabase.js:72:9)
    at NNIDataStore.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nniDataStore.js:35:21)
    at initContainer (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:87:14)
    at utils_1.mkDirP.then (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:146:15)

J-shang commented 3 years ago

Hello @JianJuly, could #3495 serve as a workaround for your issue? If it does not work, please let us know. Viewing experiments launched from Python will be fully supported in NNI v2.2.

JianJuly commented 3 years ago

I found the metadata for n6XH4aFU in ~/nni-experiments/.experiment. It looks like this:

    "0QEwk2bg": {
        "id": "0QEwk2bg",
        "port": 8080,
        "startTime": 1618746646923,
        "endTime": "N/A",
        "status": "STOPPED",
        "platform": "local",
        "experimentName": "LN_MS",
        "tag": [],
        "pid": 34344,
        "webuiUrl": [
            "http://127.0.0.1:8080",
            "http://10.2.4.60:8080",
            "http://10.147.20.35:8080"
        ],
        "logDir": "/home/jianjunming/projects/CRLNMS_clf_AutoML/configs/../checkpoints/nni-experiments"
    },
    "n6XH4aFU": {
        "id": "n6XH4aFU",
        "port": 8080,
        "startTime": 1618813694427,
        "endTime": "N/A",
        "status": "INITIALIZED",
        "platform": "local",
        "experimentName": "LN_MS",
        "tag": [],
        "pid": 36252,
        "webuiUrl": [],
        "logDir": "/home/jianjunming/nni-experiments"
    }
}

Then I set the logDir of n6XH4aFU to the correct value, /home/jianjunming/projects/CRLNMS_clf_AutoML/configs/../checkpoints/nni-experiments, and saved the file.

I ran nnictl view n6XH4aFU again, and the same error occurred :(

Afterwards I checked ~/nni-experiments/.experiment and found that the logDir had been reset to /home/jianjunming/nni-experiments :(

So weird
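
For anyone who wants to check the same thing without opening the file by hand, here is a minimal sketch that lists each experiment's recorded logDir and whether its directory exists. It assumes the .experiment file is plain JSON mapping experiment IDs to metadata, as in the excerpt above, and that each experiment keeps its files under <logDir>/<id>. That layout is an assumption on my part, and the function names load_experiments and report_log_dirs are made up for illustration; they are not part of NNI.

```python
import json
from pathlib import Path


def load_experiments(metadata_path):
    """Return the experiment-ID -> metadata mapping stored in the file."""
    with open(metadata_path, encoding="utf-8") as f:
        return json.load(f)


def report_log_dirs(experiments):
    """Return one (id, status, logDir, directory_exists) tuple per experiment."""
    rows = []
    for exp_id, meta in experiments.items():
        log_dir = meta.get("logDir", "")
        # Assumption: the experiment's files live under <logDir>/<id>.
        exp_dir = Path(log_dir) / exp_id
        rows.append((exp_id, meta.get("status"), log_dir, exp_dir.is_dir()))
    return rows
```

Calling report_log_dirs(load_experiments(Path.home() / "nni-experiments" / ".experiment")) should show at a glance which entries point at a directory that is not there, which is what the assert(fs.existsSync(dbDir)) failure in the log suggests.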

Here is the terminal output:

INFO:  view experiment n6XH4aFU...
INFO:  Starting restful server...
ERROR: Restful server start failed!
INFO:   Stdout:
-----------------------------------------------------------------------
                Experiment start time 2021-04-18 21:32:09
-----------------------------------------------------------------------
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 09:50:47
-----------------------------------------------------------------------
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:18:58
-----------------------------------------------------------------------
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:20:57
-----------------------------------------------------------------------
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:28:14
-----------------------------------------------------------------------

INFO:   Stderr:
-----------------------------------------------------------------------
                Experiment start time 2021-04-18 21:32:09
-----------------------------------------------------------------------
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/json_tricks/nonp.py:225: JsonTricksDeprecation: `json_tricks.load(s)` stripped some comments, but `ignore_comments` was not passed; in the next major release, the behaviour when `ignore_comments` is not passed will change; it is recommended to explicitly pass `ignore_comments=True` if you want to strip comments; see https://github.com/mverleg/pyjson_tricks/issues/74
  JsonTricksDeprecation)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:242: RuntimeWarning: divide by zero encountered in true_divide
  return c - a / np.log(x)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/scipy/optimize/minpack.py:829: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:242: RuntimeWarning: divide by zero encountered in double_scalars
  return c - a / np.log(x)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:171: RuntimeWarning: invalid value encountered in power
  return c - (a*x+b)**-alpha
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:196: RuntimeWarning: invalid value encountered in power
  return alpha - (alpha - beta) / (1. + (kappa * x)**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:124: RuntimeWarning: invalid value encountered in double_scalars
  return (theta * x**eta) / (kappa**eta + x**eta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:267: RuntimeWarning: invalid value encountered in power
  return alpha - (alpha - beta) * np.exp(-(kappa * x)**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:291: RuntimeWarning: overflow encountered in exp
  return a - (a - beta) * np.exp(-k*x**delta)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/model_factory.py:297: RuntimeWarning: invalid value encountered in true_divide
  alpha = np.minimum(1, self.target_distribution(new_values) / self.target_distribution(self.weight_samples))
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:220: RuntimeWarning: overflow encountered in exp
  return c - np.exp(-a*(x**alpha)+b)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:171: RuntimeWarning: invalid value encountered in double_scalars
  return c - (a*x+b)**-alpha
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: overflow encountered in true_divide
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: overflow encountered in exp
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/curvefunctions.py:147: RuntimeWarning: divide by zero encountered in power
  return a/(1.+(x/np.exp(b))**c)
/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni/algorithms/hpo/curvefitting_assessor/model_factory.py:297: RuntimeWarning: divide by zero encountered in true_divide
  alpha = np.minimum(1, self.target_distribution(new_values) / self.target_distribution(self.weight_samples))
Error: Dispatcher stream error, tuner may have crashed.
    at EventEmitter.dispatcher.onError (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nnimanager.js:550:32)
    at EventEmitter.emit (events.js:198:13)
    at Socket.IpcInterface.outgoingStream.on (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/ipcInterface.js:42:72)
    at Socket.emit (events.js:198:13)
    at emitErrorNT (internal/streams/destroy.js:91:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
    at process._tickCallback (internal/process/next_tick.js:63:19)
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 09:50:47
-----------------------------------------------------------------------
Failed to create log dir: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert(fs.existsSync(dbDir))

    at SqlDB.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/sqlDatabase.js:72:9)
    at NNIDataStore.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nniDataStore.js:35:21)
    at initContainer (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:87:14)
    at utils_1.mkDirP.then (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:146:15)
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:18:58
-----------------------------------------------------------------------
Failed to create log dir: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert(fs.existsSync(dbDir))

    at SqlDB.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/sqlDatabase.js:72:9)
    at NNIDataStore.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nniDataStore.js:35:21)
    at initContainer (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:87:14)
    at utils_1.mkDirP.then (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:146:15)
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:20:57
-----------------------------------------------------------------------
Failed to create log dir: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert(fs.existsSync(dbDir))

    at SqlDB.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/sqlDatabase.js:72:9)
    at NNIDataStore.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nniDataStore.js:35:21)
    at initContainer (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:87:14)
    at utils_1.mkDirP.then (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:146:15)
-----------------------------------------------------------------------
                Experiment start time 2021-04-19 14:28:14
-----------------------------------------------------------------------
Failed to create log dir: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  assert(fs.existsSync(dbDir))

    at SqlDB.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/sqlDatabase.js:72:9)
    at NNIDataStore.init (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/core/nniDataStore.js:35:21)
    at initContainer (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:87:14)
    at utils_1.mkDirP.then (/home/jianjunming/anaconda3/envs/pytorch/lib/python3.6/site-packages/nni_node/main.js:146:15)

J-shang commented 3 years ago

Oh yes, this is a bug and it will be fixed in NNI v2.2; FYI, see #3545. For now, you can add

experiment_config['logDir'] = experiments_dict[args.id]['logDir']

at line 636 of ./site-packages/nni/tools/nnictl/launcher.py as a workaround.
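
To illustrate what that one line does, here is a standalone sketch. It does not reproduce launcher.py: restore_log_dir is a hypothetical helper, the two dictionaries are simplified stand-ins for the real objects, and the path is a placeholder. The idea is only that the logDir recorded for the experiment ID is copied into the config before the experiment is viewed, so the default directory is not used instead.

```python
def restore_log_dir(experiment_config, experiments_dict, experiment_id):
    """Return a copy of experiment_config with logDir taken from the metadata.

    Hypothetical helper for illustration; the real workaround is the single
    assignment quoted above, applied inside launcher.py.
    """
    patched = dict(experiment_config)
    patched["logDir"] = experiments_dict[experiment_id]["logDir"]
    return patched


# Simplified stand-ins for the recorded metadata and the loaded config.
experiments_dict = {"n6XH4aFU": {"logDir": "/path/to/checkpoints/nni-experiments"}}
experiment_config = {"experimentName": "LN_MS"}

patched = restore_log_dir(experiment_config, experiments_dict, "n6XH4aFU")
print(patched["logDir"])
```

The sketch returns a copy rather than mutating its argument so that it is easy to test in isolation; the in-place assignment in the workaround has the same effect on the config that is actually used.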

JianJuly commented 3 years ago

This works, thank you!!! Looking forward to v2.2.