microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Having trouble using NNI with AdaptDL on K8s (on-premise) #4610

Closed juniroc closed 2 years ago

juniroc commented 2 years ago

When I run NNI with AdaptDL on K8s (on-premise), the trial pod is created and the model trains fine, as shown below.

image

I then checked the web UI. It shows the hyper-parameters and the duration:

image

image

But it doesn't show the metric or the intermediate results:

image

image

This is the result-reporting part of mnist.py:

def test(...):
    ...

    logger.info('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset), accuracy))

    result_dict={'default': accuracy, 'accuracy': accuracy}

    print("==========@@@@@@@@@@@=================")
    print(os.system("echo $NNI_PLATFORM"))
    print(result_dict)
    os.environ["NNI_PLATFORM"] = "local"
    print("-------reuse_mode-------")
    print(os.system("echo $REUSE_MODE"))
    os.environ["REUSE_MODE"] = "False"
    print("----- change -----")
    print(os.system("echo $REUSE_MODE"))
    print("-------NNI_PLATFORM---------")
    print(os.system("echo $NNI_PLATFORM"))
    nni.report_intermediate_result(accuracy)

    return accuracy
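
A note on the debug prints above: `os.system` returns the command's exit status, not its output. So `print(os.system("echo $NNI_PLATFORM"))` lets the shell echo the variable and then prints `0`, which would explain the `0` lines in the NNIManager log further down. Below is a minimal sketch of reading the variables directly instead; `snapshot_env` is a hypothetical helper name used only for illustration and is not part of NNI.

```python
import os


def snapshot_env(names):
    """Return the current value of each environment variable (None if unset)."""
    # os.environ.get reads the value in-process, so no shell is spawned
    # and no exit code is mixed into the output.
    return {name: os.environ.get(name) for name in names}
```

Calling `print(snapshot_env(["NNI_PLATFORM", "REUSE_MODE"]))` inside `test` would show both values on one line.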

These are the dispatcher logs:

Dispatcher log

[2022-03-03 17:35:26] DEBUG (nni.main/MainThread) START
[2022-03-03 17:35:26] DEBUG (nni.main/MainThread) decoded exp_params: [{"authorName":"default","experimentName":"example_mnist_pytorch","trialConcurrency":1,"maxExecDuration":3600,"maxExperimentDuration":"3600s","maxTrialNum":10,"maxTrialDuration":"999d","maxTrialNumber":10,"trainingServicePlatform":"adl","trainingService":{"platform":"adl"},"nniManagerIp":"210.114.89.130","tuner":{"builtinTunerName":"TPE","classArgs":{"optimize_mode":"maximize"}},"versionCheck":false,"logCollection":"http","clusterMetaData":[{"key":"trial_config","value":{"image":"zerooneai/mnist-nni:t0.0.9","command":"python3 ./mnist.py","codeDir":".","gpuNum":0,"cpuNum":3,"checkpoint":{"storageClass":"nfs-storageclass","storageSize":"10Gi"}}}]}]
[2022-03-03 17:35:26] DEBUG (nni.main/MainThread) exp_params json obj: [{
    "authorName": "default",
    "experimentName": "example_mnist_pytorch",
    "trialConcurrency": 1,
    "maxExecDuration": 3600,
    "maxExperimentDuration": "3600s",
    "maxTrialNum": 10,
    "maxTrialDuration": "999d",
    "maxTrialNumber": 10,
    "trainingServicePlatform": "adl",
    "trainingService": {
        "platform": "adl"
    },
    "nniManagerIp": "210.114.89.130",
    "tuner": {
        "builtinTunerName": "TPE",
        "classArgs": {
            "optimize_mode": "maximize"
        }
    },
    "versionCheck": false,
    "logCollection": "http",
    "clusterMetaData": [
        {
            "key": "trial_config",
            "value": {
                "image": "zerooneai/mnist-nni:t0.0.9",
                "command": "python3 ./mnist.py",
                "codeDir": ".",
                "gpuNum": 0,
                "cpuNum": 3,
                "checkpoint": {
                    "storageClass": "nfs-storageclass",
                    "storageSize": "10Gi"
                }
            }
        }
    ]
}]
[2022-03-03 17:35:26] INFO (nni.tuner.tpe/MainThread) Using random seed 1459953071
[2022-03-03 17:35:26] DEBUG (nni.runtime.msg_dispatcher/MainThread) Assessor is not configured
[2022-03-03 17:35:26] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'IN00000000000221']
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, data: [{"batch_size":{"_type":"choice","_value":[16,32,64,128]},"hidden_size":{"_type":"choice","_value":[128,256,512,1024]},"lr":{"_type":"choice","_value":[0.0001,0.001,0.01,0.1]},"momentum":{"_type":"uniform","_value":[0,1]}}]
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'PI00000000000000']
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, data: []
[2022-03-03 17:35:26] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.Initialize], data: [OrderedDict([('batch_size', OrderedDict([('_type', 'choice'), ('_value', [16, 32, 64, 128])])), ('hidden_size', OrderedDict([('_type', 'choice'), ('_value', [128, 256, 512, 1024])])), ('lr', OrderedDict([('_type', 'choice'), ('_value', [0.0001, 0.001, 0.01, 0.1])])), ('momentum', OrderedDict([('_type', 'uniform'), ('_value', [0, 1])]))])]
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/Thread-1) Sending command, data: [b'ID00000000000000']
[2022-03-03 17:35:26] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.Ping], data: []
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'GE00000000000001']
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/MainThread) Received command, data: [1]
[2022-03-03 17:35:26] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.RequestTrialJobs], data: [1]
[2022-03-03 17:35:26] DEBUG (nni.runtime.msg_dispatcher/Thread-1) requesting for generating params of [0]
[2022-03-03 17:35:26] DEBUG (nni.tuner/Thread-1) generating param for 0
[2022-03-03 17:35:26] DEBUG (nni.runtime.protocol/Thread-1) Sending command, data: [b'TR00000000000172{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 256, "lr": 0.01, "momentum": 0.6927792816192837}, "parameter_index": 0}']
[2022-03-03 17:35:31] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'PI00000000000000']
[2022-03-03 17:35:31] DEBUG (nni.runtime.protocol/MainThread) Received command, data: []
[2022-03-03 17:35:31] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.Ping], data: []
[2022-03-03 17:35:36] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'PI00000000000000']
[2022-03-03 17:35:36] DEBUG (nni.runtime.protocol/MainThread) Received command, data: []
[2022-03-03 17:35:36] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.Ping], data: []
[2022-03-03 17:38:06] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'EN00000000000252']
[2022-03-03 17:38:06] DEBUG (nni.runtime.protocol/MainThread) Received command, data: [{"trial_job_id":"ZfUOJ","event":"SUCCEEDED","hyper_params":"{\"parameter_id\": 0, \"parameter_source\": \"algorithm\", \"parameters\": {\"batch_size\": 32, \"hidden_size\": 256, \"lr\": 0.01, \"momentum\": 0.6927792816192837}, \"parameter_index\": 0}"}]
[2022-03-03 17:38:06] DEBUG (nni.runtime.msg_dispatcher_base/Thread-2) process_command: command: [CommandType.TrialEnd], data: [OrderedDict([('trial_job_id', 'ZfUOJ'), ('event', 'SUCCEEDED'), ('hyper_params', '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 256, "lr": 0.01, "momentum": 0.6927792816192837}, "parameter_index": 0}')])]
[2022-03-03 17:38:06] DEBUG (nni.runtime.protocol/MainThread) Received command, header: [b'GE00000000000001']
[2022-03-03 17:38:06] DEBUG (nni.runtime.protocol/MainThread) Received command, data: [1]
[2022-03-03 17:38:06] DEBUG (nni.runtime.msg_dispatcher_base/Thread-1) process_command: command: [CommandType.RequestTrialJobs], data: [1]
[2022-03-03 17:38:06] DEBUG (nni.runtime.msg_dispatcher/Thread-1) requesting for generating params of [1]
[2022-03-03 17:38:06] DEBUG (nni.tuner/Thread-1) generating param for 1
[2022-03-03 17:38:06] DEBUG (nni.runtime.protocol/Thread-1) Sending command, data: [b'TR00000000000173{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 1024, "lr": 0.01, "momentum": 0.9788903515270269}, "parameter_index": 0}']

...
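
For readers following the dispatcher log: judging only from the lines above, each command appears to be framed as a two-character code (`IN`, `PI`, `GE`, `EN`, `TR`, `ID`), followed by a 14-digit zero-padded payload length, followed by the payload itself. The sketch below encodes and decodes frames under that assumption. The layout is inferred from the log rather than taken from NNI's source, and `encode_frame` / `decode_frame` are hypothetical names.

```python
HEADER_SIZE = 16  # 2-character command code + 14-digit length (inferred from the log)


def encode_frame(command: str, payload: bytes = b"") -> bytes:
    """Build a frame: command code, zero-padded payload length, payload."""
    if len(command) != 2:
        raise ValueError("command code must be exactly two characters")
    return command.encode("ascii") + b"%014d" % len(payload) + payload


def decode_frame(frame: bytes):
    """Split a frame into (command code, payload bytes)."""
    if len(frame) < HEADER_SIZE:
        raise ValueError("frame shorter than the 16-byte header")
    command = frame[:2].decode("ascii")
    length = int(frame[2:HEADER_SIZE])
    payload = frame[HEADER_SIZE:HEADER_SIZE + length]
    if len(payload) != length:
        raise ValueError("payload shorter than the declared length")
    return command, payload
```

Under this reading, the `PI` header with an all-zero length is a ping with an empty payload, and the `GE` header with length 1 carries the single-character payload `1`.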

This is the NNIManager log:

NNIManager log

...

  msg: '==========@@@@@@@@@@@================='
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: 'adl' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '0' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: "{'default': 98.3, 'accuracy': 98.3}"
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: '-------reuse_mode-------'
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '0' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '----- change -----' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: 'False' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '0' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: '-------NNI_PLATFORM---------'
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: 'local' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: { tag: 'trial', stdOutputType: 'Stdout', msg: '0' }
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: `NNISDK_MEb'{"parameter_id": 0, "trial_job_id": "ZfUOJ", "type": "PERIODICAL", "sequence": 0, "value": "98.3"}'`
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: `NNISDK_MEb'{"parameter_id": 0, "trial_job_id": "ZfUOJ", "type": "PERIODICAL", "sequence": 1, "value": "98.3"}'`
}
[2022-03-03 17:35:53] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
  msg: 'test_acc_intermediate 98.3'
}
[2022-03-03 17:35:56] DEBUG (IpcInterface) ipcInterface command type: [PI], content:[]
[2022-03-03 17:36:01] DEBUG (IpcInterface) ipcInterface command type: [PI], content:[]
[2022-03-03 17:36:01] DEBUG (NNIRestHandler) GET: /experiment-metadata: body: {}
[2022-03-03 17:36:01] DEBUG (NNIRestHandler) GET: /experiment: body: {}
[2022-03-03 17:36:01] DEBUG (NNIRestHandler) GET: /metric-data: body: {}
[2022-03-03 17:36:01] DEBUG (NNIRestHandler) GET: /trial-jobs: body: {}
[2022-03-03 17:36:01] DEBUG (NNIDataStore) getTrialJobsByReplayEvents begin
[2022-03-03 17:36:01] DEBUG (NNIDataStore) getTrialJobsByReplayEvents done
[2022-03-03 17:36:01] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2022-03-03 17:36:05] INFO (RestServer) POST: /stdout/3iAEzYZS/ZfUOJ: body: {
  tag: 'trial',
  stdOutputType: 'Stdout',
...
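
The log above shows that the metric lines do reach NNIManager as ordinary stdout messages prefixed with `NNISDK_ME`. To check such lines by hand, a small parser like the following can pull the metric record out of a `msg` string. This is only a debugging sketch: `extract_metric` is a hypothetical helper, and the regular expression is written against the log lines in this issue, not against NNI's own parsing code.

```python
import json
import re

# Matches the NNISDK_MEb'...' wrapper seen in the stdout messages.
_METRIC_LINE = re.compile(r"NNISDK_MEb'(?P<body>.*)'")


def extract_metric(line: str):
    """Return the metric record found in a stdout line, or None if there is none."""
    match = _METRIC_LINE.search(line)
    if match is None:
        return None
    record = json.loads(match.group("body"))
    # In the log the 'value' field is itself a serialized string ("98.3"),
    # so it is decoded a second time when it is a string.
    if isinstance(record.get("value"), str):
        record["value"] = json.loads(record["value"])
    return record
```

Lines without the prefix, such as `test_acc_intermediate 98.3`, return `None`.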

I printed some environment variables and the result passed to the report_intermediate_result function.

My guess is that the problem is in the report_intermediate_result(metric) function or in platform.send_metric, but I'm not sure.

report_intermediate_result() function

def report_intermediate_result(metric):
    """
    Reports intermediate result to NNI.
    Parameters
    ----------
    metric:
        serializable object.
    """
    global _intermediate_seq
    assert _params or trial_env_vars.NNI_PLATFORM is None, \
        'nni.get_next_parameter() needs to be called before report_intermediate_result'
    metric = dump({
        'parameter_id': _params['parameter_id'] if _params else None,
        'trial_job_id': trial_env_vars.NNI_TRIAL_JOB_ID,
        'type': 'PERIODICAL',
        'sequence': _intermediate_seq,
        'value': dump(metric)
    })
    _intermediate_seq += 1
    platform.send_metric(metric)
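
To make the payload shape concrete, the sketch below rebuilds the same record outside NNI. It assumes the standard library's `json.dumps` as a stand-in for NNI's `dump` (which may differ for non-trivial objects), and `build_metric_payload` is a hypothetical name. Note that the value is serialized twice, which is why it appears as the string `"98.3"` in the NNIManager log.

```python
import json


def build_metric_payload(value, *, parameter_id, trial_job_id, sequence,
                         metric_type="PERIODICAL"):
    """Serialize a metric record in the shape used by report_intermediate_result."""
    return json.dumps({
        "parameter_id": parameter_id,
        "trial_job_id": trial_job_id,
        "type": metric_type,
        "sequence": sequence,
        # The metric itself is serialized first, then embedded as a string.
        "value": json.dumps(value),
    })
```

With `value=98.3`, `parameter_id=0`, `trial_job_id="ZfUOJ"` and `sequence=0`, this produces the same text as the first `NNISDK_ME` message in the log.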

platform.__init__

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

from ..env_vars import trial_env_vars, dispatcher_env_vars

assert dispatcher_env_vars.SDK_PROCESS != 'dispatcher'

if trial_env_vars.NNI_PLATFORM is None:
    from .standalone import *
elif trial_env_vars.NNI_PLATFORM == 'unittest':
    from .test import *
else:
    from .local import *
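
The import chain above can be restated as a plain function to see which backend a given `NNI_PLATFORM` value selects. This is an illustrative restatement only; `select_platform_module` is a hypothetical name, and the returned strings are just the submodule names from the imports above.

```python
def select_platform_module(nni_platform):
    """Mirror the if/elif/else chain: map an NNI_PLATFORM value to a submodule name."""
    if nni_platform is None:
        return "standalone"
    if nni_platform == "unittest":
        return "test"
    # Every other value, including "adl", falls through to the local backend.
    return "local"
```

So with `NNI_PLATFORM` set to `adl`, the `send_metric` that runs is the one from the `local` submodule.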

platform.local send_metric() function

def send_metric(string):
    if _nni_platform != 'local' or _reuse_mode in ('true', 'True'):
        assert len(string) < 1000000, 'Metric too long'
        print("NNISDK_MEb'%s'" % (string), flush=True)
    else:
        data = (string + '\n').encode('utf8')
        assert len(data) < 1000000, 'Metric too long'
        _metric_file.write(b'ME%06d%b' % (len(data), data))
        _metric_file.flush()
        if sys.platform == "win32":
            file = open(_metric_file.name)
            file.close()
        else:
            subprocess.run(['touch', _metric_file.name], check=True)
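
The branch condition in `send_metric` decides whether the metric is printed to stdout or written to the metric file. The sketch below isolates that condition, with `metric_transport` and `transport_after_late_override` as hypothetical names. The second function illustrates one possible explanation for the log, under the assumption that `_nni_platform` and `_reuse_mode` are read once when the module is imported (the excerpt does not show where they are assigned): overriding `os.environ` later inside `test` would then not change the branch taken.

```python
def metric_transport(nni_platform, reuse_mode):
    """Return 'stdout' or 'file', mirroring the condition in send_metric."""
    if nni_platform != "local" or reuse_mode in ("true", "True"):
        return "stdout"
    return "file"


def transport_after_late_override(env):
    """Show that a value captured early ignores later changes to the environment.

    `env` is a plain dict standing in for os.environ.
    """
    captured_platform = env.get("NNI_PLATFORM")  # read once, as at import time (assumed)
    captured_reuse = env.get("REUSE_MODE")
    env["NNI_PLATFORM"] = "local"                # late override, as done in test()
    env["REUSE_MODE"] = "False"
    return metric_transport(captured_platform, captured_reuse)
```

For a trial started with `NNI_PLATFORM=adl`, both functions give `stdout`, which is consistent with the `NNISDK_ME` lines still appearing after the variable was set to `local`.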

When I print NNI_PLATFORM it is adl, so the metric is printed to stdout and the log shows "NNISDK_ME...".

Is there a different way to get the metric onto the web UI, or is there something I did wrong?

Please let me know. Thanks!

Environment:

NNI version: 2.6
Training service (local|remote|pai|aml|etc): frameworkcontroller
Client OS: ubuntu 18.04
Server OS (for remote mode only):
Python version: 3.6.9
PyTorch/TensorFlow version: 1.10.1+cu102

scarlett2018 commented 2 years ago

@pw2393 - would you mind taking a look at this AdaptDL usage issue?

scarlett2018 commented 2 years ago

@juniroc @pw2393 - is this problem solved? As there has been no response so far, closing as overdue. Feel free to reopen if it is still an issue.