microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

View Web UI in Google Colab #1580

Closed thomas-beznik closed 5 years ago

thomas-beznik commented 5 years ago

Hello everyone,

I am trying to use NNI inside Google Colaboratory. I am able to run experiments (I can see that the experiment folder and trial folders are created), but I can't open the Web UI. Since the experiment is started on the Google Colab server rather than on my local machine, I know that I can't access it at the usual "http://127.0.0.1". I therefore obtained the IP address of the notebook using https://stackoverflow.com/questions/50639768/colaboratory-virtual-instances-ip-range, but it still doesn't work ("this site can't be reached").

How can I view the Web UI on Google Colab? Would it be possible to use Google Colab as a remote machine?

If it isn't possible, can I view the results of an experiment in the Web UI after it has completed? That way I could run the experiment on Google Colab and then inspect the results locally.

Thanks for the help!

ultmaster commented 5 years ago

Hi. Sorry for the slow response.

Here is a hint: facebookresearch/visdom#419.

It turns out that a small modification of that approach works. You can copy the following directly into your Colab notebook and try it:

! npm install -g localtunnel
! pip install nni
! git clone https://github.com/microsoft/nni
get_ipython().system_raw('nnictl create --config nni/examples/trials/mnist/config.yml --port 5000 2>&1 &')
import time
time.sleep(10)
get_ipython().system_raw('lt --port 5000 >> url.txt 2>&1 &')
time.sleep(5)
! cat url.txt

Output looks like:

......
your url is: https://odd-dragon-72.localtunnel.me

Done.
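
The fixed `time.sleep(5)` in the snippet above may be too short if the tunnel is slow to come up. A minimal sketch of polling `url.txt` instead, assuming the localtunnel output format shown above (`your url is: ...`); `find_tunnel_url` and `wait_for_tunnel_url` are hypothetical helper names, not part of NNI or localtunnel:

```python
import re
import time
from pathlib import Path

# Matches the line that localtunnel printed in the output above.
URL_PATTERN = re.compile(r"your url is:\s*(https?://\S+)")


def find_tunnel_url(text):
    """Return the first tunnel URL found in `text`, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(1) if match else None


def wait_for_tunnel_url(path="url.txt", timeout=30.0, interval=1.0):
    """Poll `path` until a tunnel URL appears or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        log_file = Path(path)
        if log_file.exists():
            url = find_tunnel_url(log_file.read_text())
            if url:
                return url
        time.sleep(interval)
    return None


# Offline check against the sample output from this thread.
print(find_tunnel_url("......\nyour url is: https://odd-dragon-72.localtunnel.me\n"))
```

In a notebook, `wait_for_tunnel_url()` could replace the `time.sleep(5)` and `! cat url.txt` pair; it returns `None` if no URL shows up within the timeout.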

scarlett2018 commented 5 years ago

@thomas-beznik have you tried the suggestion from @ultmaster? Does it work for you? If so, may we close the issue? Thanks.

thomas-beznik commented 5 years ago

It does indeed work, thank you very much!

JunweiSUN commented 4 years ago

Hi @thomas-beznik, the original solution no longer seems to work. We have provided a new solution; see https://nni.readthedocs.io/en/latest/CommunitySharings/NNI_colab_support.html
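
For readers using an ngrok tunnel (as in the follow-up below), the public URL can usually be read back from the local ngrok agent rather than copied by hand. A minimal sketch, assuming the agent exposes its inspection API at the default `http://localhost:4040/api/tunnels` and returns a JSON object with a `tunnels` list; `public_urls` and `fetch_public_urls` are hypothetical helper names, and the sample response is made up for illustration:

```python
import json
import urllib.request

# Assumed default address of the local ngrok inspection API.
NGROK_API = "http://localhost:4040/api/tunnels"


def public_urls(api_response):
    """Extract the public URLs from the JSON text returned by /api/tunnels."""
    data = json.loads(api_response)
    return [tunnel["public_url"] for tunnel in data.get("tunnels", [])]


def fetch_public_urls(api_url=NGROK_API, timeout=5.0):
    """Query a running ngrok agent; raises URLError if no agent is listening."""
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        return public_urls(response.read().decode("utf-8"))


# Offline illustration with a made-up response body.
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io", "proto": "https"}]}'
print(public_urls(sample))
```

`fetch_public_urls()` only works once both the NNI experiment and the ngrok tunnel are running in the Colab session.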

albertotono commented 3 years ago

Thanks @JunweiSUN, this is a really nice and clever solution.

I ran the MNIST_keras example in Colab with ngrok.

Screenshot from 2021-03-23 20-51-31

The Web UI says that all trials failed, but the code itself runs fine, as the output below shows:

Epoch 9/10
5/5 [==============================] - 0s 27ms/step - loss: 0.1320 - accuracy: 0.9600 - val_loss: 0.3516 - val_accuracy: 0.8790
[2021-03-24 03:19:16] INFO (nni/MainThread) Intermediate result: {"loss": 0.1483837068080902, "accuracy": 0.9549999833106995, "val_loss": 0.3515503406524658, "val_accuracy": 0.8790000081062317}  (Index 28)
Epoch 10/10
5/5 [==============================] - 0s 24ms/step - loss: 0.1251 - accuracy: 0.9651 - val_loss: 0.3444 - val_accuracy: 0.8830
[2021-03-24 03:19:16] INFO (nni/MainThread) Intermediate result: {"loss": 0.12060212343931198, "accuracy": 0.9649999737739563, "val_loss": 0.344418466091156, "val_accuracy": 0.8830000162124634}  (Index 29)
[2021-03-24 03:19:17] INFO (nni/MainThread) Final result: 0.8830000162124634

How can I get the WebUI to display more data and show the experiments properly?

JunweiSUN commented 3 years ago

Can you check the log by clicking the "View trial error" button in the WebUI and post it here?

albertotono commented 3 years ago

This is in the summary

{
    "trialJobId": "alPFT",
    "status": "FAILED",
    "hyperParameters": [
        "{\"parameter_id\":9,\"parameter_source\":\"algorithm\",\"parameters\":{\"batch_size\":64,\"hidden_size\":512,\"lr\":0.0001,\"momentum\":0.49373512088595206},\"parameter_index\":0}"
    ],
    "logPath": "file://localhost:/root/nni-experiments/ENw2gqIn/trials/alPFT",
    "startTime": 1616555907788,
    "sequenceId": 9,
    "endTime": 1616555911886,
    "stderrPath": "file:/localhost:/root/nni-experiments/ENw2gqIn/trials/alPFT/stderr",
    "intermediate": []
}
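
The summary packs the hyperparameters into a JSON string nested inside the JSON record, and the timestamps appear to be Unix times in milliseconds. A minimal sketch of unpacking both, using only fields copied from the record above; `describe_trial` is a hypothetical helper name, not an NNI function:

```python
import json
from datetime import datetime, timezone

# Fields copied from the trial summary above.
trial = {
    "trialJobId": "alPFT",
    "status": "FAILED",
    "hyperParameters": [
        "{\"parameter_id\":9,\"parameter_source\":\"algorithm\",\"parameters\":{\"batch_size\":64,\"hidden_size\":512,\"lr\":0.0001,\"momentum\":0.49373512088595206},\"parameter_index\":0}"
    ],
    "startTime": 1616555907788,
    "endTime": 1616555911886,
}


def describe_trial(record):
    """Decode the nested hyperparameter string and compute the run time."""
    params = json.loads(record["hyperParameters"][0])["parameters"]
    # Assumption: startTime and endTime are milliseconds since the Unix epoch.
    duration_s = (record["endTime"] - record["startTime"]) / 1000.0
    started = datetime.fromtimestamp(record["startTime"] / 1000.0, tz=timezone.utc)
    return params, duration_s, started


params, duration_s, started = describe_trial(trial)
print(params)
print(f"{trial['status']} after {duration_s:.3f} s, started {started.isoformat()}")
```

Under that assumption, this trial ended roughly four seconds after it started, and the decoded parameter names are `batch_size`, `hidden_size`, `lr` and `momentum`.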

I ran this code:

import argparse
import logging

import keras
import numpy as np
from keras import backend as K
from keras.datasets import mnist
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential

import nni  # hyperparameter tuning

LOG = logging.getLogger('mnist_keras')
K.set_image_data_format('channels_last')

H, W = 28, 28
NUM_CLASSES = 10


def create_mnist_model(hyper_params, input_shape=(H, W, 1), num_classes=NUM_CLASSES):
    """Build and compile a small CNN for MNIST."""
    layers = [
        Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(100, activation='relu'),
        Dense(num_classes, activation='softmax')
    ]

    model = Sequential(layers)

    # `learning_rate` replaces the deprecated `lr` argument.
    if hyper_params['optimizer'] == 'Adam':
        optimizer = keras.optimizers.Adam(learning_rate=hyper_params['learning_rate'])
    else:
        optimizer = keras.optimizers.SGD(learning_rate=hyper_params['learning_rate'], momentum=0.9)
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer, metrics=['accuracy'])

    return model


def load_mnist_data(args):
    """Load MNIST, scale pixels to [0, 1] and truncate to the requested sizes."""
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # `np.float` was removed from recent NumPy releases; the builtin `float` is equivalent.
    x_train = (np.expand_dims(x_train, -1).astype(float) / 255.)[:args.num_train]
    x_test = (np.expand_dims(x_test, -1).astype(float) / 255.)[:args.num_test]
    y_train = keras.utils.to_categorical(y_train, NUM_CLASSES)[:args.num_train]
    y_test = keras.utils.to_categorical(y_test, NUM_CLASSES)[:args.num_test]

    LOG.debug('x_train shape: %s', (x_train.shape,))
    LOG.debug('x_test shape: %s', (x_test.shape,))

    return x_train, y_train, x_test, y_test


class SendMetrics(keras.callbacks.Callback):
    """Keras callback that reports the metrics of every epoch to NNI."""

    def on_epoch_end(self, epoch, logs={}):
        LOG.debug(logs)
        nni.report_intermediate_result(logs)


def train(args, params):
    """Train the model and report the final test accuracy to NNI."""
    x_train, y_train, x_test, y_test = load_mnist_data(args)
    model = create_mnist_model(params)

    model.fit(x_train, y_train, batch_size=args.batch_size, epochs=args.epochs, verbose=1,
              validation_data=(x_test, y_test), callbacks=[SendMetrics()])

    _, acc = model.evaluate(x_test, y_test, verbose=0)
    LOG.debug('Final result is: %f', acc)  # %f, since the accuracy is a float
    nni.report_final_result(acc)


def generate_default_params():
    """Default hyperparameters, used for any key the tuner does not provide."""
    return {
        'optimizer': 'Adam',
        'learning_rate': 0.001
    }


if __name__ == '__main__':
    PARSER = argparse.ArgumentParser()
    PARSER.add_argument("--batch_size", type=int, default=200, help="batch size", required=False)
    PARSER.add_argument("--epochs", type=int, default=10, help="Train epochs", required=False)
    PARSER.add_argument("--num_train", type=int, default=1000,
                        help="Number of train samples to be used, maximum 60000", required=False)
    PARSER.add_argument("--num_test", type=int, default=1000,
                        help="Number of test samples to be used, maximum 10000", required=False)

    ARGS, UNKNOWN = PARSER.parse_known_args()

    try:
        # get parameters from tuner
        RECEIVED_PARAMS = nni.get_next_parameter()
        LOG.debug(RECEIVED_PARAMS)
        PARAMS = generate_default_params()
        PARAMS.update(RECEIVED_PARAMS)
        # train
        train(ARGS, PARAMS)
    except Exception as e:
        LOG.exception(e)
        raise

albertotono commented 3 years ago

I found the dispatcher log:

[2021-03-24 03:06:02] INFO (root/MainThread) Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
[2021-03-24 03:06:02] INFO (root/MainThread) Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt
[2021-03-24 03:06:02] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2021-03-24 03:06:02] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001349 seconds
[2021-03-24 03:06:02] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:21] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001467 seconds
[2021-03-24 03:06:21] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001477 seconds
[2021-03-24 03:06:36] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:51] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001557 seconds
[2021-03-24 03:06:51] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:06] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001508 seconds
[2021-03-24 03:07:06] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:21] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002237 seconds
[2021-03-24 03:07:21] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001532 seconds
[2021-03-24 03:07:36] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:02] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001626 seconds
[2021-03-24 03:18:02] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:12] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001536 seconds
[2021-03-24 03:18:12] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:22] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001493 seconds
[2021-03-24 03:18:22] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:32] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002188 seconds
[2021-03-24 03:18:32] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials

This is what I see in the Trial details view:

Screenshot from 2021-03-23 21-16-38

JunweiSUN commented 3 years ago

Click each trial shown in your screenshot and you will find a blue "View trial error" button that shows the recorded errors.

albertotono commented 3 years ago

Now I was able to run one trial successfully [SUCCEEDED].

I'm not sure why I now have only 4 trials and not 10, like the number of epochs.

Screenshot from 2021-03-24 08-42-30

I found the trial log, shown below:

[2021-03-24 15:16:00] PRINT {'data_dir': './data', 'batch_size': 64, 'batch_num': None, 'hidden_size': 1024, 'lr': 0.001, 'momentum': 0.441724246819169, 'epochs': 10, 'seed': 1, 'no_cuda': False, 'log_interval': 1000}
[2021-03-24 15:16:01] PRINT Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
[2021-03-24 15:16:01] ERROR (mnist_AutoML/MainThread) HTTP Error 503: Service Unavailable
Traceback (most recent call last):
  File "mnist.py", line 163, in <module>
    main(params)
  File "mnist.py", line 97, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py", line 79, in __init__
    self.download()
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py", line 146, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 314, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 140, in download_url
    raise e
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 132, in download_url
    _urlretrieve(url, fpath)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 29, in _urlretrieve
    with urllib.request.urlopen(urllib.request.Request(url, headers={"User-Agent": USER_AGENT})) as response:
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

albertotono commented 3 years ago

It took some time, but now I actually have 7 trials. It is getting there, but it is taking a long time. It seems to be working now; only the first 2 trials have failed so far. I think it has now exceeded the max duration.

JunweiSUN commented 3 years ago

Yeah. The first two failures seem to have been caused by a problem in the dataset download process. Maybe the network connection was poor at that time.
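
A transient failure such as the HTTP 503 in the traceback above can sometimes be absorbed by retrying the download. A minimal sketch of a retry wrapper with exponential backoff; `retry` is a hypothetical helper, not part of NNI or torchvision, and it assumes the download surfaces a `urllib.error.URLError` (or its subclass `HTTPError`), as the traceback suggests:

```python
import time
import urllib.error


def retry(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, backing off between URL/HTTP errors."""
    for attempt in range(attempts):
        try:
            return action()
        except urllib.error.URLError:  # HTTPError is a subclass of URLError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


# Possible use in a trial script (requires torchvision, so left as a comment):
# dataset = retry(lambda: datasets.MNIST('./data', train=True, download=True))
```

Retrying only helps with short outages; if the server stays unavailable, the last error is still raised and the trial is marked as failed.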

albertotono commented 3 years ago

Thank you so much for your help @JunweiSUN, I really appreciate your support.

JunweiSUN commented 3 years ago

It seems that you are only running your code on the CPU. If you turn on the GPU in Colab, a trial will finish in a few minutes.

albertotono commented 3 years ago

Thanks for the suggestion. Actually, I am running this code: https://github.com/microsoft/nni/blob/a0ae02e6c5e096b70cb6bb1ecd643dddd81000ba/examples/trials/mnist-pytorch/mnist.py

As far as I can tell, it should use the GPU by default. I have also enabled the GPU in Colab.

I'm not sure why it is still running on the CPU.

@JunweiSUN, so the model is training in Colab, but how can I monitor the training? Is there a way to visualize the process in Colab, or should I always rely on the WebUI?

JunweiSUN commented 3 years ago

To use a GPU, you need to set gpuNum in config.yml to a value greater than 0. If you want to check the training status without using the WebUI, you can try the nnictl command or Launch an Experiment from Python.
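
Since Colab sessions start from a fresh clone, the `gpuNum` change can be scripted before calling `nnictl create`. A minimal sketch that rewrites the value in the config text; `set_gpu_num` is a hypothetical helper, and the sample fragment is only illustrative of the `trial:` section of an example config.yml:

```python
import re


def set_gpu_num(config_text, gpu_num):
    """Return `config_text` with the value of its `gpuNum:` entry set to `gpu_num`."""
    if gpu_num < 0:
        raise ValueError("gpu_num is a count of GPUs and cannot be negative")
    new_text, count = re.subn(
        r"^(\s*gpuNum:\s*)\d+",
        lambda match: f"{match.group(1)}{gpu_num}",
        config_text,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no gpuNum entry found in config")
    return new_text


# Illustrative fragment; a real config.yml contains more keys.
sample = "trial:\n  command: python3 mnist.py\n  codeDir: .\n  gpuNum: 0\n"
print(set_gpu_num(sample, 1))
```

Note that the value is the number of GPUs each trial may use, not the index of a particular GPU.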

albertotono commented 3 years ago

Awesome! With the GPU it worked. I thought that gpuNum referred to the index of the GPU (like the first GPU), not to the number of GPUs (how many GPUs do you want to use?). Now it is clear and works well. I will try to launch an experiment from Python soon and let you know the results. Great work.