Closed thomas-beznik closed 5 years ago
Hi, sorry for the slow response.
Here is a hint: facebookresearch/visdom#419.
It turns out that a small modification of that approach works. You can copy the following directly into your Colab notebook and try it:
! npm install -g localtunnel
! pip install nni
! git clone https://github.com/microsoft/nni
get_ipython().system_raw('nnictl create --config nni/examples/trials/mnist/config.yml --port 5000 2>&1 &')
import time
time.sleep(10)
get_ipython().system_raw('lt --port 5000 >> url.txt 2>&1 &')
time.sleep(5)
! cat url.txt
Output looks like:
......
your url is: https://odd-dragon-72.localtunnel.me
Done.
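The fixed `time.sleep(5)` above may return before localtunnel has written its URL. The following is a minimal sketch of polling the file instead, assuming `lt` writes a line of the form `your url is: <url>` to `url.txt` as in the output shown. `wait_for_tunnel_url` is a hypothetical helper, not part of NNI or localtunnel.

```python
import re
import time
from pathlib import Path

# Matches the line that localtunnel printed in the output above.
URL_LINE = re.compile(r"your url is:\s*(https?://\S+)")


def wait_for_tunnel_url(path="url.txt", timeout=30.0, poll=0.5):
    """Poll `path` until a tunnel URL appears; return it, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        url_file = Path(path)
        if url_file.exists():
            match = URL_LINE.search(url_file.read_text())
            if match:
                return match.group(1)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the notebook, `print(wait_for_tunnel_url())` could then replace the last `time.sleep(5)` and `! cat url.txt` steps.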
@thomas-beznik have you tried the suggestion from @ultmaster? Does it work for you? If so, may we close the issue? Thanks.
It works indeed, thank you very much !
Hi @thomas-beznik, the original solution no longer seems to work. We have provided a new solution; see https://nni.readthedocs.io/en/latest/CommunitySharings/NNI_colab_support.html
Thanks @JunweiSUN,
this is a really nice and clever solution.
I ran the MNIST_keras example in Colab with ngrok.
The WebUI says that all trials failed,
but the code appears to run fine; the results are below:
Epoch 9/10
5/5 [==============================] - 0s 27ms/step - loss: 0.1320 - accuracy: 0.9600 - val_loss: 0.3516 - val_accuracy: 0.8790
[2021-03-24 03:19:16] INFO (nni/MainThread) Intermediate result: {"loss": 0.1483837068080902, "accuracy": 0.9549999833106995, "val_loss": 0.3515503406524658, "val_accuracy": 0.8790000081062317} (Index 28)
Epoch 10/10
5/5 [==============================] - 0s 24ms/step - loss: 0.1251 - accuracy: 0.9651 - val_loss: 0.3444 - val_accuracy: 0.8830
[2021-03-24 03:19:16] INFO (nni/MainThread) Intermediate result: {"loss": 0.12060212343931198, "accuracy": 0.9649999737739563, "val_loss": 0.344418466091156, "val_accuracy": 0.8830000162124634} (Index 29)
[2021-03-24 03:19:17] INFO (nni/MainThread) Final result: 0.8830000162124634
How can I get the WebUI to display the data and the experiments properly?
Can you check the log by clicking the "View trial error" button in the WebUI and post it?
This is what the summary shows:
{
"trialJobId": "alPFT",
"status": "FAILED",
"hyperParameters": [
"{\"parameter_id\":9,\"parameter_source\":\"algorithm\",\"parameters\":{\"batch_size\":64,\"hidden_size\":512,\"lr\":0.0001,\"momentum\":0.49373512088595206},\"parameter_index\":0}"
],
"logPath": "file://localhost:/root/nni-experiments/ENw2gqIn/trials/alPFT",
"startTime": 1616555907788,
"sequenceId": 9,
"endTime": 1616555911886,
"stderrPath": "file:/localhost:/root/nni-experiments/ENw2gqIn/trials/alPFT/stderr",
"intermediate": []
}
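A summary like the one above can be unpacked programmatically. The sketch below assumes that each `hyperParameters` entry is itself a JSON-encoded string (as the escaped quotes suggest) and that `startTime` and `endTime` are Unix timestamps in milliseconds. `summarize_trial` is a hypothetical helper, not an NNI API.

```python
import json


def summarize_trial(summary):
    """Extract status, tuner parameters and duration from a trial summary dict."""
    # Each "hyperParameters" entry is a JSON string nested inside the JSON summary.
    params = json.loads(summary["hyperParameters"][0])["parameters"]
    # Assumption: startTime / endTime are Unix timestamps in milliseconds.
    duration_s = (summary["endTime"] - summary["startTime"]) / 1000.0
    return {
        "status": summary["status"],
        "parameters": params,
        "duration_s": duration_s,
    }


# Values copied from the summary posted above.
summary = {
    "trialJobId": "alPFT",
    "status": "FAILED",
    "hyperParameters": [
        '{"parameter_id":9,"parameter_source":"algorithm","parameters":'
        '{"batch_size":64,"hidden_size":512,"lr":0.0001,'
        '"momentum":0.49373512088595206},"parameter_index":0}'
    ],
    "startTime": 1616555907788,
    "endTime": 1616555911886,
}

info = summarize_trial(summary)
print(info["status"], info["duration_s"], info["parameters"])
```

Under these assumptions the trial ended about 4 seconds after it started, which would be consistent with a failure at startup rather than during training; the file named in `stderrPath` is the place to confirm that.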
I ran this code:
import argparse
import logging

import keras
import numpy as np
from keras import backend as K
from keras.datasets import mnist
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential

import nni  # hyperparameter tuning

LOG = logging.getLogger('mnist_keras')
K.set_image_data_format('channels_last')

H, W = 28, 28
NUM_CLASSES = 10


def create_mnist_model(hyper_params, input_shape=(H, W, 1), num_classes=NUM_CLASSES):
    # Two convolution layers followed by a small dense classifier
    layers = [
        Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(100, activation='relu'),
        Dense(num_classes, activation='softmax')
    ]
    model = Sequential(layers)

    # The optimizer and learning rate come from the tuner
    if hyper_params['optimizer'] == 'Adam':
        optimizer = keras.optimizers.Adam(lr=hyper_params['learning_rate'])
    else:
        optimizer = keras.optimizers.SGD(lr=hyper_params['learning_rate'], momentum=0.9)
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer, metrics=['accuracy'])
    return model


def load_mnist_data(args):
    # Load MNIST, scale pixels to [0, 1] and keep only the requested number of samples
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = (np.expand_dims(x_train, -1).astype(float) / 255.)[:args.num_train]
    x_test = (np.expand_dims(x_test, -1).astype(float) / 255.)[:args.num_test]
    y_train = keras.utils.to_categorical(y_train, NUM_CLASSES)[:args.num_train]
    y_test = keras.utils.to_categorical(y_test, NUM_CLASSES)[:args.num_test]
    LOG.debug('x_train shape: %s', (x_train.shape,))
    LOG.debug('x_test shape: %s', (x_test.shape,))
    return x_train, y_train, x_test, y_test


class SendMetrics(keras.callbacks.Callback):
    # Report the metrics of every epoch to NNI as an intermediate result
    def on_epoch_end(self, epoch, logs={}):
        LOG.debug(logs)
        nni.report_intermediate_result(logs)


def train(args, params):
    x_train, y_train, x_test, y_test = load_mnist_data(args)
    model = create_mnist_model(params)
    model.fit(x_train, y_train, batch_size=args.batch_size, epochs=args.epochs, verbose=1,
              validation_data=(x_test, y_test), callbacks=[SendMetrics()])
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    LOG.debug('Final result is: %f', acc)
    nni.report_final_result(acc)


def generate_default_params():
    return {
        'optimizer': 'Adam',
        'learning_rate': 0.001
    }


if __name__ == '__main__':
    PARSER = argparse.ArgumentParser()
    PARSER.add_argument("--batch_size", type=int, default=200, help="batch size", required=False)
    PARSER.add_argument("--epochs", type=int, default=10, help="Train epochs", required=False)
    PARSER.add_argument("--num_train", type=int, default=1000, help="Number of train samples to be used, maximum 60000", required=False)
    PARSER.add_argument("--num_test", type=int, default=1000, help="Number of test samples to be used, maximum 10000", required=False)
    ARGS, UNKNOWN = PARSER.parse_known_args()

    try:
        # get parameters from tuner
        RECEIVED_PARAMS = nni.get_next_parameter()
        LOG.debug(RECEIVED_PARAMS)
        PARAMS = generate_default_params()
        PARAMS.update(RECEIVED_PARAMS)
        # train
        train(ARGS, PARAMS)
    except Exception as e:
        LOG.exception(e)
        raise
I found the dispatcher log:
[2021-03-24 03:06:02] INFO (root/MainThread) Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
[2021-03-24 03:06:02] INFO (root/MainThread) Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt
[2021-03-24 03:06:02] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2021-03-24 03:06:02] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001349 seconds
[2021-03-24 03:06:02] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:21] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001467 seconds
[2021-03-24 03:06:21] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001477 seconds
[2021-03-24 03:06:36] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:06:51] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001557 seconds
[2021-03-24 03:06:51] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:06] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001508 seconds
[2021-03-24 03:07:06] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:21] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002237 seconds
[2021-03-24 03:07:21] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:07:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001532 seconds
[2021-03-24 03:07:36] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:02] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001626 seconds
[2021-03-24 03:18:02] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:12] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001536 seconds
[2021-03-24 03:18:12] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:22] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001493 seconds
[2021-03-24 03:18:22] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[2021-03-24 03:18:32] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002188 seconds
[2021-03-24 03:18:32] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
This is what I see under Trial details:
Click each of the trials shown in your picture and you will find a blue "View trial error" button that shows the recorded errors.
I have now been able to run one trial [SUCCEEDED].
I am not sure why I now have only 4 trials rather than 10, like the number of epochs.
I found the trial log; here it is:
[2021-03-24 15:16:00] PRINT {'data_dir': './data', 'batch_size': 64, 'batch_num': None, 'hidden_size': 1024, 'lr': 0.001, 'momentum': 0.441724246819169, 'epochs': 10, 'seed': 1, 'no_cuda': False, 'log_interval': 1000}
[2021-03-24 15:16:01] PRINT Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
[2021-03-24 15:16:01] ERROR (mnist_AutoML/MainThread) HTTP Error 503: Service Unavailable
Traceback (most recent call last):
File "mnist.py", line 163, in <module>
main(params)
File "mnist.py", line 97, in main
transforms.Normalize((0.1307,), (0.3081,))
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py", line 79, in __init__
self.download()
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py", line 146, in download
download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 314, in download_and_extract_archive
download_url(url, download_root, filename, md5)
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 140, in download_url
raise e
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 132, in download_url
_urlretrieve(url, fpath)
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py", line 29, in _urlretrieve
with urllib.request.urlopen(urllib.request.Request(url, headers={"User-Agent": USER_AGENT})) as response:
File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
It took some time, but I now have 7 trials. It is getting there, but it is taking a long time. It seems to be working now; only the first 2 trials have failed so far. I think it has now exceeded the max duration.
Yeah. The first two failures seem to come from a problem in the dataset download process. Maybe the network conditions were poor at that time.
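A transient error such as the HTTP 503 in the trial log can often be absorbed by retrying the download. The following is a minimal sketch of retry with exponential backoff; `retry` is a hypothetical helper, and its parameter names and defaults are assumptions rather than anything provided by NNI or torchvision.

```python
import time


def retry(fn, attempts=5, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**i after the i-th failure."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)
```

In the trial script, the dataset construction could then be wrapped, for example as `retry(lambda: datasets.MNIST('./data', train=True, download=True), exceptions=(urllib.error.HTTPError,))`, assuming `datasets` is `torchvision.datasets` and `urllib.error` has been imported.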
Thank you so much for your help @JunweiSUN, I very much appreciate your support.
It seems that you are only running your code on the CPU. If you turn on the GPU on Colab, a trial will finish in a few minutes.
Thanks for the suggestion. I am actually running this code: https://github.com/microsoft/nni/blob/a0ae02e6c5e096b70cb6bb1ecd643dddd81000ba/examples/trials/mnist-pytorch/mnist.py
and it should use the GPU by default, as far as I can tell. I also enabled the GPU in Colab.
I am not sure why it is still running on the CPU.
@JunweiSUN, so the model is training in Colab, but how can I monitor the training? Is there a way to visualize the process in Colab, or should I always rely on the WebUI?
To use a GPU, you need to set gpuNum in config.yml to a value greater than 0. If you want to check the training status without using the WebUI, you can try the nnictl command or Launch an Experiment from Python.
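In a notebook, the `gpuNum` change can be applied to the config text before the experiment is created. The sketch below assumes the config contains a line of the form `gpuNum: 0` under `trial:`, as in the sample string; that layout and the helper `set_gpu_num` are illustrative assumptions, so check them against your actual config.yml.

```python
import re


def set_gpu_num(config_text, gpu_num):
    """Return config_text with the first `gpuNum:` value replaced by gpu_num."""
    pattern = re.compile(r"^(\s*gpuNum\s*:\s*)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(gpu_num), config_text, count=1
    )
    if count == 0:
        raise ValueError("no gpuNum entry found in config")
    return new_text


# Assumed layout of the trial section; adjust to match the real file.
sample = """trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
"""

updated = set_gpu_num(sample, 1)
print(updated)
```

Here the value is the number of GPUs requested for each trial, not the index of a particular GPU.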
Awesome! With the GPU it worked. I thought that gpuNum referred to the index of the GPU (e.g. the first GPU), not to the number of GPUs (how many GPUs do you want to use?). Now it is clear and works well. I will try launching an experiment from Python soon and let you know the results. Great work.
Hello everyone,
I am trying to use NNI inside Google Colaboratory. I am able to run experiments (I can see that the experiment folder and trial folders are created), but I can't open the Web UI. Since the experiment is started on the Google Colab server rather than on my local machine, I know that I can't access it with the usual "http://127.0.0.1". I therefore obtained the IP address of the notebook using https://stackoverflow.com/questions/50639768/colaboratory-virtual-instances-ip-range, but it still doesn't work ("this site can't be reached").
How can I view the Web UI on Google Colab? Would it be possible to use Google Colab as a remote machine?
If that isn't possible, can I view the results of an experiment in the Web UI after it has completed? That way I could run the experiment on Google Colab and then inspect the results locally.
Thanks for the help!