CGO Engine Example

Qinghao-Hu commented 2 years ago

Describe the issue:

Would you please share some examples of CGO engine usage? I found two examples under the nni/test/retiarii_test/ folder, but neither of them runs on my platform. I just want to dive deeper into this promising feature.

For cgo_mnasnet/test.py

Traceback (most recent call last):
  File "/home/xxx/xxx/cgo_mnasnet/test.py", line 64, in <module>
    exp_config.training_service.use_active_gpu = True
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/nni/experiment/config/base.py", line 248, in __setattr__
    raise AttributeError(f'{type(self).__name__} does not have field {name}')
AttributeError: RemoteConfig does not have field use_active_gpu
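
For readers unfamiliar with this kind of error, here is a minimal sketch of how a config class can reject assignments to fields it does not declare, which is what the traceback above suggests base.py does. StrictConfig, LocalLikeConfig and RemoteLikeConfig are hypothetical names made up for illustration; this is not NNI's actual implementation.

```python
from dataclasses import dataclass, fields


class StrictConfig:
    """Hypothetical base class that rejects assignment to undeclared fields."""

    def __setattr__(self, name, value):
        declared = {f.name for f in fields(self)}
        if name not in declared:
            raise AttributeError(f'{type(self).__name__} does not have field {name}')
        super().__setattr__(name, value)


@dataclass
class LocalLikeConfig(StrictConfig):
    # Hypothetical config that declares use_active_gpu.
    use_active_gpu: bool = False


@dataclass
class RemoteLikeConfig(StrictConfig):
    # Hypothetical config that does NOT declare use_active_gpu.
    reuse_mode: bool = False


def try_set(config, name, value):
    """Return 'ok' if the assignment is accepted, else the error message."""
    try:
        setattr(config, name, value)
        return "ok"
    except AttributeError as err:
        return str(err)


if __name__ == "__main__":
    print(try_set(LocalLikeConfig(), "use_active_gpu", True))
    print(try_set(RemoteLikeConfig(), "use_active_gpu", True))
```

Under this sketch, the same assignment succeeds on one config type and fails on the other, so the error points at the chosen training service config rather than at the assigned value.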

Why does CGO only support remote mode?

Environment:

NNI version: 2.6.1
Training service (local|remote|pai|aml|etc): local & remote, with 4 RTX 3090
Client OS: Ubuntu 20.04
Server OS (for remote mode only):
Python version: 3.9.7
PyTorch/TensorFlow version: PyTorch 1.10.2
Is conda/virtualenv/venv used?: Yes, use conda
Is running in Docker?: No

hzhua commented 2 years ago

Thanks for reporting the issue. It will be fixed soon.

Currently, CGO only supports remote mode because CGO takes over control of trial scheduling from nni_manager. Only the remote training service has an interface that allows the execution engine to do the scheduling.

Qinghao-Hu commented 2 years ago

Hi @hzhua,

Thanks for your effort in fixing CGO issue #4621. I have applied all the changes to my local NNI 2.7 installation.

But I have run into another problem: it seems that BypassAccelerator is not finished yet? (I am still running the cgo_mnasnet example.)

Traceback (most recent call last):
  File "/home/xxx/cgo_mnasnet/test.py", line 40, in <module>
    trainer = cgo.Classification(
  File "/home/xxx/miniconda3/lib/python3.9/site-packages/nni/retiarii/evaluator/pytorch/cgo/evaluator.py", line 201, in __init__
    module, Trainer(use_cgo=True, **trainer_kwargs), train_dataloader=train_dataloader, val_dataloaders=val_dataloaders
  File "/home/xxx/miniconda3/lib/python3.9/site-packages/nni/retiarii/evaluator/pytorch/cgo/trainer.py", line 29, in __init__
    trainer_kwargs["accelerator"] = BypassAccelerator(device="cpu", **trainer_kwargs)
TypeError: Can't instantiate abstract class BypassAccelerator with abstract methods auto_device_count, get_parallel_devices, is_available, parse_devices
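
This kind of TypeError typically appears when a base class gains new abstract methods that an existing subclass does not implement. The sketch below reproduces the pattern with Python's abc module only. OldBase, NewBase and make_subclass are hypothetical names for illustration; they are not pytorch-lightning or NNI classes.

```python
from abc import ABC, abstractmethod


class OldBase(ABC):
    """Hypothetical older base class with one abstract method."""

    @abstractmethod
    def setup(self):
        ...


class NewBase(OldBase):
    """Hypothetical newer base class that adds another abstract method."""

    @staticmethod
    @abstractmethod
    def is_available():
        ...


def make_subclass(base):
    """Build a subclass that implements only what OldBase required."""

    class Bypass(base):
        def setup(self):
            return "ready"

    return Bypass


if __name__ == "__main__":
    # Works: every abstract method of OldBase is implemented.
    print(make_subclass(OldBase)().setup())
    # Fails: NewBase requires is_available, which Bypass never defines.
    try:
        make_subclass(NewBase)()
    except TypeError as err:
        print(err)
```

The subclass code is identical in both cases; only the base class changed, which is why the fix can lie in the installed dependency version rather than in the subclass.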

hzhua commented 2 years ago

Which version of pytorch-lightning are you using? It seems that you are using an incompatible version that has a different interface for Accelerator. Could you try pytorch-lightning 1.5.1?
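
A quick way to see which version is installed before running the example is sketched below, using only the standard library. The helper names are hypothetical, and treating "same major.minor as 1.5.1" as compatible is an assumption made for illustration: the thread only establishes that 1.5.1 is the suggested version.

```python
from importlib import metadata


def major_minor(version):
    """Return (major, minor) as integers from a version string like '1.5.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_compatible(installed, required="1.5.1"):
    """Assumption: a matching major.minor pair is treated as compatible."""
    return major_minor(installed) == major_minor(required)


def installed_version(package="pytorch-lightning"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pytorch-lightning is not installed")
    else:
        print(found, "compatible" if is_compatible(found) else "possibly incompatible")
```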

Qinghao-Hu commented 2 years ago

Hi @hzhua,

Thank you very much. Changing the pytorch-lightning version from 1.6.1 to 1.5.1 solved this issue.

Another question: some trials (typically 2-3) fail in every experiment, as shown below, and the experiment cannot finish (the last trial stays in the running state).

[Image: Screenshot from 2022-04-25 10-09-13]

I tried to track down the cause. In ~/nni-experiments/[exp_id]/trials, I can only find folders for SUCCEEDED trials, and all of them are empty. I also checked /tmp/nni-experiments/[exp_id]. There I can find the folders of the failed trials, such as ErXhQ, but they look very similar to those of the succeeded trials. In the trialrunner_stdout files, I found some information on trial ErXhQ, but it does not seem to indicate why the trial failed.

Epoch 0:  49%|████▉     | 293/600 [00:10<00:11, 27.32it/s, loss=1.65, v_num=2, train_loss_0=1.790, train_0_acc=0.330]
Epoch 1:  67%|██████▋   | 400/600 [00:16<00:08, 23.99it/s, loss=1.44, v_num=0, train_loss_0=1.470, train_0_acc=0.530, val_loss_0=6.640, val_0_acc=0.100]
[2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.
[2022-04-25 09:51:40] INFO (nni_syslog_runner_runner_q5sji/MainThread) [2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.
[2022-04-25 09:51:40.407191] INFO ErXhQ: clean up trial
[2022-04-25 09:51:40] INFO (nni_syslog_runner_runner_q5sji/MainThread) [2022-04-25 09:51:40.407191] INFO ErXhQ: clean up trial
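
When a log contains many such lines, a small script can list which trials exited with a non-zero code. The sketch below assumes the line format shown in the excerpt above ("INFO <trial id>: subprocess terminated. Exit code is <n>."); the function name is hypothetical, and the format may differ in other NNI versions.

```python
import re

# Matches e.g. "INFO ErXhQ: subprocess terminated. Exit code is 1."
EXIT_RE = re.compile(r"INFO (\w+): subprocess terminated\. Exit code is (-?\d+)\.")


def failed_trials(lines):
    """Map trial id to exit code for every trial with a non-zero exit code."""
    failures = {}
    for line in lines:
        match = EXIT_RE.search(line)
        if match and int(match.group(2)) != 0:
            # The same event may be logged twice; the dict keeps one entry.
            failures[match.group(1)] = int(match.group(2))
    return failures


if __name__ == "__main__":
    sample = [
        "[2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.",
        "[2022-04-25 09:51:40] INFO (nni_syslog_runner_runner_q5sji/MainThread) "
        "[2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.",
    ]
    print(failed_trials(sample))
```

The exit code only says that the subprocess failed. The actual exception, if one was printed, has to be searched for separately in the same file.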

Could you provide some insights on this issue? Thank you.

hzhua commented 2 years ago

Since Retiarii does not check the feasibility of mutations, some mutations may be invalid. For a trial running multiple models with de-duplicated input, if any of the models is invalid, Retiarii will disable CGO and fall back to running each model individually in a separate trial.
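
The fallback described above can be pictured with the following sketch. It is purely illustrative: plan_trials and is_valid are hypothetical names, and the thread does not say how Retiarii detects an invalid model, so the validity check is passed in as a plain function.

```python
def plan_trials(models, is_valid):
    """Group models into trials.

    If every model is valid, all models share one merged trial (the
    CGO-style case). Otherwise each model gets its own trial, so an
    invalid model only fails its own trial.
    """
    models = list(models)
    if all(is_valid(m) for m in models):
        return [models]
    return [[m] for m in models]


if __name__ == "__main__":
    print(plan_trials(["m1", "m2", "m3"], lambda m: True))
    print(plan_trials(["m1", "m2", "m3"], lambda m: m != "m2"))
```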

In your case, the failure of the first trial, ZUzv4, may be due to a shape mismatch introduced by a mutation. You may find errors in trialrunner_stdout like:

RuntimeError: Given groups=1, weight of size [1280, 24, 1, 1], expected input[100, 30, 8, 8] to have 24 channels, but got 30 channels instead
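
To read an error like the one above: a PyTorch convolution weight has shape [out_channels, in_channels / groups, kH, kW], and the input has shape [N, C, H, W], so the input's C must equal weight_shape[1] * groups. The sketch below checks this with plain lists and no PyTorch dependency; the function name is hypothetical.

```python
def check_conv_input(weight_shape, input_shape, groups=1):
    """Return None if the channel counts agree, else a short description."""
    expected = weight_shape[1] * groups  # channels the layer expects
    actual = input_shape[1]              # channels the input provides
    if expected != actual:
        return f"expected {expected} channels, got {actual}"
    return None


if __name__ == "__main__":
    # Shapes taken from the error message above.
    print(check_conv_input([1280, 24, 1, 1], [100, 30, 8, 8]))
```

With the shapes from the message, the layer expects 24 input channels while the preceding (mutated) layer produces 30, which is the kind of mismatch an unchecked mutation can introduce.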

But it is strange that a trial fails without printing an error log. Could you please share the file /tmp/nni-experiments/[exp_id]/trials/ErXhQ/parameter.cfg so that I can debug it?

Qinghao-Hu commented 2 years ago

Thanks for your quick response. I found the error message for ErXhQ:

RuntimeError: Given groups=1, weight of size [1280, 30, 1, 1], expected input[100, 18, 8, 8] to have 30 channels, but got 18 channels instead

How can I get the best configuration when all the trials are finished? It does not seem to print the best trial as the base execution_engine does.

hzhua commented 2 years ago

It is the same as other execution engines. Please check this link.

Just add one line after exp.run(...):

best_model_code = exp.export_top_models(formatter='code')

microsoft / nni

CGO Engine Example #4596