Open Qinghao-Hu opened 2 years ago
Thanks for reporting the issue. It will be fixed soon.
Currently, CGO only supports remote mode because CGO takes control of trial scheduling from nni_manager. Only remote mode has the interface that allows the execution engine to do the scheduling.
Hi @hzhua,
Thanks for your effort in fixing CGO issue #4621. I have applied all the changes to my local NNI 2.7 library.
But I have run into another problem: it seems the BypassAccelerator is not finished yet? (I am still running the cgo_mnasnet example.)
Traceback (most recent call last):
  File "/home/xxx/cgo_mnasnet/test.py", line 40, in <module>
    trainer = cgo.Classification(
  File "/home/xxx/miniconda3/lib/python3.9/site-packages/nni/retiarii/evaluator/pytorch/cgo/evaluator.py", line 201, in __init__
    module, Trainer(use_cgo=True, **trainer_kwargs), train_dataloader=train_dataloader, val_dataloaders=val_dataloaders
  File "/home/xxx/miniconda3/lib/python3.9/site-packages/nni/retiarii/evaluator/pytorch/cgo/trainer.py", line 29, in __init__
    trainer_kwargs["accelerator"] = BypassAccelerator(device="cpu", **trainer_kwargs)
TypeError: Can't instantiate abstract class BypassAccelerator with abstract methods auto_device_count, get_parallel_devices, is_available, parse_devices
Which version of pytorch-lightning are you using? It seems that you are using an incompatible version that has a different interface for Accelerator.
Could you try pytorch-lightning 1.5.1?
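For readers wondering why a version change produces this particular TypeError: it is Python's generic abstract-class check. If a base class gains new abstract methods in a newer release, a subclass written against the older interface can no longer be instantiated. The sketch below reproduces the mechanism with the standard library only. OldAccelerator, NewAccelerator and Bypass are hypothetical stand-ins and are not pytorch-lightning or NNI classes.

```python
from abc import ABC, abstractmethod


class OldAccelerator(ABC):
    """Hypothetical older base class: one abstract method."""

    @abstractmethod
    def setup(self):
        ...


class NewAccelerator(ABC):
    """Hypothetical newer base class: an extra abstract method was added."""

    @abstractmethod
    def setup(self):
        ...

    @staticmethod
    @abstractmethod
    def is_available():
        ...


def make_subclass(base):
    """Build a subclass that implements only what the old interface required."""

    class Bypass(base):
        def setup(self):
            return "ready"

    return Bypass


def try_instantiate(cls):
    """Return None if cls() succeeds, otherwise the TypeError message."""
    try:
        cls()
        return None
    except TypeError as exc:
        return str(exc)
```

Calling try_instantiate(make_subclass(OldAccelerator)) succeeds, while the same subclass built on NewAccelerator fails with a message naming is_available, which mirrors the traceback above.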
Hi @hzhua,
Thank you very much. Changing the pytorch-lightning version from 1.6.1 to 1.5.1 solved this issue.
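A quick way to catch this kind of mismatch before launching an experiment is to compare the installed version against the one known to work. This is only a sketch: the thread confirms that 1.5.1 works and 1.6.1 does not, so treating every 1.5.x release as compatible is an assumption, and the helper names are made up for illustration.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.5.1' into (1, 5, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def same_minor(installed, known_good):
    """True if both versions share major and minor numbers (assumed policy)."""
    return parse_version(installed)[:2] == parse_version(known_good)[:2]


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

For example, same_minor(installed_version("pytorch-lightning") or "", "1.5.1") could guard the start of a CGO script.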
Another question: some trials (typically 2-3) fail in every experiment, as shown below, and the experiment cannot finish (the last trial stays in the running state).
I tried to track down the issue, but in ~/nni-experiments/[exp_id]/trials I can only find the SUCCEEDED trial folders, and all of them are empty.
Besides, I checked /tmp/nni-experiments/[exp_id]. I can find the failed trial folders there, such as ErXhQ, but they look very similar to those of succeeded trials. In the trialrunner_stdout files I found some information on trial ErXhQ, but it does not seem to indicate the reason for the failure.
Epoch 0: 49%|████▉ | 293/600 [00:10<00:11, 27.32it/s, loss=1.65, v_num=2, train_loss_0=1.790, train_0_acc=0.330]
Epoch 1: 67%|██████▋ | 400/600 [00:16<00:08, 23.99it/s, loss=1.44, v_num=0, train_loss_0=1.470, train_0_acc=0.530, val_loss_0=6.640, val_0_acc=0.100]
[2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.
[2022-04-25 09:51:40] INFO (nni_syslog_runner_runner_q5sji/MainThread) [2022-04-25 09:51:40.406714] INFO ErXhQ: subprocess terminated. Exit code is 1.
[2022-04-25 09:51:40.407191] INFO ErXhQ: clean up trial
[2022-04-25 09:51:40] INFO (nni_syslog_runner_runner_q5sji/MainThread) [2022-04-25 09:51:40.407191] INFO ErXhQ: clean up trial
Could you provide some insights on this issue? Thank you.
Since Retiarii does not check the feasibility of mutations, some mutated models may be invalid. For a trial running multiple models with de-duplicated input, if any model is invalid, Retiarii disables CGO and falls back to running each model individually in a separate trial.
In your case, the failure of the first trial, ZUzv4, may be due to a shape mismatch introduced by mutation. You may find errors in trialrunner_stdout like:
RuntimeError: Given groups=1, weight of size [1280, 24, 1, 1], expected input[100, 30, 8, 8] to have 24 channels, but got 30 channels instead
But it is strange that a trial fails without printing an error log. Could you please share the file /tmp/nni-experiments/[exp_id]/trials/ErXhQ/parameter.cfg so that I can debug it?
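Searching every trial's log at once can be quicker than opening the files one by one. The sketch below walks a trials directory and collects lines that look like Python errors. It assumes each trial folder contains a file named trialrunner_stdout directly inside it, as the paths in this thread suggest; the function name and the regular expression are illustrative choices, not part of NNI.

```python
import os
import re

# Lines that start a traceback or name an exception such as "RuntimeError: ..."
ERROR_PATTERN = re.compile(r"(Traceback \(most recent call last\)|\w+Error: )")


def find_trial_errors(trials_dir, log_name="trialrunner_stdout"):
    """Map each trial id to the error-like lines found in its log file."""
    results = {}
    for trial_id in sorted(os.listdir(trials_dir)):
        log_path = os.path.join(trials_dir, trial_id, log_name)
        if not os.path.isfile(log_path):
            continue  # skip trials without a log (assumed layout)
        with open(log_path, encoding="utf-8", errors="replace") as handle:
            hits = [line.rstrip("\n") for line in handle if ERROR_PATTERN.search(line)]
        if hits:
            results[trial_id] = hits
    return results
```

Pointing it at /tmp/nni-experiments/[exp_id]/trials (with the real experiment id filled in) would list only the trials whose logs contain such lines.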
Thanks for your quick response. I found the error message of ErXhQ:
RuntimeError: Given groups=1, weight of size [1280, 30, 1, 1], expected input[100, 18, 8, 8] to have 30 channels, but got 18 channels instead
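To read this message: a convolution weight has shape [out_channels, in_channels / groups, kH, kW], so this layer expects 30 input channels while the preceding layer produced 18. Because Retiarii does not check the feasibility of mutations, a rough pre-check like the one below can flag such a mismatch before a trial is launched. It is a hypothetical helper, not a Retiarii API, and it only covers a plain sequential stack with groups=1.

```python
def find_channel_mismatch(input_channels, layers):
    """Return the first mismatch as (layer_name, expected, got), or None.

    layers is a list of (name, in_channels, out_channels) tuples describing
    a sequential stack of convolutions with groups=1 (an assumed layout).
    """
    current = input_channels
    for name, in_channels, out_channels in layers:
        if in_channels != current:
            return (name, in_channels, current)
        current = out_channels
    return None


# A mutation picked 18 output channels for one layer, but the next layer
# was built for 30 input channels, as in the error message above.
mutated = [("stem", 3, 18), ("head", 30, 1280)]
consistent = [("stem", 3, 30), ("head", 30, 1280)]
```

Here find_channel_mismatch(3, mutated) reports the head layer, while the consistent stack passes.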
How can I get the best configuration when all the trials are finished? It does not seem to print the best trial as the base execution_engine does.
Describe the issue:
nni/test/retiarii_test/
folder, but both of them cannot be executed on my platform. I just want to dive deeper into this promising feature.
For
cgo_mnasnet/test.py
remote
mode?
Environment: