Closed ding3820 closed 3 years ago
Are you in jupyter or an ipython environment? model_wrapper-like serializer objects do not work in interactive python for now. You can define them in a separate python file.
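A simplified sketch of why a separate file helps, assuming the serializer records a class's import path plus its constructor arguments and re-imports the class inside the trial process. The helpers encode_instance and decode_instance are hypothetical names for illustration; this is not NNI's actual implementation:

```python
import importlib
import json
from fractions import Fraction


def encode_instance(cls, **kwargs):
    """Record where a class lives and how it was constructed (hypothetical helper)."""
    return json.dumps({
        "__type__": f"{cls.__module__}.{cls.__qualname__}",
        "arguments": kwargs,
    })


def decode_instance(payload):
    """Re-import the class by its dotted path and call it with the saved arguments."""
    obj = json.loads(payload)
    module_name, _, class_name = obj["__type__"].rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**obj["arguments"])


# A class that lives in an importable module round-trips without trouble.
payload = encode_instance(Fraction, numerator=3, denominator=4)
restored = decode_instance(payload)
print(payload)
print(restored)
```

A class defined in a notebook cell reports `__main__` as its module, so a separate trial process generally cannot import it by that path. A class defined in its own .py file can be imported from anywhere that file is on the path.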
Hi,
Thanks for your prompt response. Yes, I was running in a Jupyter notebook. I tried wrapping the original model in an outer interface class and using the serialize function for my dataset. After I moved the code into a Python script, it worked. So the takeaway is to write a Python script if you want to use the graph-based execution engine.
However, there is another issue with TPEStrategy. When I run the new Python script, this is what I get:
Traceback (most recent call last):
  File "nas.py", line 210, in <module>
    exp.run(exp_config, 8118)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 289, in run
    self.start(port, debug)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 260, in start
    self._start_strategy()
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 193, in _start_strategy
    self.strategy.run(base_model_ir, self.applied_mutators)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/strategy/tpe_strategy.py", line 66, in run
    self.tpe_sampler.update_sample_space(sample_space)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/strategy/tpe_strategy.py", line 26, in update_sample_space
    self.tpe_tuner.update_search_space(search_space)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/algorithms/hpo/hyperopt_tuner.py", line 258, in update_search_space
    pass_expr_memo_ctrl=None)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/hyperopt/base.py", line 790, in __init__
    pyll.toposort(self.expr)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 715, in toposort
    assert order[-1] == expr
IndexError: list index out of range
I also tried the Random strategy, which doesn't print any error message, but the trials show as failed in the trial panel on the WebUI. This is a bit confusing, so maybe I should share my implementation:
@basic_unit
class FCLayer(nn.Module):
    def __init__(self, input_size, hidden_dim, label, n):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerChoice([
                nn.BatchNorm1d(input_size),
                nn.Identity()
            ], label=label+'_bn_'+str(n)),
            nn.Linear(input_size, hidden_dim),
            nn.LayerChoice([
                AutoActivation(label=label+'_autoac_'+str(n)),
                nn.Identity()
            ], label=label+'_ac_'+str(n)),
        )

    def forward(self, x):
        out = self.net(x)
        return out

@basic_unit
class LSTMDNN(nn.Module):
    def __init__(self, serial_in, nonserial_in):
        super().__init__()
        serial_hidden_dim = nn.ValueChoice([8, 16, 32, 64], label='serial_hidden')
        nonserial_hidden_dim = nn.ValueChoice([8, 16, 32, 64], label='nonserial_hidden')
        comb_hidden_dim = nn.ValueChoice([16, 32, 64, 128], label='comb_hidden')
        self.lstm_n_layers = 1
        self.lstm = nn.LSTM(input_size=serial_in, hidden_size=7, num_layers=self.lstm_n_layers, batch_first=True)
        serial_nn = []
        serial_nn.append(FCLayer(7, serial_hidden_dim, 'serial', 0))
        for i in range(2):
            serial_nn.append(nn.LayerChoice([
                FCLayer(serial_hidden_dim, serial_hidden_dim, 'serial', i+1),
                nn.Identity()
            ], label='serial_nn_{}'.format(str(i))))
        serial_nn.append(FCLayer(serial_hidden_dim, 16, 'serial', 3))
        self.serial_nn = nn.Sequential(*serial_nn)
        nonserial_nn = []
        nonserial_nn.append(FCLayer(nonserial_in, nonserial_hidden_dim, 'nonserial', 0))
        for i in range(2):
            nonserial_nn.append(nn.LayerChoice([
                FCLayer(nonserial_hidden_dim, nonserial_hidden_dim, 'nonserial', i+1),
                nn.Identity()
            ], label='nonserial_nn_{}'.format(str(i))))
        nonserial_nn.append(FCLayer(nonserial_hidden_dim, 16, 'nonserial', 3))
        self.nonserial_nn = nn.Sequential(*nonserial_nn)
        comb_nn = []
        comb_nn.append(FCLayer(32, comb_hidden_dim, 'comb', 0))
        for i in range(3):
            comb_nn.append(nn.LayerChoice([
                FCLayer(comb_hidden_dim, comb_hidden_dim, 'comb', i+1),
                nn.Identity()
            ], label='comb_nn_{}'.format(str(i))))
        self.comb_nn = nn.Sequential(*comb_nn)
        self.final_fc = nn.Linear(comb_hidden_dim, 2)

    def forward(self, x_non_serial, x_serial):
        batch_size = x_serial.size(0)
        lstm_hidden = self.init_hidden(batch_size)
        # serial data
        lstm_out, lstm_hidden = self.lstm(x_serial, lstm_hidden)
        lstm_out = lstm_out[:, -1, :]  # take the last output
        lstm_out = lstm_out.contiguous().view(-1, 7)
        serial_out = self.serial_nn(lstm_out.clone())
        serial_out = serial_out.view(batch_size, -1)
        # non-serial data
        nonserial_out = self.nonserial_nn(x_non_serial)
        comb_out = torch.cat([serial_out, nonserial_out], dim=1)
        comb_out = self.comb_nn(comb_out)
        out = self.final_fc(comb_out)
        return out

    def init_hidden(self, batch_size: int):
        hidden = (torch.zeros(self.lstm_n_layers, batch_size, 7).cuda(),
                  torch.zeros(self.lstm_n_layers, batch_size, 7).cuda())
        return hidden

class Net(nn.Module):
    def __init__(self, serial_in, nonserial_in):
        super().__init__()
        self.model = LSTMDNN(serial_in, nonserial_in)

    def forward(self, x_non_serial, x_serial):
        out = self.model(x_non_serial, x_serial)
        return out
simple_strategy = strategy.TPEStrategy()
trainer = pl.Classification(train_dataloader=pl.DataLoader(train_dataset, batch_size=256),
                            val_dataloaders=pl.DataLoader(val_dataset, batch_size=256),
                            learning_rate=0.001,
                            weight_decay=0.0001,
                            max_epochs=2, gpus=[0])
model = Net(7, 84)
exp = RetiariiExperiment(model, trainer, [], simple_strategy)
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'Sepsis search'
exp_config.trial_concurrency = 2
exp_config.max_trial_number = 20
exp_config.trial_gpu_number = 1
exp_config.max_experiment_duration = '1h'
exp_config.execution_engine = 'base'
exp_config.training_service.use_active_gpu = True
export_formatter = 'dict'
exp.run(exp_config, 8118)
print('Final model:')
for model_code in exp.export_top_models(formatter=export_formatter):
    print(model_code)
It would be a great help if you could point me in the right direction to fix this problem. Thanks a lot.
"I also try with Random strategy which doesn't show any error message but it shows fail in the trial panel on WebUI. This is a bit confusing maybe I should share my implementation."
Maybe you should check the log of each trial?
"IndexError: list index out of range"
This looks strange. If you check the implementation of the TPE/Random strategy, there is a line where the strategy parses the search space via a dry run. If you can provide that dry-run search space, it will be much more helpful.
Hi,
I tried the Pure-python Execution Engine, wrapping the whole PyTorch model with @model_wrapper instead of using @basic_unit. I also changed the experiment config to:
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'test search'
exp_config.trial_concurrency = 2
exp_config.max_trial_number = 50
exp_config.max_experiment_duration = '1h'
With this setup, TPE no longer shows any error message in the console. However, the trials on the WebUI still show as failed. If the search space you mentioned is the one shown on the WebUI, it is completely blank.
Here are the logs for the trials; they don't show any error or the reason for the failure: dispatcher.log nnimanager.log
I think I'm almost there. Probably just a few changes would make this work. Thanks!
I just found the stderr for each trial. It seems that the trials failed because no GPU was found. Here's the complete stderr:
/root/anaconda3/envs/nni/lib/python3.7/site-packages/deprecate/deprecation.py:115: LightningDeprecationWarning: The `Accuracy` was deprecated since v1.3.0 in favor of `torchmetrics.classification.accuracy.Accuracy`. It will be removed in v1.5.0.
  stream(template_mgs % msg_args)
Traceback (most recent call last):
  File "/root/anaconda3/envs/nni/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/nni/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/trial_entry.py", line 25, in <module>
    engine.trial_execute_graph()
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/execution/python.py", line 40, in trial_execute_graph
    graph_data = PythonGraphData.load(receive_trial_parameters())
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/integration_api.py", line 44, in receive_trial_parameters
    params = json_loads(json.dumps(params))
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/nonp.py", line 236, in loads
    return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
  File "/root/anaconda3/envs/nni/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/root/anaconda3/envs/nni/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/root/anaconda3/envs/nni/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/decoders.py", line 44, in __call__
    map = hook(map, properties=self.properties)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/utils.py", line 66, in wrapper
    return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/serializer.py", line 46, in _serialize_class_instance_decode
    return import_(obj['__type__'])(**obj['arguments'])
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 346, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 91, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 164, in _sanitize_gpu_ids
    f"You requested GPUs: {gpus}\n But your machine only has: {all_available_gpus}"
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []
However, at the beginning of the process, it did find the GPU:
[2021-10-12 14:01:58] INFO (pytorch_lightning.utilities.distributed/Thread-2) GPU available: True, used: True
I think you need to set exp_config.trial_gpu_number = 1. In your log, I see your GPU is completely disabled.
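One way to check what a trial process can actually see is to inspect CUDA_VISIBLE_DEVICES from inside the trial. This is a minimal sketch that assumes GPUs are hidden from trials through that environment variable (the thread does not confirm the mechanism); visible_gpu_ids is a hypothetical helper, not part of NNI or PyTorch Lightning:

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset: the variable places no restriction on which GPUs are visible.
        return None
    # An empty string yields an empty list, meaning no GPU is exposed.
    return [part.strip() for part in raw.split(",") if part.strip()]


# Print the value seen by the current process; run this inside a trial to compare.
print("CUDA_VISIBLE_DEVICES ->", visible_gpu_ids())
```

If this prints an empty list inside a trial while the launching process sees a GPU, the restriction comes from the trial's environment, not from the hardware.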
Thanks for your response. I figured out the GPU issue earlier today, and the experiments seem fine now. To sum up what I have done: I followed the instructions for the Pure-python Execution Engine and the settings here. But I still don't know how to make the Graph-based Execution Engine work, which is what all of my earlier attempts were aimed at. The documentation and sample code seem misleading. Anyway, thanks for your contribution. This work is amazing.
Describe the issue: Hi, I've tried to follow the instructions from here with my custom dataset and model. The base model and dataset work well with regular training/testing and also with NNI's one-shot DARTS trainer. Since one-shot NAS does not support ValueChoice, I turned to multi-trial NAS. In the documentation, I found a debugging function, evaluator._execute(model), which works fine for my model and dataset. But after I execute exp.run(), I get an error message indicating a problem with the tuple input. Here's the complete message:
How I construct my dataset class is pretty naive. My dataset consists of two kinds of data of different sizes. I simply return these two inputs as a tuple in __getitem__. The structure of the code looks like below:
As far as I know, the first object returned by __getitem__ will be passed as the input to the model, and the second one will be the target it is evaluated against. Besides that, my dataset class doesn't work with serialize, failing with the error message below:
I don't know whether the serialize function matters. @model_wrapper and @basic_unit also show a similar error. I'm not sure how NNI parses the data. Currently I have no clue how to fix this error. Could you please give me some direction? Thanks.
Environment: