Multi-trial NAS on custom dataset and model with tuple input

ding3820 commented 3 years ago

Describe the issue: Hi, I've tried to follow the instructions from here with my custom dataset and model. The base model and dataset work well with regular training/testing and also with NNI's one-shot DARTS trainer. Since one-shot NAS does not support ValueChoice, I turned to multi-trial NAS. In the documentation I found a debugging function, evaluator._execute(model), which works fine for my model and dataset. But after I execute exp.run(), I get an error message indicating a problem with the tuple input. Here's the complete message:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_16172/4265677498.py in <module>
----> 1 exp.run(exp_config, 8118)
      2 print('Final model:')
      3 for model_code in exp.export_top_models(formatter=export_formatter):
      4     print(model_code)

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py in run(self, config, port, debug)
    287             assert config is not None, 'You are using classic search mode, config cannot be None!'
    288             self.config = config
--> 289             self.start(port, debug)
    290 
    291     def _check_exp_status(self) -> bool:

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py in start(self, port, debug)
    258         exp_status_checker = Thread(target=self._check_exp_status)
    259         exp_status_checker.start()
--> 260         self._start_strategy()
    261         # TODO: the experiment should be completed, when strategy exits and there is no running job
    262         _logger.info('Waiting for experiment to become DONE (you can ctrl+c if there is no running trial jobs)...')

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py in _start_strategy(self)
    188         base_model_ir, self.applied_mutators = preprocess_model(
    189             self.base_model, self.trainer, self.applied_mutators, full_ir=self.config.execution_engine != 'py',
--> 190             dummy_input=self.config.dummy_input)
    191 
    192         _logger.info('Start strategy...')

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py in preprocess_model(base_model, trainer, applied_mutators, full_ir, dummy_input)
    122         except Exception as e:
    123             _logger.error('Your base model cannot be parsed by torch.jit.script, please fix the following error:')
--> 124             raise e
    125         if dummy_input is not None:
    126             # FIXME: this is a workaround as full tensor is not supported in configs

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py in preprocess_model(base_model, trainer, applied_mutators, full_ir, dummy_input)
    119     if full_ir:
    120         try:
--> 121             script_module = torch.jit.script(base_model)
    122         except Exception as e:
    123             _logger.error('Your base model cannot be parsed by torch.jit.script, please fix the following error:')

~/anaconda3/envs/nni/lib/python3.7/site-packages/torch/jit/_script.py in script(obj, optimize, _frames_up, _rcb)
   1095         obj = call_prepare_scriptable_func(obj)
   1096         return torch.jit._recursive.create_script_module(
-> 1097             obj, torch.jit._recursive.infer_methods_to_compile
   1098         )
   1099 

~/anaconda3/envs/nni/lib/python3.7/site-packages/torch/jit/_recursive.py in create_script_module(nn_module, stubs_fn, share_types)
    410     concrete_type = get_module_concrete_type(nn_module, share_types)
    411     AttributeTypeIsSupportedChecker().check(nn_module)
--> 412     return create_script_module_impl(nn_module, concrete_type, stubs_fn)
    413 
    414 def create_script_module_impl(nn_module, concrete_type, stubs_fn):

~/anaconda3/envs/nni/lib/python3.7/site-packages/torch/jit/_recursive.py in create_script_module_impl(nn_module, concrete_type, stubs_fn)
    476     # Compile methods if necessary
    477     if concrete_type not in concrete_type_store.methods_compiled:
--> 478         create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
    479         # Create hooks after methods to ensure no name collisions between hooks and methods.
    480         # If done before, hooks can overshadow methods that aren't exported.

~/anaconda3/envs/nni/lib/python3.7/site-packages/torch/jit/_recursive.py in create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
    353     property_rcbs = [p.resolution_callback for p in property_stubs]
    354 
--> 355     concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
    356 
    357 def create_hooks_from_stubs(concrete_type, hook_stubs, pre_hook_stubs):

RuntimeError: 
Tensor (inferred) cannot be used as a tuple:
  File "/tmp/ipykernel_16172/2935404974.py", line 60
    def forward(self, x):
        x_non_serial, x_serial = x
                                 ~ <--- HERE
        batch_size = x_serial.size(0)

The way I construct my dataset class is pretty naive. My dataset consists of two kinds of data with different sizes. I simply return these two inputs as a tuple in __getitem__. The structure of the code looks like this:

class MyDataset(data.Dataset):
    def __init__(self, mode="train"):
        ...  # skipped

    def __getitem__(self, idx):
        # Both inputs are packed into one tuple; the target is returned separately.
        return (self.part1[idx], self.part2[idx]), self.target[idx]

As far as I know, the first object returned by __getitem__ is passed to the model as input and the second one is the target used for evaluation. Besides that, my dataset class doesn't work with serialize; it fails with the error message below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_16172/950271510.py in <module>
----> 1 serialize(MyDataset, model="train")

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/serializer.py in serialize(cls, *args, **kwargs)
    144         self.op = serialize(MyCustomOp, hidden_units=128)
    145     """
--> 146     return serialize_cls(cls)(*args, **kwargs)
    147 
    148 

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/serializer.py in serialize_cls(cls)
    126     To create an serializable class.
    127     """
--> 128     return _create_wrapper_cls(cls)
    129 
    130 

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/serializer.py in _create_wrapper_cls(cls, store_init_parameters, reset_mutation_uid, stop_parsing)
    114             super().__init__(*args, **kwargs)
    115 
--> 116     wrapper.__module__ = get_module_name(cls)
    117     wrapper.__name__ = cls.__name__
    118     wrapper.__qualname__ = cls.__qualname__

~/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/utils.py in get_module_name(cls_or_func)
     41         # infer the module name with inspect
     42         for frm in inspect.stack():
---> 43             if inspect.getmodule(frm[0]).__name__ == '__main__':
     44                 # main module found
     45                 main_file_path = Path(inspect.getsourcefile(frm[0]))

AttributeError: 'NoneType' object has no attribute '__name__'

I don't know whether the serialize function matters. @model_wrapper and @basic_unit also show a similar error. I'm not sure how NNI parses the data, and currently I have no clue where to fix this error. Could you please give me some direction? Thanks.

Environment:

NNI version: 2.4
Training service (local|remote|pai|aml|etc): local
Client OS: Linux
Server OS (for remote mode only):
Python version: 3.7.11
PyTorch/TensorFlow version: PyTorch 1.9.1
Is conda/virtualenv/venv used?: conda
Is running in Docker?: yes

ultmaster commented 3 years ago

Are you in a Jupyter or IPython environment? model_wrapper-like serializer objects do not work in interactive Python for now. You can define them in a separate Python file.
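
The AttributeError in the serialize traceback comes from calling `.__name__` on the result of `inspect.getmodule`, which returns `None` when a frame's code was not loaded from an importable module file, as is the case for notebook cells. The standard-library sketch below reproduces that behaviour and shows a None-tolerant lookup. The `"<notebook-cell>"` filename and the helper name `module_name_of_frame` are made up for illustration, and this is not NNI's actual fix.

```python
import inspect


def module_name_of_frame(frame):
    """Return the module name for a frame, or None if it cannot be resolved."""
    module = inspect.getmodule(frame)
    # Guard against None instead of dereferencing it unconditionally.
    return module.__name__ if module is not None else None


# Simulate a notebook cell: code compiled from a pseudo-filename, not a module file.
cell_source = "import inspect\nframe = inspect.currentframe()\n"
cell_globals = {}
exec(compile(cell_source, "<notebook-cell>", "exec"), cell_globals)
cell_frame = cell_globals["frame"]

print(inspect.getmodule(cell_frame))     # no module can be resolved for this frame
print(module_name_of_frame(cell_frame))  # the guarded helper does not raise
```

Moving the decorated classes into a regular `.py` file avoids the problem because every frame on the stack then maps back to a real module.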

ding3820 commented 3 years ago

Hi, thanks for your prompt response. Yes, I was running in a Jupyter notebook. I wrapped the original model with an outer interface class and used the serialize function for my dataset. Then I moved the code into a Python script, and it works. So the takeaway here is to write a Python script if you want to use the graph-based execution engine.
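
The "outer interface class" idea can be sketched without any framework: an adapter receives the single `(x_non_serial, x_serial)` pair produced by the dataset and forwards it as two positional arguments. The class name `TupleInputAdapter` and the toy inner model are hypothetical; in real code the adapter would subclass `nn.Module` and implement `forward` instead of `__call__`.

```python
class TupleInputAdapter:
    """Expose a two-argument model as a callable taking one (a, b) pair."""

    def __init__(self, inner):
        self.inner = inner  # any callable taking (x_non_serial, x_serial)

    def __call__(self, x):
        # The dataset packs both inputs into one object, so unpack them here.
        x_non_serial, x_serial = x
        return self.inner(x_non_serial, x_serial)


def toy_model(x_non_serial, x_serial):
    # Stand-in for the real network: just report the sizes of both inputs.
    return len(x_non_serial), len(x_serial)


model = TupleInputAdapter(toy_model)
# Shaped like the dataset's __getitem__: ((part1, part2), target).
sample_input, target = ([0.1, 0.2, 0.3], [[1.0], [2.0]]), 1
print(model(sample_input))
```

Note that TorchScript treats unannotated `forward` arguments as tensors, which is what the "Tensor (inferred) cannot be used as a tuple" message reports. If such an adapter has to pass through `torch.jit.script`, the argument would need an explicit annotation such as `Tuple[torch.Tensor, torch.Tensor]`.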

But there's another issue with TPEStrategy. I ran the new Python script, and here is what I got:

Traceback (most recent call last):
  File "nas.py", line 210, in <module>
    exp.run(exp_config, 8118)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 289, in run
    self.start(port, debug)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 260, in start
    self._start_strategy()
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/experiment/pytorch.py", line 193, in _start_strategy
    self.strategy.run(base_model_ir, self.applied_mutators)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/strategy/tpe_strategy.py", line 66, in run
    self.tpe_sampler.update_sample_space(sample_space)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/strategy/tpe_strategy.py", line 26, in update_sample_space
    self.tpe_tuner.update_search_space(search_space)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/algorithms/hpo/hyperopt_tuner.py", line 258, in update_search_space
    pass_expr_memo_ctrl=None)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/hyperopt/base.py", line 790, in __init__
    pyll.toposort(self.expr)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 715, in toposort
    assert order[-1] == expr
IndexError: list index out of range

I also tried the Random strategy, which doesn't show any error message, but the trials show as failed in the trial panel of the WebUI. This is a bit confusing, so maybe I should share my implementation.

@basic_unit
class FCLayer(nn.Module):
    def __init__(self, input_size, hidden_dim, label, n):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerChoice([
                nn.BatchNorm1d(input_size),
                nn.Identity()
            ], label=label+'_bn_'+str(n)),
            nn.Linear(input_size, hidden_dim),
            nn.LayerChoice([
                AutoActivation(label=label+'_autoac_'+str(n)),
                nn.Identity()
            ], label=label+'_ac_'+str(n)),
        )

    def forward(self, x):
        out = self.net(x)
        return out

@basic_unit
class LSTMDNN(nn.Module):
    def __init__(self, serial_in, nonserial_in):
        super().__init__()

        serial_hidden_dim = nn.ValueChoice([8, 16, 32, 64], label='serial_hidden')
        nonserial_hidden_dim = nn.ValueChoice([8, 16, 32, 64], label='nonserial_hidden')
        comb_hidden_dim = nn.ValueChoice([16, 32, 64, 128], label='comb_hidden')
        self.lstm_n_layers = 1
        self.lstm = nn.LSTM(input_size=serial_in, hidden_size=7, num_layers=self.lstm_n_layers, batch_first=True)
        serial_nn = []
        serial_nn.append(FCLayer(7, serial_hidden_dim, 'serial', 0))
        for i in range(2):
            serial_nn.append(nn.LayerChoice([
                FCLayer(serial_hidden_dim, serial_hidden_dim, 'serial', i+1),
                nn.Identity()
            ], label='serial_nn_{}'.format(str(i))))
        serial_nn.append(FCLayer(serial_hidden_dim, 16, 'serial', 3))
        self.serial_nn = nn.Sequential(*serial_nn)

        nonserial_nn = []
        nonserial_nn.append(FCLayer(nonserial_in, nonserial_hidden_dim, 'nonserial', 0))
        for i in range(2):
            nonserial_nn.append(nn.LayerChoice([
                FCLayer(nonserial_hidden_dim, nonserial_hidden_dim, 'nonserial', i+1),
                nn.Identity()
            ], label='nonserial_nn_{}'.format(str(i))))
        nonserial_nn.append(FCLayer(nonserial_hidden_dim, 16, 'nonserial', 3))
        self.nonserial_nn = nn.Sequential(*nonserial_nn)

        comb_nn = []
        comb_nn.append(FCLayer(32, comb_hidden_dim, 'comb', 0))
        for i in range(3):
            comb_nn.append(nn.LayerChoice([
                FCLayer(comb_hidden_dim, comb_hidden_dim, 'comb', i+1),
                nn.Identity()
            ], label='comb_nn_{}'.format(str(i))))
        self.comb_nn = nn.Sequential(*comb_nn)

        self.final_fc = nn.Linear(comb_hidden_dim, 2)

    def forward(self, x_non_serial, x_serial):
        batch_size = x_serial.size(0)
        lstm_hidden = self.init_hidden(batch_size)
        # serial data
        lstm_out, lstm_hidden = self.lstm(x_serial, lstm_hidden)
        lstm_out = lstm_out[:, -1, :]  # take the last output
        lstm_out = lstm_out.contiguous().view(-1, 7)

        serial_out = self.serial_nn(lstm_out.clone())
        serial_out = serial_out.view(batch_size, -1)

        # non-serial data
        nonserial_out = self.nonserial_nn(x_non_serial)

        comb_out = torch.cat([serial_out, nonserial_out], dim=1)
        comb_out = self.comb_nn(comb_out)

        out = self.final_fc(comb_out)

        return out

    def init_hidden(self, batch_size: int):

        hidden = (torch.zeros(self.lstm_n_layers, batch_size, 7).cuda(),
                  torch.zeros(self.lstm_n_layers, batch_size, 7).cuda())
        return hidden

class Net(nn.Module):
    def __init__(self, serial_in, nonserial_in):
        super().__init__()
        self.model = LSTMDNN(serial_in, nonserial_in)

    def forward(self, x_non_serial, x_serial):
        out = self.model(x_non_serial, x_serial)
        return out

simple_strategy = strategy.TPEStrategy()
trainer = pl.Classification(train_dataloader=pl.DataLoader(train_dataset, batch_size=256),
                                          val_dataloaders=pl.DataLoader(val_dataset, batch_size=256),
                                          learning_rate=0.001,
                                          weight_decay=0.0001,
                                          max_epochs=2, gpus=[0])
model = Net(7, 84)
exp = RetiariiExperiment(model, trainer, [], simple_strategy)
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'Sepsis search'
exp_config.trial_concurrency = 2
exp_config.max_trial_number = 20
exp_config.trial_gpu_number = 1
exp_config.max_experiment_duration = '1h'
exp_config.execution_engine = 'base'
exp_config.training_service.use_active_gpu = True
export_formatter = 'dict'

exp.run(exp_config, 8118)
print('Final model:')
for model_code in exp.export_top_models(formatter=export_formatter):
    print(model_code)

It would be a great help if you could point me in the right direction to fix this problem. Thanks a lot.

ultmaster commented 3 years ago

I also tried the Random strategy, which doesn't show any error message, but the trials show as failed in the trial panel of the WebUI. This is a bit confusing, so maybe I should share my implementation.

Maybe you should check the log of each trial?

IndexError: list index out of range

This looks strange. If you check the implementation of the TPE/Random strategy, there is one line where the strategy parses the search space via a dry run. If you can provide that dry-run search space, it will be much more helpful.
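
The final frame of the TPE traceback, `assert order[-1] == expr`, raises IndexError rather than AssertionError, which means `order` was an empty list at that point. One plausible reading, not confirmed in this thread, is that the dry run produced an empty search space. A hypothetical guard (the function name and the dictionary layout are assumptions, not NNI or hyperopt API) would surface that earlier with a clearer message:

```python
def require_nonempty_search_space(search_space):
    """Fail early with a readable message if no mutable choices were found."""
    if not search_space:
        raise ValueError(
            "Search space is empty: no LayerChoice/ValueChoice was discovered. "
            "Check that the choices are visible to the dry run."
        )
    return search_space


# Indexing the last element of an empty list is what produces the original error.
try:
    [][-1]
except IndexError as err:
    print("empty list:", err)

# With the guard, an empty space is reported before any tuner sees it.
try:
    require_nonempty_search_space({})
except ValueError as err:
    print("guard:", err)

# A non-empty space passes through unchanged (label borrowed from the posted model).
space = {"serial_hidden": [8, 16, 32, 64]}
print(require_nonempty_search_space(space))
```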

ding3820 commented 3 years ago

Hi, I tried the Pure-python Execution Engine, where I wrap the whole PyTorch model with @model_wrapper instead of using @basic_unit. I also changed the experiment config to:

exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'test search'
exp_config.trial_concurrency = 2
exp_config.max_trial_number = 50
exp_config.max_experiment_duration = '1h'

This way, TPE does not show any error message in the console. However, the trials on the WebUI still show as failed. If the search space you mentioned is the one on the WebUI, it is completely blank.

And here are the logs for the trials; they don't show any error or the reason for the failure. dispatcher.log nnimanager.log

I think I'm almost there. Probably just a few changes could make this work. Thanks!

ding3820 commented 3 years ago

I just found the stderr for each trial. It seems that the trials fail because no GPU is found. Here's the complete stderr:

/root/anaconda3/envs/nni/lib/python3.7/site-packages/deprecate/deprecation.py:115: LightningDeprecationWarning: The `Accuracy` was deprecated since v1.3.0 in favor of `torchmetrics.classification.accuracy.Accuracy`. It will be removed in v1.5.0.
  stream(template_mgs % msg_args)
Traceback (most recent call last):
  File "/root/anaconda3/envs/nni/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/nni/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/trial_entry.py", line 25, in <module>
    engine.trial_execute_graph()
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/execution/python.py", line 40, in trial_execute_graph
    graph_data = PythonGraphData.load(receive_trial_parameters())
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/integration_api.py", line 44, in receive_trial_parameters
    params = json_loads(json.dumps(params))
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/nonp.py", line 236, in loads
    return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
  File "/root/anaconda3/envs/nni/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/root/anaconda3/envs/nni/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/root/anaconda3/envs/nni/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/decoders.py", line 44, in __call__
    map = hook(map, properties=self.properties)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/json_tricks/utils.py", line 66, in wrapper
    return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/nni/retiarii/serializer.py", line 46, in _serialize_class_instance_decode
    return import_(obj['__type__'])(**obj['arguments'])
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 346, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 91, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/root/anaconda3/envs/nni/lib/python3.7/site-packages/pytorch_lightning/utilities/device_parser.py", line 164, in _sanitize_gpu_ids
    f"You requested GPUs: {gpus}\n But your machine only has: {all_available_gpus}"
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []

However, at the beginning of the process, it did find the GPU:

[2021-10-12 14:01:58] INFO (pytorch_lightning.utilities.distributed/Thread-2) GPU available: True, used: True

ultmaster commented 3 years ago

I think you need to set exp_config.trial_gpu_number = 1. In your log, I see your GPU is completely disabled.
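
A quick way to confirm this from inside a trial is to inspect `CUDA_VISIBLE_DEVICES`, the environment variable CUDA uses to restrict which devices a process can see. The helper below is a hypothetical diagnostic, assuming the training service hides GPUs through that variable when no GPU is requested; it only parses the value and does not query the driver.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES.

    Returns None if the variable is unset (no restriction), otherwise the list
    of device identifiers that appear before the first invalid entry.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    ids = []
    for token in raw.split(","):
        token = token.strip()
        # An empty or negative entry hides that device and everything after it.
        if not token or token.startswith("-"):
            break
        ids.append(token)
    return ids


print(visible_gpu_ids({}))                               # unset: no restriction
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "-1"}))   # all GPUs hidden
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))  # two GPUs exposed
```

Calling `visible_gpu_ids()` with no argument at the top of the trial code reads the real environment; an empty list there would match the "your machine only has: []" message in the stderr.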

ding3820 commented 3 years ago

Thanks for your response. I figured out the GPU issue earlier today, and the experiments seem fine now. To sum up what I did: I followed the instructions for the Pure-python Execution Engine and the settings here. But I still don't know how to make the Graph-based Execution Engine work, which is what all my earlier attempts were about. The documentation and sample code seem misleading. Anyway, thanks for your contribution. This work is amazing.

microsoft / nni

Multi-trial NAS on custom dataset and model with tuple input #4237