microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Can't use OrderedDict inside nn.LayerChoice when using ProxylessTrainer #5079

Open AL3708 opened 2 years ago

AL3708 commented 2 years ago

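ProxylessTrainer only works when the op candidates inside nn.LayerChoice are passed as a list; an OrderedDict cannot be used. This is because the latency predictor looks up each candidate by its position in the list (used as a string key), so candidates registered under custom names are not found. This is inconsistent with the documentation, which says that both forms can be used.

For example, if the following block is used: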

class ConvBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.LayerChoice(OrderedDict([
            # conv block is standard Conv-bn-act
            ('3x3', ConvBlock(in_channels, out_channels, kernel_size=3)),
            ('1x3', ConvBlock(in_channels, out_channels, kernel_size=(1, 3))),
            ('3x1', ConvBlock(in_channels, out_channels, kernel_size=(3, 1))),
            ('3x3_sep', ConvBlock(in_channels, out_channels, kernel_size=3, groups=in_channels)),
            ('identity', Identity())
        ]))

Then an error is thrown:

Traceback (most recent call last):
  File "C:\Users\...\proxylessnas.py", line 373, in <module>
    main()
  File "C:\Users\...\proxylessnas.py", line 359, in main
    trainer.fit()
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 363, in fit
    self._train_one_epoch(i)
  File "C:\Users\...\proxylessnas.py", line 295, in _train_one_epoch
    logits, loss = self._logits_and_loss_for_arch_update(val_X, val_y)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 330, in _logits_and_loss_for_arch_update
    expected_latency = self.latency_estimator.cal_expected_latency(current_architecture_prob)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in cal_expected_latency
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in <listcomp>
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
KeyError: '0'
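
The failing lookup in the traceback can be reproduced outside NNI with a small sketch. This is not NNI's code: the `expected_latency` helper, the latency values, and the assumption that the latency table ends up keyed by the OrderedDict names are all hypothetical and only illustrate why a key of `'0'` would be missing.

```python
from collections import OrderedDict


def expected_latency(probs, latency_by_key, keys):
    """Probability-weighted sum of latencies, looking each candidate up by key."""
    return sum(p * latency_by_key[k] for p, k in zip(probs, keys))


# Hypothetical latency table for one LayerChoice, keyed by candidate name
# (assumed to mirror what happens when an OrderedDict is passed).
candidates = OrderedDict([("3x3", 4.0), ("1x3", 2.0), ("identity", 0.0)])
probs = [0.5, 0.25, 0.25]

# Index-based keys ("0", "1", ...), like str(i) in the traceback:
# the lookup fails because the table only knows the candidate names.
index_keys = [str(i) for i in range(len(probs))]
try:
    expected_latency(probs, candidates, index_keys)
    index_lookup_error = None
except KeyError as exc:
    index_lookup_error = exc.args[0]

# Keys taken from the candidates themselves work regardless of naming.
name_keys = list(candidates.keys())
latency = expected_latency(probs, candidates, name_keys)

print("index lookup failed on key:", index_lookup_error)
print("expected latency with name keys:", latency)
```

Under these assumptions, a fix would iterate over the candidate names instead of `range(len(probs))`. Until then, passing the candidates to nn.LayerChoice as a plain list is the form this report describes as working.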

Environment:

  • NNI version: 2.8
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Windows 10
  • Python version: 3.10
  • PyTorch version: 1.12
  • Is conda/virtualenv/venv used?: Pipenv
  • Is running in Docker?: No

ultmaster commented 2 years ago

This is indeed a mishandled case.

However, ProxylessTrainer has been deprecated, so we don't have the capacity to fix this issue. That is unfortunate, but you are welcome to fix it and contribute the change back if you are interested.

matluster commented 2 years ago

You might want to try the latest version (v2.9).

scarlett2018 commented 1 year ago

@AL3708, have you had a chance to upgrade NNI to 2.9?

Lijiaoa commented 1 year ago

Feel free to reopen if you have any other questions, @AL3708.