AL3708 opened this issue 2 years ago
This is indeed a mishandled case.
However, ProxylessTrainer has been deprecated, so we don't have the bandwidth to fix this issue. This is unfortunate, but you are welcome to try fixing it and contribute the fix back if you are interested.
You might want to try the latest version (v2.9).
ProxylessTrainer forces you to pass the candidate ops to nn.LayerChoice as a list; an OrderedDict cannot be used. That's because the latency predictor uses each op's position in the candidate list as its name (the lookup key is `str(i)`), so named candidates such as '3x3' are never found. This is inconsistent with the documentation, which says that either form can be used.
For example, if the following block is used:
```python
from collections import OrderedDict

import nni.retiarii.nn.pytorch as nn


class SearchBlock(nn.Module):
    """Searchable block. ConvBlock is a standard Conv-BN-act module (definition not shown)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Named candidates via OrderedDict: the case ProxylessTrainer mishandles
        self.block = nn.LayerChoice(OrderedDict([
            ('3x3', ConvBlock(in_channels, out_channels, kernel_size=3)),
            ('1x3', ConvBlock(in_channels, out_channels, kernel_size=(1, 3))),
            ('3x1', ConvBlock(in_channels, out_channels, kernel_size=(3, 1))),
            ('3x3_sep', ConvBlock(in_channels, out_channels, kernel_size=3, groups=in_channels)),
            ('identity', nn.Identity()),
        ]))
```
Then an error is thrown:
```
Traceback (most recent call last):
  File "C:\Users\...\proxylessnas.py", line 373, in <module>
    main()
  File "C:\Users\...\proxylessnas.py", line 359, in main
    trainer.fit()
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 363, in fit
    self._train_one_epoch(i)
  File "C:\Users\...\proxylessnas.py", line 295, in _train_one_epoch
    logits, loss = self._logits_and_loss_for_arch_update(val_X, val_y)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 330, in _logits_and_loss_for_arch_update
    expected_latency = self.latency_estimator.cal_expected_latency(current_architecture_prob)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in cal_expected_latency
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in <listcomp>
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
KeyError: '0'
```
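The failing line looks candidates up as `block_latency_table[module_name][str(i)]`, i.e. by position. Below is a minimal pure-Python sketch of that mismatch, assuming the latency table is keyed by the names LayerChoice gives its candidates (the OrderedDict keys, or '0', '1', ... for a plain list). The helper names and latency values are hypothetical and are not NNI's API; looking candidates up by name is one possible fix, not necessarily how the project would resolve it.

```python
from collections import OrderedDict
from typing import Dict, List, Sequence

# Hypothetical helpers illustrating the bug; they are NOT NNI's API.


def candidate_names(candidates) -> List[str]:
    """Names assumed to key the latency table: OrderedDict keys,
    or '0', '1', ... when the candidates are a plain list."""
    if isinstance(candidates, OrderedDict):
        return list(candidates.keys())
    return [str(i) for i in range(len(candidates))]


def expected_latency_by_index(probs: Sequence[float], table: Dict[str, float]) -> float:
    # Mirrors the failing line: each candidate is looked up by its position str(i).
    return sum(p * table[str(i)] for i, p in enumerate(probs))


def expected_latency_by_name(probs: Sequence[float], names: Sequence[str],
                             table: Dict[str, float]) -> float:
    # Possible fix: look each candidate up by the name it was registered under.
    return sum(p * table[name] for name, p in zip(names, probs))


# Named candidates, as in the OrderedDict example (modules omitted).
ops = OrderedDict([("3x3", None), ("1x3", None), ("identity", None)])
names = candidate_names(ops)
table = {"3x3": 2.0, "1x3": 1.0, "identity": 0.0}  # made-up latencies
probs = [0.5, 0.25, 0.25]

try:
    expected_latency_by_index(probs, table)
except KeyError as err:
    print("lookup by index fails:", err)  # prints: lookup by index fails: '0'

print(expected_latency_by_name(probs, names, table))  # prints: 1.25

# With a plain list the names are '0', '1', ..., so the index lookup works.
list_table = {"0": 2.0, "1": 1.0, "2": 0.0}
print(expected_latency_by_index(probs, list_table))  # prints: 1.25
```

This also shows why passing a list works as a workaround: positional names and positional lookups coincide.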
Environment:
- NNI version: 2.8
- Training service (local|remote|pai|aml|etc): local
- Client OS: Windows 10
- Python version: 3.10
- PyTorch version: 1.12
- Is conda/virtualenv/venv used?: Pipenv
- Is running in Docker?: No
@AL3708 - have you had a chance to upgrade NNI to 2.9?
Feel free to reopen if you have any other questions. @AL3708