validation_step doesn't work when using lightning

SunlightBoi commented 2 years ago

Describe the issue: The validation_step() function in my customized pytorch_lightning module is never called, which causes a KeyError: 'val_loss' in the teardown() function. Here is my customized module:

import nni
import torch
import nni.retiarii.evaluator.pytorch.lightning as pl
import nni.retiarii.nn.pytorch as nn
from nni.retiarii.experiment.pytorch import RetiariiExperiment
from nni.retiarii.strategy import Proxyless
# Module
@nni.trace
class MyModule(pl.LightningModule):
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        y = self.model(x)
        return y

    def shared_step(self, batch, stage):
        x, y = batch
        y_hat = self.model(x)
        loss = self.loss(y_hat, y.squeeze(1))
        self.log(f'{stage}_loss', loss, prog_bar=True)
        return loss

    def training_step(self, batch, batch_idx):
        return self.shared_step(batch, 'train')

    def validation_step(self, batch, batch_idx):
        print('validation_step')
        return self.shared_step(batch, 'val')

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.model.parameters(), lr = self.lr)
        return opt

    def on_validation_epoch_end(self):
        nni.report_intermediate_result(self.trainer.callback_metrics['val_loss'].item())

    def teardown(self, stage):
        if stage == 'fit':
            nni.report_final_result(self.trainer.callback_metrics['val_loss'].item())

part of main:

base_model = MyModel(1, 64)
evaluator = pl.Lightning(
    lightning_module=MyModule(lr=1e-4), 
    trainer=pl.Trainer(
        max_epochs=1, 
        devices=[0], 
        accelerator='gpu',
        log_every_n_steps=1,
        ), 
    train_dataloaders=datamodule.train_dataloader(), # len 9
    val_dataloaders=datamodule.val_dataloader() # len 1
)
strategy = Proxyless()
experiment = RetiariiExperiment(base_model, evaluator, [], strategy)
experiment.run()

console information:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]

  | Name  | Type     | Params
-----------------------------------
0 | model | MyModule | 131 K 
-----------------------------------
131 K     Trainable params
0         Non-trainable params
131 K     Total params
0.527     Total estimated model params size (MB)
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████| 9/9 [00:26<00:00,  2.99s/it, v_num=2, train_loss=4.810]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████| 9/9 [00:26<00:00,  2.99s/it, v_num=2, train_loss=4.810]
Traceback (most recent call last):
  File "/home/aa/main.py", line 138, in <module>
    experiment.run()
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/nni/nas/experiment/pytorch.py", line 280, in run
    self.strategy.run(base_model_ir, self.applied_mutators)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/nni/nas/oneshot/pytorch/strategy.py", line 89, in run
    evaluator.trainer.fit(self.model, train_loader, val_loader)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1180, in _run
    self._call_teardown_hook()
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1477, in _call_teardown_hook
    self._call_lightning_module_hook("teardown", stage=fn)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/aa/anaconda3/envs/nni/lib/python3.8/site-packages/nni/nas/oneshot/pytorch/base_lightning.py", line 404, in teardown
    return self.model.teardown(stage)
  File "/home/aa/main.py", line 113, in teardown
    nni.report_final_result(self.trainer.callback_metrics['val_loss'].item())
KeyError: 'val_loss'

Environment:

NNI version: 2.9
Training service (local|remote|pai|aml|etc): local
Client OS: ubuntu 18.04
Server OS (for remote mode only):
Python version: 3.8
PyTorch/TensorFlow version: 1.12.1 (pytorch-lightning 1.7.7)
Is conda/virtualenv/venv used?: conda
Is running in Docker?: no

ultmaster commented 2 years ago

I'm not sure whether the train and val dataloaders here have been properly serialized and sent to trials, because neither of them is wrapped with nni.trace. As you can see in the documentation, all the parameters of pl.Lightning need to be traceable, so that they can be restored in another process.

One more tip: you can pass the datamodule directly rather than the train/val dataloaders, because pl.Lightning supports fit_kwargs.

If everything is configured correctly, validation_step should run.
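To illustrate why traceability matters, here is a minimal, library-free sketch of the idea: a traced object remembers its constructor arguments, so another process can rebuild it from a small recipe. This is not NNI's implementation; `trace`, `dump`, `load`, `REGISTRY`, `ToyLoader` and `UntracedLoader` are hypothetical names made up for the example.

```python
import json

# Hypothetical registry mapping class names to classes, so a recipe can be replayed.
REGISTRY = {}


def trace(cls):
    """Toy stand-in for a tracing decorator (an assumption, not NNI's code)."""
    original_init = cls.__init__

    def recording_init(self, *args, **kwargs):
        # Remember how the object was built so it can be rebuilt elsewhere.
        self._recipe = {"cls": cls.__name__, "args": list(args), "kwargs": kwargs}
        original_init(self, *args, **kwargs)

    cls.__init__ = recording_init
    REGISTRY[cls.__name__] = cls
    return cls


def dump(obj):
    """Serialize the construction recipe; fail loudly for untraced objects."""
    recipe = getattr(obj, "_recipe", None)
    if recipe is None:
        raise TypeError(f"{type(obj).__name__} was not traced and cannot be rebuilt")
    return json.dumps(recipe)


def load(payload):
    """Rebuild an object from its recipe, as a separate trial process would."""
    recipe = json.loads(payload)
    return REGISTRY[recipe["cls"]](*recipe["args"], **recipe["kwargs"])


@trace
class ToyLoader:
    def __init__(self, path, batch_size=1):
        self.path = path
        self.batch_size = batch_size


class UntracedLoader:
    def __init__(self, path):
        self.path = path


# A traced loader survives the round trip.
restored = load(dump(ToyLoader("data/train", batch_size=8)))

# An untraced loader cannot be serialized at all.
try:
    dump(UntracedLoader("data/val"))
    untraced_error = None
except TypeError as exc:
    untraced_error = str(exc)
```

In this sketch the untraced loader fails at serialization time, which is roughly the kind of problem to look for when trials exit without training.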

ultmaster commented 2 years ago

Oh, sorry, I didn't realize you are using Proxyless. I thought it was multi-trial. Proxyless doesn't use validation_step by default, because it always needs gradients. This is by design.
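Since validation_step is skipped, 'val_loss' is never logged, which is why the lookup in teardown() raises KeyError. A minimal sketch of a defensive lookup follows, assuming callback_metrics behaves like a plain dict; `read_metric` is a hypothetical helper, and the dict below only simulates what the reported run logged.

```python
def read_metric(callback_metrics, key, default=None):
    """Return a logged metric as a plain number, or `default` if it was never logged."""
    value = callback_metrics.get(key, default)
    # Lightning stores tensors, which expose .item(); plain numbers are accepted
    # as well so that this sketch runs without torch installed.
    return value.item() if hasattr(value, "item") else value


# Simulated metrics after the Proxyless run above: only the training loss exists.
metrics = {"train_loss": 4.81}

val_loss = read_metric(metrics, "val_loss")      # missing key, no KeyError
train_loss = read_metric(metrics, "train_loss")  # logged by training_step
```

Inside teardown() the same pattern would let the module skip reporting when the metric is absent instead of crashing.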

Doc improvements are always welcome.

SunlightBoi commented 2 years ago

> Oh, sorry, I didn't realize you are using Proxyless. I thought it was multi-trial. Proxyless doesn't use validation_step by default, because it always needs gradients. This is by design.
>
> Doc improvements are always welcome.

Does that mean I can just remove validation_step() and on_validation_epoch_end(), and change 'val_loss' to 'train_loss'? But how can I validate the model after every training epoch with Proxyless? Is there an example? Besides, I tried a multi-trial strategy as follows:

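from nni.retiarii.experiment.pytorch import RetiariiExeConfig
from nni.retiarii.strategy import Random

strategy = Random()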
experiment = RetiariiExperiment(base_model, evaluator, [], strategy)
config = RetiariiExeConfig('local')
config.trial_concurrency = 2
config.max_trial_number = 4
config.trial_gpu_number = 1
config.training_service.use_active_gpu = True
experiment.run(config)
experiment.stop()

but the search process finished immediately without any training:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:08:52] Creating experiment, Experiment ID: 2tr6bu8j
[2022-10-16 22:08:52] Starting web server...
[2022-10-16 22:08:53] Setting up...
[2022-10-16 22:08:53] Web portal URLs: http://127.0.0.1:8080 http://192.168.123.12:8080
[2022-10-16 22:08:53] Dispatcher started
[2022-10-16 22:08:53] Start strategy...
[2022-10-16 22:08:53] Successfully update searchSpace.
[2022-10-16 22:08:53] Random search running in fixed size mode. Dedup: on.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:09:43] Strategy exit
[2022-10-16 22:09:43] Search process is done, the experiment is still alive, `stop()` can terminate the experiment.
[2022-10-16 22:09:43] Stopping experiment, please wait...
[2022-10-16 22:09:43] Dispatcher exiting...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:10:04] Dispatcher terminiated
[2022-10-16 22:10:04] Experiment stopped

I also got the additional message '1 failed models are ignored. will retry' when I set the strategy to RegularizedEvolution.

matluster commented 2 years ago

> Does that mean I can just remove validation_step() and on_validation_epoch_end(), and change 'val_loss' to 'train_loss'?

You don't need to call "report_xxx_result" with a one-shot strategy. It's not used.

> But how can I validate the model after every training epoch with Proxyless? Is there an example?

You have to retrain. See this tutorial.

> the search process finished immediately without any training:

As I said previously, your serialization might not be correct. Try adding nni.trace as I suggested.

SunlightBoi commented 2 years ago

> > Does that mean I can just remove validation_step() and on_validation_epoch_end(), and change 'val_loss' to 'train_loss'?
>
> You don't need to call "report_xxx_result" with a one-shot strategy. It's not used.
>
> > But how can I validate the model after every training epoch with Proxyless? Is there an example?
>
> You have to retrain. See this tutorial.
>
> > the search process finished immediately without any training:
>
> As I said previously, your serialization might not be correct. Try adding nni.trace as I suggested.

Thanks. It's really helpful

ekurtgl commented 1 year ago

Hi @matluster, @ultmaster,

I am facing the same problem with DARTS. It doesn't call the validation_step() function. Is there a way to evaluate our model after every epoch or step? Thank you.

microsoft / nni

validation_step doesn't work when using lightning #5160