Closed: SunlightBoi closed this issue 2 years ago.
I'm not sure whether the train and val dataloaders here have been properly serialized and sent to the trials, because none of them is wrapped with `nni.trace`. As you can see in the documentation, all the parameters of `pl.Lightning` need to be traceable, so that they can be restored in another process.
One more tip: you can actually use a datamodule directly rather than train/val dataloaders, because `pl.Lightning` supports `fit_kwargs`.
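For context, `nni.trace` works by recording how an object was constructed, so that the trial process can replay the same constructor call and re-create the object. Here is a minimal conceptual sketch in plain Python of that idea (this is not NNI's actual implementation; `Loader` is a hypothetical stand-in for a real `DataLoader`):

```python
class Loader:
    # Hypothetical stand-in for torch.utils.data.DataLoader.
    def __init__(self, data, batch_size=1):
        self.data = data
        self.batch_size = batch_size

def trace(cls):
    # Toy version of the nni.trace idea: remember the constructor call
    # so the object can be re-created later, e.g. inside a trial process.
    def traced(*args, **kwargs):
        obj = cls(*args, **kwargs)
        obj._trace = (cls, args, kwargs)
        return obj
    return traced

loader = trace(Loader)([1, 2, 3], batch_size=2)

# A trial process that receives the recorded call can rebuild the object:
cls, args, kwargs = loader._trace
rebuilt = cls(*args, **kwargs)
assert rebuilt.batch_size == loader.batch_size == 2
```

An object constructed without such a wrapper carries no record of its constructor arguments, which is why an un-traced dataloader cannot be restored on the trial side.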
If everything is configured correctly, validation_step should run.
Oh, sorry, I didn't realize you are using Proxyless; I thought it was multi-trial. Proxyless doesn't use `validation_step` by default, because it always needs gradients. This is by design.
Doc improvements are always welcome.
Does that mean I can just remove `validation_step()` and `on_validation_epoch_end()`, and change `'val_loss'` to `'train_loss'`? But how can I validate the model after every training epoch with Proxyless? Is there an example? Besides, I tried the multi-trial strategy as follows:
```python
strategy = Random()
experiment = RetiariiExperiment(base_model, evaluator, [], strategy)
config = RetiariiExeConfig('local')
config.trial_concurrency = 2
config.max_trial_number = 4
config.trial_gpu_number = 1
config.training_service.use_active_gpu = True
experiment.run(config)
experiment.stop()
```
but the search process finished immediately without any training:
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:08:52] Creating experiment, Experiment ID: 2tr6bu8j
[2022-10-16 22:08:52] Starting web server...
[2022-10-16 22:08:53] Setting up...
[2022-10-16 22:08:53] Web portal URLs: http://127.0.0.1:8080 http://192.168.123.12:8080
[2022-10-16 22:08:53] Dispatcher started
[2022-10-16 22:08:53] Start strategy...
[2022-10-16 22:08:53] Successfully update searchSpace.
[2022-10-16 22:08:53] Random search running in fixed size mode. Dedup: on.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:09:43] Strategy exit
[2022-10-16 22:09:43] Search process is done, the experiment is still alive, `stop()` can terminate the experiment.
[2022-10-16 22:09:43] Stopping experiment, please wait...
[2022-10-16 22:09:43] Dispatcher exiting...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2022-10-16 22:10:04] Dispatcher terminiated
[2022-10-16 22:10:04] Experiment stopped
```
I also got an additional message, `1 failed models are ignored. will retry`, when I set the strategy to `RegularizedEvolution`.
> Does that mean I can just remove `validation_step()` and `on_validation_epoch_end()`, and change `'val_loss'` to `'train_loss'`?

You don't need to `report_xxx_result` in a one-shot strategy. It's not used.

> But how can I validate the model after every training epoch with Proxyless? Is there an example?

You have to retrain. See this tutorial.

> the search process finished immediately without any training:

As I said previously, your serialization might not be correct. Try adding some `nni.trace` calls following my suggestion.
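One quick way to test the serialization hypothesis before launching the experiment (an illustrative debugging step, not an NNI API) is to round-trip the objects you hand to the evaluator through `pickle` in the launching process; anything that fails or loses information here cannot be shipped intact to a trial process:

```python
import pickle

# Stand-in payload; in real code you would round-trip the actual
# dataloader/datamodule arguments passed to the evaluator (assumption:
# these names are illustrative, not from the original post).
evaluator_kwargs = {"batch_size": 64, "max_epochs": 10}

blob = pickle.dumps(evaluator_kwargs)   # what gets sent to the trial
restored = pickle.loads(blob)           # what the trial would see
assert restored == evaluator_kwargs
```

If the round trip raises or the restored object is missing state, that points to exactly the kind of un-traced construction the reply above describes.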
Thanks, that's really helpful.
Hi @matluster, @ultmaster,
I am facing the same problem with DARTS. It doesn't call the `validation_step()` function. Is there a way to evaluate the model after every epoch or step? Thank you.
Describe the issue: The `validation_step()` function in my customized pytorch_lightning module never runs, which causes a `KeyError: 'val_loss'` in the `teardown()` function. Here is my customized module:
part of main:
console information:
Environment: