microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

NNIv3.0: torch.nn modules not serializable #5594

Closed chachus closed 1 year ago

chachus commented 1 year ago

Describe the issue: When running the Hello NAS tutorial for 3.0rc1, I get the following error:

ValueError: Object Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1)) needs to be serializable but `trace_kwargs` is not available. If it is a built-in module (like Conv2d), please import it from retiarii.nn. If it is a customized module, please to decorate it with @basic_unit. For other complex objects (e.g., trainer, optimizer, dataset, dataloader), try to use @nni.trace.

Doing as the error suggests doesn't work, since retiarii is not supported in v3.0, and trying to import from it raises an error.

Environment:
NNI version: 3.0rc1
Training service (local|remote|pai|aml|etc): local
Client OS: Ubuntu
Server OS (for remote mode only):
Python version: 3.10
PyTorch/TensorFlow version: PyTorch 1.13
Is conda/virtualenv/venv used?: yes
Is running in Docker?: no

Log message: experiment.log console.txt nnimanager.log

How to reproduce it?: Run the Hello NAS! tutorial for v3.0rc1

ultmaster commented 1 year ago

Are you suggesting that Hello NAS tutorial is not runnable on NNI v3.0? Are you sure you are using both the latest NNI and latest tutorial? Could you post the link?

chachus commented 1 year ago

Yes, I installed it through pip, version 3.0rc1. If I print the version, it's correct. The tutorial I used is https://nni.readthedocs.io/en/v3.0rc1/tutorials/hello_nas.html. I tried to run it like this

# config = NasExperimentConfig("ts", "graph", "local")
# config.experiment_name = "mnist_test"
# config.max_trial_number = 3  # spawn 3 trials at most
# config.trial_concurrency = 1  # will run 1 trial concurrently
# config.trial_gpu_number = 0
# config.training_service.use_active_gpu = True
exp = NasExperiment(model, evaluator, search_strategy)

and that was probably the cause. I suspected there was an error in the automatic creation of the config inside the NasExperiment class. But when I used NasExperimentConfig.default(), I got the error reported in the other issues I posted. If I use the tutorial as is, this is what happens:

[2023-06-02 10:47:01] Config is not provided. Will try to infer.
[2023-06-02 10:47:01] Using execution engine based on training service. Trial concurrency is set to 1.
[2023-06-02 10:47:01] Using simplified model format.
[2023-06-02 10:47:01] Using local training service.
[2023-06-02 10:47:01] WARNING: GPU found but will not be used. Please set `experiment.config.trial_gpu_number` to the number of GPUs you want to use for each trial.
[2023-06-02 10:47:01] Creating experiment, Experiment ID: 1v85b07z
[2023-06-02 10:47:02] Starting web server...
[2023-06-02 10:47:05] ERROR: Create experiment failed: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbe3cb8a860>: Failed to establish a new connection: [Errno 111] Connection refused'))

The warning "GPU found but will not be used" appears even though trial_gpu_number is set to 1. After this, the PC slows down because the process is not killed even after the error (the port remains occupied), and it leaves a ton of zombie collect_gpu_info processes, as reported in the other issues.
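For anyone trying to confirm the "port still occupied" symptom described here, the following is a minimal sketch using only the Python standard library. The helper name `port_in_use` is hypothetical and not part of NNI; the port number 8081 is taken from the log above. It only checks whether something is accepting TCP connections on the port, which is an assumption about how the leftover process behaves.

```python
import socket


def port_in_use(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper, not an NNI API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        # (e.g. "connection refused" when nothing is listening).
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8081 is the port that appears in the error log of this thread.
    print("port 8081 in use:", port_in_use(8081))
```

If this reports the port as in use after the experiment has failed, a leftover process is probably still holding it.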

ultmaster commented 1 year ago

I think the problem might be the "graph" model format in your configuration. Is there any specific reason to use it? To use the graph format, you might need to use the legacy Conv2d API instead of MutableConv2d.

My suggestion is to use "simplified" model format instead. See https://nni.readthedocs.io/en/latest/nas/execution_engine.html
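A rough sketch of what that suggestion could look like, reusing the configuration lines posted earlier in this thread with "graph" swapped for "simplified". The function name `build_simplified_experiment` is hypothetical. The import path `nni.nas.experiment` and the `config=` keyword argument of `NasExperiment` are assumptions about NNI 3.0 and should be checked against the installed version.

```python
def build_simplified_experiment(model_space, evaluator, strategy):
    """Create a NAS experiment that uses the "simplified" model format.

    Hypothetical helper; the arguments are the same objects the tutorial
    passes to NasExperiment.
    """
    # Assumption: NNI 3.0 exposes both classes from nni.nas.experiment.
    # Imported lazily so the function can be defined without NNI installed.
    from nni.nas.experiment import NasExperiment, NasExperimentConfig

    # Same positional arguments as in the report above,
    # with "graph" replaced by "simplified".
    config = NasExperimentConfig("ts", "simplified", "local")
    config.experiment_name = "mnist_test"
    config.max_trial_number = 3   # spawn 3 trials at most
    config.trial_concurrency = 1  # run 1 trial at a time
    config.trial_gpu_number = 0

    # Assumption: the config is accepted through a `config` keyword argument.
    return NasExperiment(model_space, evaluator, strategy, config=config)
```

According to the log posted above, omitting the config entirely also results in the simplified model format being used, so the explicit config is only needed when other settings have to be changed.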

qmpzzpmq commented 1 year ago

Also, lightning.pytorch.trainer.trainer.Trainer is not supported in 3.0 either.

ultmaster commented 1 year ago

Could you elaborate on why it's not supported?

Lijiaoa commented 1 year ago

Do you have any updates on this? @chachus

chachus commented 1 year ago

Yes, the problem was using config = NasExperimentConfig("ts", "graph", "local"). But I still encounter the bug referred to in https://github.com/microsoft/nni/issues/5574, so I didn't run any other tests and am waiting for the new version of NNI v3. I'll close this ticket.