Open dtamienER opened 2 months ago
I got it working using a devcontainer with NNI version 2.7.
Dockerfile
FROM msranni/nni:v2.7
RUN pip install matplotlib tensorflow_datasets dill
I am still having problems with version 3.0.
I have a somewhat similar issue.
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
When I run it with this config, the code runs on the CPU. But when I run it with the following config:
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
  gpuNum: 1
localConfig:
  useActiveGpu: false
It creates 800+ Python files and the WebUI link no longer opens. Either my PC crashes (because of all those files) or the page shows "Running 0". Why?
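One way to see whether a trial was actually handed a GPU is to log the CUDA_VISIBLE_DEVICES environment variable at the top of train.py. The sketch below assumes that the local training service passes the assigned GPU to the trial process through that variable; the helper name `visible_gpu_ids` is hypothetical and not part of NNI.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction applied),
    and an empty list when it is empty or set to "-1" (all GPUs hidden).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    ids = []
    for part in raw.split(","):
        part = part.strip()
        # CUDA stops enumerating at the first invalid entry such as "-1".
        if not part or part == "-1":
            break
        ids.append(part)
    return ids


if __name__ == "__main__":
    # Put this near the top of train.py and check the trial's stdout/stderr log.
    print("GPU ids visible to this trial:", visible_gpu_ids())
```

If this prints an empty list while gpuNum is set to 1, the trial is being started without a GPU, which would match the "runs on CPU" symptom.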
I am having the same problem as Rajesh90123.
Description of the issue
I cannot run any experiment on GPU.
I have tried with a Tesla P4, a P100, and a GTX 1060. I can only make it work on CPU.
I have tried many configs, setting useActiveGpu to True or False, trialGpuNumber to 1, and gpuIndices to '0'. However, it never completes the training of even a single architecture.
I have tried both outside and inside a Docker container.
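Before changing more config options, it may help to confirm that nvidia-smi can be called from the same environment (host or container) in which nnictl is started. This is a minimal sketch under the assumption that GPU discovery depends on nvidia-smi being on the PATH; the function name `query_gpus` is hypothetical.

```python
import shutil
import subprocess


def query_gpus(timeout=10):
    """Return one CSV line per GPU reported by nvidia-smi.

    Returns None if nvidia-smi is not on the PATH, fails, or times out.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=index,name,memory.used", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is not available or failed in this environment")
    else:
        for line in gpus:
            print(line)
```

If this returns None inside the container but works on the host, the container was probably started without GPU access, and the problem is upstream of the NNI config.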
Configuration
nni/examples/trials/mnist-pytorch/config.yml
Outside a Docker container
Environment
Log message
nnimanager.log
There, the GPU information cannot be retrieved.
experiment.log
There is a timeout because the data cannot be retrieved.
Inside a Docker container
Dockerfile
Log message
nnimanager.log
experiment.log
When I'm using CPU only:
I get everything I would want to get with the GPU: the WebUI, the experiment trials, and so on.
How to reproduce it?
If from a Docker container:
Then in both cases:
As a result, the WebUI won't start because of a timeout while trying to retrieve data, since the experiment won't load on the GPU.
Notes