idekazuki opened this issue 4 years ago
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_apex.html
Following the guide above, I set up the environment on miniconda. Beforehand, I added `export CUDA_HOME="/usr/local/cuda-9.0"` to `.bashrc` to bring the CUDA version up to 9 or later, which Apex requires.
Installation worked with `conda install nvidia-apex`, not `conda install apex`.
As a test, I trained DCGAN.

```
--opt_level O1
[24/25][780/782] Loss_D: 0.0161 Loss_G: 5.2606 D(x): 6.3242 D(G(z)): -4.8828 / -5.2500
[24/25][781/782] Loss_D: 0.0085 Loss_G: 7.6221 D(x): 5.2227 D(G(z)): -7.6641 / -7.6211
170500096it [17:06, 166149.96it/s]
```

--opt_level O0
```
python dcgan.py --batch_size 256 --ngpu 4
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext.  Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
```
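The `amp_C` warning above suggests the conda package was built without the fused kernels. One possible fix is rebuilding apex from source with the extensions enabled; the commands below are the build steps from NVIDIA's apex README (not something I ran in this environment), and they require a CUDA toolkit matching the PyTorch build:

```shell
# Build apex from source with the fused C++/CUDA extensions,
# per the install instructions in NVIDIA's apex README.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```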
Wall-clock time per run:

| opt_level | time |
| --- | --- |
| O1 | 06:06.48 |
| O0 | 06:32.65 |
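Both logs report `loss_scale : dynamic`. As a rough illustration of what dynamic loss scaling does, here is a simplified sketch of the idea (not apex's actual implementation; class name and parameters are my own):

```python
# Simplified sketch of dynamic loss scaling, the idea behind amp's
# "loss_scale: dynamic": scale the loss up so fp16 gradients don't
# underflow, back off whenever the scaled gradients overflow.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale          # loss is multiplied by this before backward()
        self.growth_interval = growth_interval
        self.good_steps = 0              # consecutive steps without overflow

    def update(self, found_overflow):
        if found_overflow:
            # inf/nan gradients: halve the scale (and the step is skipped)
            self.scale /= 2.0
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # a long run of stable steps: try a larger scale again
                self.scale *= 2.0
                self.good_steps = 0

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
scaler.update(True)    # overflow: scale drops from 8.0 to 4.0
scaler.update(False)
scaler.update(False)   # two clean steps: scale grows back to 8.0
```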
Running again from the beginning:
```
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Traceback (most recent call last):
  File "apex_run.py", line 150, in <module>
    net, optimizer = amp.initialize(net, optimizer, opt_level='O1')
  File "/home/yanai-lab/ide-k/ide-k/pyenv/apex/lib/python3.6/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/yanai-lab/ide-k/ide-k/pyenv/apex/lib/python3.6/site-packages/apex/amp/_initialize.py", line 167, in _initialize
    check_models(models)
  File "/home/yanai-lab/ide-k/ide-k/pyenv/apex/lib/python3.6/site-packages/apex/amp/_initialize.py", line 74, in check_models
    "Parallel wrappers should only be applied to the model(s) AFTER \n"
RuntimeError: Incoming model is an instance of torch.nn.parallel.DataParallel. Parallel wrappers should only be applied to the model(s) AFTER
the model(s) have been returned from amp.initialize.
```
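The error message states the fix: call `amp.initialize` on the bare model first, and wrap it in `DataParallel` afterwards. A minimal sketch of that ordering (toy `nn.Linear` model is my own; the import guard only lets the sketch load on machines without apex or a GPU):

```python
# Sketch of the fix for the RuntimeError above: amp.initialize must see
# the unwrapped model, and DataParallel is applied to what it returns.
try:
    import torch
    import torch.nn as nn
    from apex import amp
    HAVE_APEX = True
except ImportError:
    HAVE_APEX = False

def build_amp_model():
    """Initialize amp BEFORE applying any parallel wrapper."""
    net = nn.Linear(10, 2).cuda()                       # toy model for illustration
    opt = torch.optim.SGD(net.parameters(), lr=1e-3)
    # 1) hand amp the bare (unwrapped) model
    net, opt = amp.initialize(net, opt, opt_level='O1')
    # 2) only now wrap it for multi-GPU training
    net = nn.DataParallel(net)
    return net, opt

if HAVE_APEX and torch.cuda.is_available():
    net, opt = build_amp_model()
```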
O2: time: 412.0059335231781
O1: time: 479.77091670036316