mit-han-lab / data-efficient-gans

[NeurIPS 2020] Differentiable Augmentation for Data-Efficient GAN Training
https://arxiv.org/abs/2006.10738
BSD 2-Clause "Simplified" License

nvcc fails on windows #47

Open TashaSkyUp opened 3 years ago

TashaSkyUp commented 3 years ago

op.h exists nowhere in my Python environment. Where do I get it?

(tf115py37) c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2>python run_low_shot.py --dataset=100-shot-obama --resolution=64
2020-12-20 02:40:10.999131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
Local submit - run_dir: results\00018-DiffAugment-stylegan2-100-shot-obama-64-batch16-1gpu-color-translation-cutout
dnnlib: Running training.training_loop.training_loop() on localhost...
2020-12-20 02:40:13.518280: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-12-20 02:40:13.523440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-12-20 02:40:13.862532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.8
pciBusID: 0000:2d:00.0
2020-12-20 02:40:13.862964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: GeForce GTX 970 major: 5 minor: 2 memoryClockRate(GHz): 1.342
pciBusID: 0000:23:00.0
2020-12-20 02:40:13.864667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-12-20 02:40:13.867997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-12-20 02:40:13.870859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-12-20 02:40:13.872201: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-12-20 02:40:13.876274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-12-20 02:40:13.878809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-12-20 02:40:13.885505: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-12-20 02:40:13.885683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2020-12-20 02:40:17.775727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-20 02:40:17.775907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 1
2020-12-20 02:40:17.777511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N N
2020-12-20 02:40:17.777982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1: N N
2020-12-20 02:40:17.778962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8550 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:2d:00.0, compute capability: 8.6)
2020-12-20 02:40:17.781185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 3000 MB memory) -> physical GPU (device: 1, name: GeForce GTX 970, pci bus id: 0000:23:00.0, compute capability: 5.2)
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 64, 64]
Dynamic range = [0, 255]
Label size = 0
Constructing networks...
Setting up TensorFlow plugin "fused_bias_act.cu": Preprocessing... Failed!
Traceback (most recent call last):
  File "run_low_shot.py", line 171, in <module>
    main()
  File "run_low_shot.py", line 165, in main
    run(**vars(args))
  File "run_low_shot.py", line 94, in run
    dnnlib.submit_run(**kwargs)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\submission\submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\submission\internal\local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\submission\submit.py", line 280, in run_wrapper
    run_func_obj(**submit_config.run_func_kwargs)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\training_loop.py", line 155, in training_loop
    G = tflib.Network('G', num_channels=training_set.shape[0], resolution=training_set.shape[1], label_size=training_set.label_size, **G_args)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\network.py", line 97, in __init__
    self._init_graph()
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\network.py", line 154, in _init_graph
    out_expr = self._build_func(*self.input_templates, **build_kwargs)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\networks_stylegan2.py", line 195, in G_main
    components.synthesis = tflib.Network('G_synthesis', func_name=globals()[synthesis_func], **kwargs)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\network.py", line 97, in __init__
    self._init_graph()
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\network.py", line 154, in _init_graph
    out_expr = self._build_func(*self.input_templates, **build_kwargs)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\networks_stylegan2.py", line 396, in G_synthesis_stylegan2
    x = layer(x, layer_idx=0, fmaps=nf(1), kernel=3)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\networks_stylegan2.py", line 358, in layer
    x = modulated_conv2d_layer(x, dlatents_in[:, layer_idx], fmaps=fmaps, kernel=kernel, up=up, resample_kernel=resample_kernel, fused_modconv=fused_modconv, impl=impl)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\networks_stylegan2.py", line 106, in modulated_conv2d_layer
    s = apply_bias_act(s, bias_var=mod_bias_var, impl=impl) + 1 # [BI] Add bias (initially 1).
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\training\networks_stylegan2.py", line 72, in apply_bias_act
    return fused_bias_act(x, b=tf.cast(b, x.dtype), act=act, alpha=alpha, gain=gain, impl=impl)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\ops\fused_bias_act.py", line 68, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain)
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\ops\fused_bias_act.py", line 122, in _fused_bias_act_cuda
    cuda_kernel = _get_plugin().fused_bias_act
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\ops\fused_bias_act.py", line 16, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\custom_ops.py", line 115, in get_plugin
    _run_cmd(_prepare_nvcc_cli('"%s" --preprocess -o "%s" --keep --keep-dir "%s"' % (cuda_file, tmp_file, tmp_dir)))
  File "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\custom_ops.py", line 61, in _run_cmd
    raise RuntimeError('NVCC returned an error. See below for full command line and output log:\n\n%s\n\n%s' % (cmd, output))
RuntimeError: NVCC returned an error. See below for full command line and output log:

nvcc "c:\Programming\PythonNotebooks\DataEfficientGans\data-efficient-gans\DiffAugment-stylegan2\dnnlib\tflib\ops\fused_bias_act.cu" --preprocess -o "C:\Users\Tasha\AppData\Local\Temp\tmpjd6q1ev3\fused_bias_act_tmp.cu" --keep --keep-dir "C:\Users\Tasha\AppData\Local\Temp\tmpjd6q1ev3" --disable-warnings --include-path "C:\Users\Tasha.conda\envs\tf115py37\lib\site-packages\tensorflow_core\include" --include-path "C:\Users\Tasha.conda\envs\tf115py37\lib\site-packages\tensorflow_core\include\external\protobuf_archive\src" --include-path "C:\Users\Tasha.conda\envs\tf115py37\lib\site-packages\tensorflow_core\include\external\com_google_absl" --include-path "C:\Users\Tasha.conda\envs\tf115py37\lib\site-packages\tensorflow_core\include\external\eigen_archive" --include-path "C:\Users\Tasha\AppData\Local\Programs\Python\Python37\Lib\site-packages\tensorflow_core\include\tensorflow_core\core\framework" --compiler-bindir "C:/Program Files (x86)/Microsoft Visual Studio 14.0/vc/bin" 2>&1

fused_bias_act.cu
c:/Programming/PythonNotebooks/DataEfficientGans/data-efficient-gans/DiffAugment-stylegan2/dnnlib/tflib/ops/fused_bias_act.cu(9): fatal error C1083: Cannot open include file: 'tensorflow_core/core/framework/op.h': No such file or directory

TashaSkyUp commented 3 years ago

Actually, op.h exists in c:\Users\Tasha.conda\envs\tf115py37\Lib\site-packages\tensorflow-1.15.0.data\purelib\tensorflow_core\include\tensorflow_core\core\framework

The question is, why does c:/Programming/PythonNotebooks/DataEfficientGans/data-efficient-gans/DiffAugment-stylegan2/dnnlib/tflib/ops/fused_bias_act.cu

not find it?
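One way to narrow this down is to check which of the directories passed to nvcc via --include-path actually contain the header at the relative path named in the #include line. The sketch below is only a diagnostic aid, not part of the repository; `resolve_include` is a hypothetical helper name, and the directory list is assumed to be copied by hand from the nvcc command line in the error output.

```python
from pathlib import Path


def resolve_include(include_name, include_dirs):
    """Return the include directories under which `include_name` exists.

    `include_name` is the relative path inside the #include "..." line,
    e.g. "tensorflow_core/core/framework/op.h". `include_dirs` are the
    directories given to the compiler with --include-path.
    """
    hits = []
    for directory in include_dirs:
        candidate = Path(directory) / include_name
        if candidate.is_file():
            hits.append(str(directory))
    return hits


if __name__ == "__main__":
    # Hypothetical example: paste the --include-path values from your own log here.
    dirs = ["."]
    found = resolve_include("tensorflow_core/core/framework/op.h", dirs)
    print("resolved in:", found if found else "none of the include paths")
```

An empty result means the header exists somewhere on disk but not under any directory the compiler was told to search, which would match the C1083 error above.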

TashaSkyUp commented 3 years ago

I've been working on this for many hours. It's like all the includes in the two .cu files are pointing to the wrong location for the .h files, but then that collection of .h files also points to the same wrong directory for more .h files.

zsyzzsoft commented 3 years ago

There may be strange issues on Windows... If you are training from scratch, you may specify --impl=ref as a workaround.

Brand4 commented 3 years ago

> I've been working on this for many hours. It's like all the includes in the two .cu files are pointing to the wrong location for the .h files, but then that collection of .h files also points to the same wrong directory for more .h files.

I replaced "tensorflow/core/framework/op.h" with "c:\Users\Tasha.conda\envs\tf115py37\Lib\site-packages\tensorflow-1.15.0.data\purelib\tensorflow_core\include\tensorflow_core\core\framework\op.h", but then other .h files are not found. I have seen some suggested solutions to this problem, such as this, but they don't work. How did you solve it?

TashaSkyUp commented 3 years ago

@zsyzzsoft Thank you, that seems to do the trick. It appears to be training now. Can I suggest that a bit of information be added for Windows users, so other people can avoid digging through .h files needlessly for 6 hours?

zsyzzsoft commented 3 years ago

Yeah, the following note has been added:

If you are facing problems with nvcc (when building custom ops of StyleGAN2), this can be circumvented by specifying --impl=ref in training at the cost of a slightly longer training time.
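To illustrate why the flag sidesteps nvcc: the traceback above shows the op being looked up by name in a dictionary (`impl_dict[impl]`), and only the CUDA entry goes on to compile a plugin. The toy sketch below mimics that dispatch in plain Python. It is not the repository's code; the function names, the list-based arithmetic, and the default `alpha` and `gain` values are all assumptions made for illustration.

```python
import math


def _bias_act_ref(x, b, alpha=0.2, gain=math.sqrt(2)):
    """Reference path: ordinary arithmetic, nothing needs to be compiled."""
    out = []
    for value in x:
        value = value + b                                # add bias
        value = value if value >= 0 else alpha * value   # leaky ReLU
        out.append(value * gain)                         # rescale
    return out


def _bias_act_cuda(x, b, alpha=0.2, gain=math.sqrt(2)):
    """Stand-in for the compiled path; here it only simulates the build failure."""
    raise RuntimeError("NVCC returned an error (simulated)")


# Hypothetical dispatch table, analogous in spirit to impl_dict in the traceback.
IMPL_DICT = {"ref": _bias_act_ref, "cuda": _bias_act_cuda}


def bias_act(x, b, impl="cuda", **kwargs):
    """Pick an implementation by name, so impl='ref' never touches the compiler."""
    return IMPL_DICT[impl](x, b, **kwargs)


if __name__ == "__main__":
    print(bias_act([1.0, -1.0], 0.0, impl="ref", gain=1.0))
```

The trade-off mentioned in the note follows from this structure: the reference path avoids the build step but gives up the fused kernel.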

Brand4 commented 3 years ago

> There may be strange issues on Windows... If you are training from scratch, you may specify --impl=ref as a workaround.

When I specify --impl=ref in training, this issue is solved. However, after training finished, I used the model to generate new pictures and the issue appeared again ("op.h" file not found). The generation code is copied from your Colab tutorial. The issue also appears when using your pre-trained model. How can I solve this problem when I want to generate pictures on Windows? Thank you.

zsyzzsoft commented 3 years ago

> > There may be strange issues on Windows... If you are training from scratch, you may specify --impl=ref as a workaround.
>
> When I specify --impl=ref in training, this issue is solved. However, after training finished, I used the model to generate new pictures and the issue appeared again ("op.h" file not found). The generation code is copied from your Colab tutorial. The issue also appears when using your pre-trained model. How can I solve this problem when I want to generate pictures on Windows? Thank you.

Please add impl='ref' to the argument list of Gs.run
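A minimal sketch of that suggestion, assuming `Gs` is an already loaded generator network whose `run` method takes latents, labels, and extra keyword arguments. The wrapper name `run_generator_ref` is hypothetical; it only makes sure `impl='ref'` is passed unless the caller overrides it.

```python
def run_generator_ref(Gs, latents, labels=None, **run_kwargs):
    """Call Gs.run with impl='ref' by default.

    Gs is assumed to be a loaded generator object exposing
    run(latents, labels, **kwargs); any other keyword arguments
    are forwarded unchanged.
    """
    run_kwargs.setdefault("impl", "ref")
    return Gs.run(latents, labels, **run_kwargs)
```

With a real network this would be called as `images = run_generator_ref(Gs, latents)`, where `latents` is whatever array the generation script already builds.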

nigelparsad commented 3 years ago

I solved this error as follows:

This same error was observed using a Windows 10 Anaconda environment.

In op.h and other header files in Anaconda's Windows version of TF, a bunch of includes use this relative path:

#include "tensorflow/core/..

These lines must be manually changed to:

#include "tensorflow_core/core..

This should be clear from the Anaconda environment path to core: "tensorflow_core" is its parent directory.

I did a batch find and replace from the ..tensorflow_core/core/.. directory level, and both CUDA files compiled WITHOUT having to use the impl=ref flag. Training is faster.
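For anyone who prefers a script to an editor's find-and-replace, the same substitution could be sketched as below. This is an assumption-laden illustration, not the exact procedure used above: `rewrite_includes` is a hypothetical helper, it only touches `.h` files, and the root directory is whatever header tree you choose to pass in. It defaults to a dry run; back up the directory before writing.

```python
from pathlib import Path

# Byte strings so files are rewritten without changing encoding or line endings.
OLD = b'#include "tensorflow/core/'
NEW = b'#include "tensorflow_core/core/'


def rewrite_includes(root, dry_run=True):
    """Replace OLD with NEW in every .h file under `root`.

    Returns the list of files that contain OLD. With dry_run=True
    (the default) nothing is written, so the list can be reviewed first.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.h")):
        data = path.read_bytes()
        if OLD not in data:
            continue
        changed.append(str(path))
        if not dry_run:
            path.write_bytes(data.replace(OLD, NEW))
    return changed
```

Running it once with `dry_run=True` and inspecting the returned list, then again with `dry_run=False`, keeps the change reviewable.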

Unfortunately, the NCCL library is not available for Windows so multi-gpu training is limited.

Victoria-1 commented 2 years ago

> > > There may be strange issues on Windows... If you are training from scratch, you may specify --impl=ref as a workaround.
>
> > When I specify --impl=ref in training, this issue is solved. However, after training finished, I used the model to generate new pictures and the issue appeared again ("op.h" file not found). The generation code is copied from your Colab tutorial. The issue also appears when using your pre-trained model. How can I solve this problem when I want to generate pictures on Windows? Thank you.
>
> Please add impl='ref' to the argument list of Gs.run

I got "unistd.h: No such file or directory" on Windows. I tried adding impl='ref', but it did not work.