Training the problem - GitHub Issues

SHOUshou0426 commented 2 years ago

Hello, training on my own data stops at epoch 999 during "Evaluating k-NN accuracy" with the error ValueError: range() arg 3 must not be zero. Training on the AFHQ dataset fails with the same error: ValueError: range() arg 3 must not be zero

Traceback (most recent call last):
  File "train.py", line 258, in <module>
  File "train.py", line 254, in main
  File "train.py", line 190, in training_loop
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\metrics\knn_evaluator.py", line 69, in evaluate
    top1, top5 = knn_classifier(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\metrics\knn_evaluator.py", line 106, in knn_classifier
    for idx in range(0, num_test_images, imgs_per_chunk):
ValueError: range() arg 3 must not be zero

This is what I printed:

  num_test_images, num_chunks = test_labels.shape[0], 100
  num_test_images = 32

  imgs_per_chunk = num_test_images // num_chunks
  imgs_per_chunk = 0

Environment: torch=1.11.0+cu113, cuda=11.3

SHOUshou0426 commented 2 years ago

I used this command:

  python train.py --mod-type adain --total-nimg 1.6M --batch-size 4 --load-size 320 --crop-size 256 --image-size 256 --train-dataset datasets/l2l_cloth/train --eval-dataset datasets/l2l_cloth/val --out-dir runs --extra-desc some descriptions

kunheek commented 2 years ago

Hi, can you tell me how many images you are using for evaluation? I guess this happens when your validation set has fewer than 100 images.
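A note on the arithmetic behind this guess: with 32 validation images and 100 chunks, the floor division 32 // 100 is 0, and range() rejects a step of 0. The sketch below reproduces that calculation and shows one possible guard. The function name chunk_starts is hypothetical, and this is not necessarily how the repository handles it.

```python
from typing import List


def chunk_starts(num_test_images: int, num_chunks: int = 100) -> List[int]:
    """Return the start index of each evaluation chunk (hypothetical helper)."""
    # Floor division yields 0 when there are fewer images than chunks;
    # max(1, ...) keeps the range() step valid in that case.
    imgs_per_chunk = max(1, num_test_images // num_chunks)
    return list(range(0, num_test_images, imgs_per_chunk))


# Without the guard, 32 // 100 == 0 would make range() raise
# "ValueError: range() arg 3 must not be zero".
starts = chunk_starts(32)
print(len(starts))
```

With the guard, a validation set smaller than num_chunks is processed one image per chunk, which is slower but avoids the zero step.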

SHOUshou0426 commented 2 years ago

The training set has 130 images and the validation set has 32 images. But the same error also appears when I use the AFHQ dataset: ValueError: range() arg 3 must not be zero

SHOUshou0426 commented 2 years ago

When training on the AFHQ dataset, the error appears at epoch 19999.

kunheek commented 2 years ago

I see. Can you share the command you used for the AFHQ dataset? I will try to reproduce it myself. Until the problem is fixed, you can train your model without evaluation by adding --evaluation false to the command. You can evaluate it after training using the saved checkpoints.

By the way, because the method uses SwAV, I recommend a batch size larger than 4 (16 will be enough). Also, 130 images may not be enough if you are training a model from scratch.

SHOUshou0426 commented 2 years ago

The command I used for the AFHQ dataset is:

  python train.py --mod-type adain --total-nimg 1.6M --batch-size 16 --load-size 320 --crop-size 256 --image-size 256 --train-dataset datasets/afhq/train --eval-dataset datasets/afhq/val --out-dir runs --extra-desc some descriptions

Does the metrics command run without errors for you? I tried the command from the README:

  python -m metrics fid reconstruction --seed 123 --checkpoint ./checkpoints/afhq-stylegan2-5M.pt --train-dataset ./datasets/afhq/train --eval-dataset ./datasets/afhq/val

and got this error:

C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py:322: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified.
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s).
Traceback (most recent call last):
  File "C:\Users\yuanx.conda\envs\style2\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\yuanx.conda\envs\style2\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\metrics\__main__.py", line 88, in <module>
    main()
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\metrics\__main__.py", line 70, in main
    model = StyleAwareDiscriminator(opts)
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\mylib\base_model.py", line 23, in __init__
    self._create_networks()
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\model\model.py", line 72, in _create_networks
    self.G = Generator(
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\model\networks\generator.py", line 42, in __init__
    from .stylegan2_layers import EncodeBlock, StyleBlock
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\model\networks\stylegan2_layers.py", line 5, in <module>
    import model.networks.stylegan2_op as ops
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\model\networks\stylegan2_op\__init__.py", line 1, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "C:\Users\yuanx\Desktop\style\style-aware-discriminator\model\networks\stylegan2_op\fused_act.py", line 10, in <module>
    fused = load(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py", line 1456, in _write_ninja_file_and_build_library
    _write_ninja_file_to_build_library(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py", line 1898, in _write_ninja_file_to_build_library
    _write_ninja_file(
  File "C:\Users\yuanx.conda\envs\style2\lib\site-packages\torch\utils\cpp_extension.py", line 2023, in _write_ninja_file
    cl_paths = subprocess.check_output(['where',
  File "C:\Users\yuanx.conda\envs\style2\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\yuanx.conda\envs\style2\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.

kunheek commented 2 years ago

Note that the custom CUDA kernels only work on Linux. It seems that you are using Windows.
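For context, the traceback above ends in a failed `where cl` lookup, which means the MSVC compiler was not found on PATH when torch.utils.cpp_extension.load tried to JIT-compile the ops. A minimal sketch of a pre-flight check follows, assuming only the standard library. The function name jit_build_prerequisites is hypothetical and not part of the repository, and finding cl does not imply the kernels will work on Windows.

```python
import platform
import shutil


def jit_build_prerequisites() -> dict:
    """Report the platform and whether common build tools are on PATH."""
    return {
        "system": platform.system(),     # e.g. "Linux" or "Windows"
        "cl": shutil.which("cl"),        # MSVC compiler; None if not on PATH
        "ninja": shutil.which("ninja"),  # build tool required by cpp_extension.load
    }


info = jit_build_prerequisites()
if info["system"] == "Windows" and info["cl"] is None:
    # This is the situation shown in the traceback above.
    print("cl not found on PATH; JIT compilation of the custom ops would fail.")
```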

SHOUshou0426 commented 2 years ago

Does the command need to be modified?

SHOUshou0426 commented 2 years ago

My hardware is limited. Will using a batch size smaller than 16 affect the results?

kunheek commented 2 years ago

You can use --mod-type=adain, but I cannot guarantee that it will work, as I have never tested the code on Windows. I recommend running the code on Linux (you can use WSL if you are familiar with it).

In general, the larger the batch size, the better. I haven't tested the code with batch size smaller than 16, so I can't tell you the results of smaller batch sizes.

SHOUshou0426 commented 2 years ago

Training with --mod-type=adain works without problems, but the metrics command from the README fails.

kunheek commented 2 years ago

This is not because there is a problem with the command, but because '--mod-type' is set automatically according to the checkpoint used. The checkpoint 'afhq-stylegan2-5M.pt' is a model trained using --mod-type=stylegan2.

SHOUshou0426 commented 2 years ago

Thank you for the response. I will try WSL.

kunheek / style-aware-discriminator

Training the problem #6