FID calculation error: ValueError

sihouzi21c commented 12 months ago

After running 'search_dynamic_unet_imagenet64_classifier_guidance_progressive.sh', I get 'ValueError: Imaginary component', which is raised by the 'frechet_distance' function in 'evaluator_v1.py'. I worked around the problem by changing num_samples to 2048, since [1] says the number of samples should be greater than 2048 to avoid it. Is there any way to fix the problem while keeping num_samples at 1000?

[1] https://github.com/lucidrains/denoising-diffusion-pytorch/issues/213
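
One way to see why the sample count could matter, assuming the 2048-dimensional Inception pooling features that FID conventionally uses: a covariance matrix estimated from n samples has rank at most n - 1, so 1000 samples would leave a 2048 x 2048 covariance singular. The sketch below illustrates this with small, hypothetical sizes (10 or 200 samples, 32 features) instead of real activations; `covariance_rank` is an illustrative helper, not a function from the repository.

```python
import numpy as np


def covariance_rank(n_samples, n_features, seed=0):
    """Rank of the sample covariance of random Gaussian features.

    Hypothetical helper for illustration only; the sizes are stand-ins
    for (num_samples, feature dimension) in an FID evaluation.
    """
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(n_samples, n_features))
    # rowvar=False: each row is one sample, each column one feature.
    sigma = np.cov(feats, rowvar=False)
    return int(np.linalg.matrix_rank(sigma))


# Fewer samples than features: the covariance cannot be full rank.
print(covariance_rank(10, 32))
# More samples than features: the covariance is typically full rank.
print(covariance_rank(200, 32))
```

A singular covariance makes the matrix square root inside the Fréchet distance numerically fragile, which is consistent with the advice in [1].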

lilijiangg commented 12 months ago

Thank you for raising this issue. However, the author has never encountered this situation before. Could you please describe the problem in more detail? For instance, what were your hyperparameter settings, and which time-step sequence were you validating when the problem occurred?

sihouzi21c commented 12 months ago

In my experiment, the hyperparameters are almost the same as in 'search_dynamic_unet_imagenet64_classifier_guidance_progressive.sh'. The only difference is batch_size, which I changed because of limited GPU memory. The full command is:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"
SAMPLE_FLAGS="--batch_size 50 --num_samples 1000 --use_ddim True"
CUDA_VISIBLE_DEVICES=2 \
python search_dynamic_unet_imagenet64_classifier_guidance_progressive.py $MODEL_FLAGS --classifier_scale 1.0 \
    --classifier_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/checkpoints/64x64_classifier.pt --classifier_depth 4 \
    --model_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/checkpoints/64x64_diffusion.pt $SAMPLE_FLAGS \
    --ref_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/fidstats/imagenet_ref_stats.pkl \
    --save_dir '/home/hjsun/_quantization/_dm/AutoDiffusion-data/task/trysearch1' \
    --time_step 10 \
    --max_epochs 15 \
    --population_num 50 \
    --mutation_num 25 \
    --crossover_num 10 \
    --seed 0 \
    --m_prob 0.25 \
    --use_ddim_init_x True \
    --use_dynamic_unet True \
    --index_step 580 \
    --max_prun=0.1 \
    --min_prun=0.0 \
    --MASTER_PORT '12345'

The error occurred in the init stage, which uses the 'cal_fid' function to get the FID score of the corresponding candidate. The detailed output is as follows:

/home/hjsun/miniconda3/envs/autodiffusion1/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libjpeg.so.9: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Logging to /home/hjsun/_quantization/_dm/AutoDiffusion-data/task/trysearch1
creating model and diffusion...
2023-10-14 01:10:30.316425: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-10-14 01:10:30.354326: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-14 01:10:31.016762: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-14 01:10:31.881685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory: -> device: 0, name: NVIDIA Graphics Device, pci bus id: 0000:88:00.0, compute capability: 8.9
2023-10-14 01:10:31.906496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory: -> device: 0, name: NVIDIA Graphics Device, pci bus id: 0000:88:00.0, compute capability: 8.9
2023-10-14 01:10:32.409969: W tensorflow/core/framework/op_def_util.cc:369] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
population_num = 50 select_num = 10 mutation_num = 25 crossover_num = 10 random_num = 15 max_epochs = 15
sampling...
created 50 samples
created 100 samples
created 150 samples
created 200 samples
created 250 samples
created 300 samples
created 350 samples
created 400 samples
created 450 samples
created 500 samples
created 550 samples
created 600 samples
created 650 samples
created 700 samples
created 750 samples
created 800 samples
created 850 samples
created 900 samples
created 950 samples
created 1000 samples
sampling complete
computing sample batch activations...
2023-10-14 01:12:02.066193: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2023-10-14 01:12:03.842190: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8700
2023-10-14 01:12:03.864856: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-10-14 01:12:04.183810: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
computing/reading sample batch statistics...
Computing evaluations...
Traceback (most recent call last):
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/search_dynamic_unet_imagenet64_classifier_guidance_progressive.py", line 806, in
    searcher.search()
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/search_dynamic_unet_imagenet64_classifier_guidance_progressive.py", line 683, in search
    self.is_legal_before_search(str(init_cand))  # check whether this candidate has already been visited in info
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/search_dynamic_unet_imagenet64_classifier_guidance_progressive.py", line 359, in is_legal_before_search
    info['fid'] = self.get_cand_fid(args=self.args, cand=eval(cand))
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/search_dynamic_unet_imagenet64_classifier_guidance_progressive.py", line 451, in get_cand_fid
    fid = cal_fid(arr, 64, self.evaluator, ref_stats=self.ref_stats)
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/evaluations/evaluator_v1.py", line 757, in cal_fid
    fid = sample_stats.frechet_distance(ref_stats)
  File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/evaluations/evaluator_v1.py", line 153, in frechet_distance
    raise ValueError("Imaginary component {}".format(m))
ValueError: Imaginary component 8.354175847475648e+59

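For context, the kind of check that produces this message can be sketched as below. This is modeled on the commonly used reference FID computation and is an assumption about what 'frechet_distance' does, not a copy of 'evaluator_v1.py'; the function name `frechet_distance_sqrtm` and the `atol` default are illustrative.

```python
import numpy as np
from scipy import linalg


def frechet_distance_sqrtm(mu1, sigma1, mu2, sigma2, atol=1e-3):
    """Frechet distance between two Gaussians via scipy.linalg.sqrtm.

    Sketch only: assumed to resemble the evaluator's logic, not taken from it.
    """
    mu1 = np.asarray(mu1, dtype=np.float64)
    mu2 = np.asarray(mu2, dtype=np.float64)
    diff = mu1 - mu2
    # General (non-symmetric) matrix square root of the covariance product.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # Round-off-sized imaginary parts are tolerated; larger ones are an error.
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=atol):
            m = np.max(np.abs(covmean.imag))
            raise ValueError("Imaginary component {}".format(m))
        covmean = covmean.real
    return float(
        diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * np.trace(covmean)
    )


# Well-conditioned toy inputs: the square root is real and the check passes.
print(frechet_distance_sqrtm(np.zeros(4), np.eye(4), np.ones(4), 4 * np.eye(4)))
```

Under this reading, an imaginary component as large as 8.35e+59 would mean the matrix square root itself broke down numerically, not that the generated samples were merely poor.
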
lilijiangg commented 12 months ago

I am still uncertain about the root cause of the error you've encountered. I've recently uploaded the environment configuration file named “diffusion.yml”. This configuration was set up on a V100, and while using this environment, I didn’t encounter any bugs. Below is the record of my execution:

Logging to ./debug
creating model and diffusion...
2023-10-14 14:31:01.532304: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-10-14 14:31:01.565676: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194855000 Hz
2023-10-14 14:31:01.571126: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66d68150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-14 14:31:01.571185: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-10-14 14:31:01.572729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-10-14 14:31:01.588390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x94bd2c40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-14 14:31:01.588412: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB-LS, Compute Capability 7.0
2023-10-14 14:31:01.593274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:07:00.0 name: Tesla V100-SXM2-32GB-LS computeCapability: 7.0
coreClock: 1.44GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 772.48GiB/s
2023-10-14 14:31:01.593330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-14 14:31:01.593376: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-14 14:31:01.593410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-14 14:31:01.593442: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-14 14:31:01.593480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-14 14:31:01.593512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-14 14:31:01.598284: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:31:01.605631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2023-10-14 14:31:01.605663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-14 14:31:01.605670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2023-10-14 14:31:01.605675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2023-10-14 14:31:01.613032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 28695 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:07:00.0, compute capability: 7.0)
2023-10-14 14:31:01.642375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:07:00.0 name: Tesla V100-SXM2-32GB-LS computeCapability: 7.0
coreClock: 1.44GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 772.48GiB/s
2023-10-14 14:31:01.642410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-14 14:31:01.642423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-14 14:31:01.642432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-14 14:31:01.642447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-14 14:31:01.642463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-14 14:31:01.642476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-14 14:31:01.642498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:31:01.650680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2023-10-14 14:31:01.899976: W tensorflow/core/framework/op_def_util.cc:371] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
population_num = 50 select_num = 10 mutation_num = 25 crossover_num = 10 random_num = 15 max_epochs = 15
sampling...
created 50 samples
created 100 samples
created 150 samples
created 200 samples
created 250 samples
created 300 samples
created 350 samples
created 400 samples
created 450 samples
created 500 samples
created 550 samples
created 600 samples
created 650 samples
created 700 samples
created 750 samples
created 800 samples
created 850 samples
created 900 samples
created 950 samples
created 1000 samples
sampling complete
computing sample batch activations...
2023-10-14 14:34:57.064026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:34:58.113249: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
computing/reading sample batch statistics...
Computing evaluations...
reset_time: 0.0003044605255126953, sample_time: 231.99321126937866, fid_time: 38.06192636489868
cand: {'timesteps': [0, 800, 100, 900, 200, 300, 400, 500, 600, 700], 'skip_layers': [[], [], [], [], [], [], [], [], [], []]}, fid: 48.23126452558853
random select ........
sampling...

I hope this information can be of help to you.

sihouzi21c commented 10 months ago

I've fixed this problem with scipy==1.9.1 (reference: https://github.com/ahmadki/mlperf_sd_inference/issues/4). Thank you for your reply!
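
If pinning SciPy is not convenient, one alternative (a swapped-in technique, not what the repository does) is to avoid the general matrix square root entirely. For symmetric positive semi-definite covariances, the trace of the square root of sigma1·sigma2 equals the trace of the square root of sigma1^(1/2)·sigma2·sigma1^(1/2), which is symmetric and can be handled with `numpy.linalg.eigh`. The names `psd_sqrt` and `frechet_distance_eigh` below are hypothetical, and this sketch has not been validated against the evaluator's reference statistics.

```python
import numpy as np


def psd_sqrt(sigma):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    sigma = (sigma + sigma.T) / 2.0          # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(sigma)
    vals = np.clip(vals, 0.0, None)          # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def frechet_distance_eigh(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians using only symmetric solvers."""
    mu1 = np.asarray(mu1, dtype=np.float64)
    mu2 = np.asarray(mu2, dtype=np.float64)
    sigma1 = np.asarray(sigma1, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    diff = mu1 - mu2
    root1 = psd_sqrt(sigma1)
    inner = root1 @ sigma2 @ root1           # symmetric PSD in exact arithmetic
    inner = (inner + inner.T) / 2.0
    eigvals = np.clip(np.linalg.eigvalsh(inner), 0.0, None)
    tr_covmean = np.sqrt(eigvals).sum()      # trace of the matrix square root
    return float(
        diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean
    )


# Rank-deficient toy case: fewer samples (10, 12) than features (32).
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 32))
b = rng.normal(size=(12, 32))
print(frechet_distance_eigh(a.mean(0), np.cov(a, rowvar=False),
                            b.mean(0), np.cov(b, rowvar=False)))
```

Because every intermediate matrix is symmetrized and its eigenvalues are clamped at zero, the result is real by construction, even when the covariances are singular.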

lilijiangg / AutoDiffusion
