Closed (sihouzi21c closed this issue 10 months ago)
Thank you for raising this issue. I have never encountered this situation before, so could you please describe the problem in more detail? For instance, what were your hyperparameter settings, and which time-step sequence were you validating when the problem occurred?
In my experiment, the hyperparameters are almost the same as those in 'search_dynamic_unet_imagenet64_classifier_guidance_progressive.sh'. The only difference is batch_size, which I changed because of GPU memory limits. The full command is:
MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"
SAMPLE_FLAGS="--batch_size 50 --num_samples 1000 --use_ddim True"
CUDA_VISIBLE_DEVICES=2 \
python search_dynamic_unet_imagenet64_classifier_guidance_progressive.py $MODEL_FLAGS --classifier_scale 1.0 \
    --classifier_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/checkpoints/64x64_classifier.pt --classifier_depth 4 \
    --model_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/checkpoints/64x64_diffusion.pt $SAMPLE_FLAGS \
    --ref_path /home/hjsun/_quantization/_dm/AutoDiffusion-data/fidstats/imagenet_ref_stats.pkl \
    --save_dir '/home/hjsun/_quantization/_dm/AutoDiffusion-data/task/trysearch1' \
    --time_step 10 \
    --max_epochs 15 \
    --population_num 50 \
    --mutation_num 25 \
    --crossover_num 10 \
    --seed 0 \
    --m_prob 0.25 \
    --use_ddim_init_x True \
    --use_dynamic_unet True \
    --index_step 580 \
    --max_prun=0.1 \
    --min_prun=0.0 \
    --MASTER_PORT '12345'
The error occurred in the init stage, which uses the 'cal_fid' function to get the FID score of the corresponding candidate. The detailed output is as follows:
/home/hjsun/miniconda3/envs/autodiffusion1/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libjpeg.so.9: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Logging to /home/hjsun/_quantization/_dm/AutoDiffusion-data/task/trysearch1
creating model and diffusion...
2023-10-14 01:10:30.316425: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-10-14 01:10:30.354326: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-14 01:10:31.016762: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-14 01:10:31.881685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory: -> device: 0, name: NVIDIA Graphics Device, pci bus id: 0000:88:00.0, compute capability: 8.9
2023-10-14 01:10:31.906496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory: -> device: 0, name: NVIDIA Graphics Device, pci bus id: 0000:88:00.0, compute capability: 8.9
2023-10-14 01:10:32.409969: W tensorflow/core/framework/op_def_util.cc:369] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
population_num = 50 select_num = 10 mutation_num = 25 crossover_num = 10 random_num = 15 max_epochs = 15
sampling...
created 50 samples
created 100 samples
created 150 samples
created 200 samples
created 250 samples
created 300 samples
created 350 samples
created 400 samples
created 450 samples
created 500 samples
created 550 samples
created 600 samples
created 650 samples
created 700 samples
created 750 samples
created 800 samples
created 850 samples
created 900 samples
created 950 samples
created 1000 samples
sampling complete
computing sample batch activations...
2023-10-14 01:12:02.066193: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2023-10-14 01:12:03.842190: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8700
2023-10-14 01:12:03.864856: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-10-14 01:12:04.183810: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
computing/reading sample batch statistics...
Computing evaluations...
Traceback (most recent call last):
File "/home/hjsun/_quantization/_dm/AutoDiffusion/examples/guided_diffusion/search_dynamic_unet_imagenet64_classifier_guidance_progressive.py", line 806, in
I am still uncertain about the root cause of the error you've encountered. I've recently uploaded the environment configuration file named “diffusion.yml”. This configuration was set up on a V100, and while using this environment, I didn’t encounter any bugs. Below is the record of my execution:
Logging to ./debug
creating model and diffusion...
2023-10-14 14:31:01.532304: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-10-14 14:31:01.565676: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194855000 Hz
2023-10-14 14:31:01.571126: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66d68150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-14 14:31:01.571185: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-10-14 14:31:01.572729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-10-14 14:31:01.588390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x94bd2c40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-14 14:31:01.588412: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB-LS, Compute Capability 7.0
2023-10-14 14:31:01.593274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:07:00.0 name: Tesla V100-SXM2-32GB-LS computeCapability: 7.0
coreClock: 1.44GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 772.48GiB/s
2023-10-14 14:31:01.593330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-14 14:31:01.593376: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-14 14:31:01.593410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-14 14:31:01.593442: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-14 14:31:01.593480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-14 14:31:01.593512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-14 14:31:01.598284: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:31:01.605631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2023-10-14 14:31:01.605663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-14 14:31:01.605670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2023-10-14 14:31:01.605675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2023-10-14 14:31:01.613032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 28695 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB-LS, pci bus id: 0000:07:00.0, compute capability: 7.0)
2023-10-14 14:31:01.642375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:07:00.0 name: Tesla V100-SXM2-32GB-LS computeCapability: 7.0
coreClock: 1.44GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 772.48GiB/s
2023-10-14 14:31:01.642410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-10-14 14:31:01.642423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-10-14 14:31:01.642432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-10-14 14:31:01.642447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-10-14 14:31:01.642463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-10-14 14:31:01.642476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-10-14 14:31:01.642498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:31:01.650680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2023-10-14 14:31:01.899976: W tensorflow/core/framework/op_def_util.cc:371] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
population_num = 50 select_num = 10 mutation_num = 25 crossover_num = 10 random_num = 15 max_epochs = 15
sampling...
created 50 samples
created 100 samples
created 150 samples
created 200 samples
created 250 samples
created 300 samples
created 350 samples
created 400 samples
created 450 samples
created 500 samples
created 550 samples
created 600 samples
created 650 samples
created 700 samples
created 750 samples
created 800 samples
created 850 samples
created 900 samples
created 950 samples
created 1000 samples
sampling complete
computing sample batch activations...
2023-10-14 14:34:57.064026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-10-14 14:34:58.113249: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
computing/reading sample batch statistics...
Computing evaluations...
reset_time: 0.0003044605255126953, sample_time: 231.99321126937866, fid_time: 38.06192636489868
cand: {'timesteps': [0, 800, 100, 900, 200, 300, 400, 500, 600, 700], 'skip_layers': [[], [], [], [], [], [], [], [], [], []]}, fid: 48.23126452558853
random select ........
sampling...
I hope this information can be of help to you.
Following https://github.com/ahmadki/mlperf_sd_inference/issues/4, I fixed this problem by using scipy==1.9.1. Thank you for your reply!
After running 'search_dynamic_unet_imagenet64_classifier_guidance_progressive.sh', I get 'ValueError: Imaginary component', which is raised by the 'frechet_distance' function in 'evaluator_v1.py'. I worked around the problem by setting num_samples to 2048, since [1] says the number of samples should be greater than 2048 to avoid it. Is there a way to fix the problem while keeping num_samples at 1000?
[1] https://github.com/lucidrains/denoising-diffusion-pytorch/issues/213
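For readers who hit the same error, here is a minimal sketch of one possible workaround for keeping a small num_samples. It is not the code in 'evaluator_v1.py' and not the author's method. The function name frechet_distance_eig is hypothetical. The idea rests on the identity tr(sqrt(S1 S2)) = sum of the square roots of the eigenvalues of S1 S2, so no matrix square root is formed. Assuming the features are 2048-dimensional, as in the usual Inception-based FID, a covariance estimated from 1000 samples is rank-deficient, which is a likely reason the matrix square root becomes numerically complex.

```python
import numpy as np


def frechet_distance_eig(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Hypothetical helper: computes tr(sqrt(sigma1 @ sigma2)) from the
    eigenvalues of the product instead of calling a matrix square root.
    """
    mu1 = np.atleast_1d(mu1).astype(np.float64)
    mu2 = np.atleast_1d(mu2).astype(np.float64)
    sigma1 = np.atleast_2d(sigma1).astype(np.float64)
    sigma2 = np.atleast_2d(sigma2).astype(np.float64)

    diff = mu1 - mu2

    # In exact arithmetic the product of two positive semi-definite
    # matrices has real, non-negative eigenvalues. Round-off can yield
    # tiny negative or imaginary parts; this sketch discards them.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    eigvals = np.clip(eigvals.real, 0.0, None)
    tr_covmean = np.sqrt(eigvals).sum()

    return float(
        diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean
    )


if __name__ == "__main__":
    # Rank-deficient example: 20 samples of 64-dimensional features.
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(20, 64))
    feats_b = rng.normal(loc=0.5, size=(20, 64))
    d = frechet_distance_eig(
        feats_a.mean(axis=0), np.cov(feats_a, rowvar=False),
        feats_b.mean(axis=0), np.cov(feats_b, rowvar=False),
    )
    print("Frechet distance:", d)
```

Discarding the imaginary and negative parts hides the numerical symptom rather than removing its cause, and an FID computed from fewer samples is still a noisier estimate, so scores obtained this way should only be compared with scores computed the same way and with the same num_samples.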