Hi Roufaida,
Thank you for your question! It is similar to the issue in https://github.com/microsoft/DPSDA/issues/2. The default parameters were run on 16 32GB V100 GPUs. Reducing the batch size, e.g., at this line https://github.com/microsoft/DPSDA/blob/79f0868b1a13769c1d16bba3a32f81502690e83a/scripts/main_improved_diffusion_cifar10_conditional.sh#L32, should solve the problem. You can start by setting it to a small value, say 10, which should stay well under 40GB. Please let me know if it still doesn't work; in that case, please paste the complete console log here so that I have more information to debug it.
Hi Zinan,
Thank you for the reply. We have successfully executed the code on a single GPU by reducing the batch size; however, the FID scores obtained were significantly higher than those reported in your paper.
To improve performance, we attempted to run the code on multiple GPUs (8 32GB V100 GPUs). We noticed that the code utilizes torch.nn.DataParallel, and we ensured that the --use_data_parallel flag is set to True. However, despite these settings, the training process seems to be confined to a single GPU.
As a sanity check, I have tested my environment with a toy code designed to run on multiple GPUs, which executed successfully and utilized all available GPUs as expected. This leads me to believe that my environment is correctly configured for multi-GPU training.
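(For reference, the toy check was along the lines of the sketch below: it simply places a tensor on every visible GPU and confirms, e.g., via nvidia-smi, that all devices show memory usage.)

# Toy multi-GPU sanity check (illustrative sketch, not the DPSDA code):
# allocate a tensor on each visible GPU and confirm all devices are used.
import torch

print("Visible GPUs:", torch.cuda.device_count())
tensors = [
    torch.randn(4096, 4096, device=f"cuda:{i}")
    for i in range(torch.cuda.device_count())
]
print([t.device for t in tensors])  # on 8 GPUs, this lists cuda:0 ... cuda:7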
Given this context, I would greatly appreciate any recommendations or insights you might have on how to effectively run your code on multiple GPUs. Specifically, I'm looking for guidance on:
Hi Roufaida,
Thank you for your message.
You mentioned two issues: one is that the FID scores are not good enough; the other is that the code does not utilize multiple GPUs. Note that these two issues are orthogonal: the batch sizes and whether multiple GPUs are used should not impact sample quality (e.g., FID) at all, as the Private Evolution algorithm does not do any actual model training; the batch size is instead the inference batch size for the diffusion model APIs. (This answers your question 3.)
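As a toy illustration (generic PyTorch, not the DPSDA code): with a fixed model and fixed inputs, running inference in smaller chunks produces the same outputs as one large batch, so the inference batch size cannot change the generated samples or the FID.

# Chunked inference yields the same outputs as one large batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16)).eval()
x = torch.randn(1000, 16)
with torch.no_grad():
    full = model(x)                                       # one batch of 1000
    chunked = torch.cat([model(c) for c in x.split(10)])  # 100 batches of 10
print(torch.allclose(full, chunked, atol=1e-6))  # True: identical outputs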
Below, I discuss the two questions separately.
The information you provided is not sufficient for me to debug the root cause. Could you please provide more details, e.g., the exact command you ran and any hyper-parameters you changed from the defaults?
From our experience, the Private Evolution algorithm is very stable: running it with different random seeds yields very similar FID scores. Providing the information requested above will help me figure out the cause of this issue.
You are right that the code uses torch.nn.DataParallel to support multi-GPU inference. Without any special configuration, the code should be able to utilize all GPUs automatically. Again, could you please provide more details? For example, what does the following return in your environment:
print(torch.cuda.device_count())
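For a slightly fuller check (generic PyTorch/CUDA, not part of the DPSDA code), you could also run:

import os
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))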
Thank you,
Zinan
Dear Zinan,
Thank you for your prompt response and willingness to assist with the issues we've been encountering. I'm providing more specific details as requested:
Command Executed: ./main_improved_diffusion_camelyon17_conditional.sh
Modifications to Default Parameters:
--num_samples_schedule 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000 \
--variation_degree_schedule 0,0,1,1,2,2,3,3,4,4 \
--num_fid_samples 1000 \
--num_private_samples 1000 \
--diffusion_steps 200 \
--batch_size 1 \
--data_loading_batch_size 1 \
--feature_extractor_batch_size 1 \
--fid_batch_size 1
Additional Context: Full script file (main_improved_diffusion_camelyon17_conditional.txt) and fid.csv attached for review.
Command Executed: Similar to the FID Score issue.
Console Log Warnings and Errors:
/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /cluster/home/laidir/.local/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs
warn(f"Failed to load image Python extension: {e}")
BLAS : Bad memory unallocation! : 256 0x7f7a93759000
BLAS : Bad memory unallocation! : 256 0x7f7a8b759000
BLAS : Bad memory unallocation! : 256 0x7f7a83759000
BLAS : Bad memory unallocation! : 256 0x7f7a7b759000
BLAS : Bad memory unallocation! : 256 0x7f7a73759000
I appreciate your support.
Best regards, Roufaida
Hi Roufaida,
As the hyper-parameters you used here are not the ones we used in the paper, you will naturally get different results. In particular, you only used 1000 private samples, which is far fewer than what we used, and it doesn't make sense to compare FID scores computed with different numbers of samples. If you want to reproduce the results in our paper, please follow the hyper-parameters reported there: see the hyper-parameters for Camelyon17 in Section J, CIFAR10 in Section I, and the Cat dataset in Section K of the paper https://openreview.net/pdf?id=YEhQs8POIo
From the error, it seems like there might be a library installation issue with torchvision. It has nothing to do with the DPSDA code itself; you might want to double-check how you installed the torchvision library to fix it.
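(A common cause of this kind of "undefined symbol" warning is a version mismatch between torch and torchvision; a quick first check, assuming both packages import, is:)

import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)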
Moreover, I am not sure whether this error relates to the multi-GPU issue. For debugging purposes, I would suggest running the simplest DataParallel example in https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html?highlight=dataparallel and seeing whether it is able to use multiple GPUs. If not, you might want to open an issue in the PyTorch GitHub repo and seek help there.
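For convenience, below is a minimal smoke test along the lines of that tutorial (a sketch, not DPSDA code); if it only keeps one GPU busy, the problem lies in the environment rather than in this repository.

# Minimal torch.nn.DataParallel smoke test.
import torch
import torch.nn as nn

model = nn.Linear(256, 256).cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the module across all visible GPUs

x = torch.randn(8192, 256).cuda()
out = model(x)  # the batch dimension is split across the GPUs
print("GPUs seen by PyTorch:", torch.cuda.device_count(), "| output shape:", out.shape)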
If these still do not resolve the problem, feel free to reopen the issue.
Environment
Description
I am experiencing an out-of-memory (OOM) error when attempting to run the main_improved_diffusion_cifar10_conditional.sh script. Despite utilizing an NVIDIA A100 GPU with 40GB of memory, which should be sufficient for this task, the script consistently fails due to memory issues. Based on the documentation and typical usage for similar tasks, the expected memory consumption should not exceed the 40GB limit of the A100 GPU. However, even under normal conditions and with ample available memory, the script triggers an OOM error.
Attempts to Resolve
I appreciate any insights, suggestions, or updates that might help resolve this issue. Thank you for your attention to this matter and for the valuable resources provided.
Best regards, Roufaida Laidi