Out of Memory Error on A100 40GB GPUs with main_improved_diffusion_cifar10_conditional.sh and main_improved_diffusion_cifar10_conditional.sh

RoufaidaLaidi commented 10 months ago

Environment

PyTorch Version: 1.12.1
CUDA Version: 11.7
GPU Type: NVIDIA A100 40GB

Description

I am experiencing an out-of-memory (OOM) error when attempting to run the main_improved_diffusion_cifar10_conditional.sh and main_improved_diffusion_cifar10_conditional.sh scripts. Despite utilizing an NVIDIA A100 GPU with 40GB of memory, which should be sufficient for these tasks, the scripts consistently fail due to memory issues.

The expected behavior, based on documentation and typical usage for similar tasks, would not exceed the 40GB memory limit of the A100 GPU. However, even under normal conditions and with ample available memory, the scripts trigger an OOM error.

Attempts to Resolve

Ensured no other significant processes are consuming GPU memory.
Monitored memory usage to confirm that the OOM error occurs despite available memory.
Reduced batch size and num_samples_schedule (and related parameters)
Searched for similar issues or advice in the repository's issues section and online forums.

I appreciate any insights, suggestions, or updates that might help resolve this issue. Thank you for your attention to this matter and for the valuable resources provided.

Best regards, Roufaida Laidi

fjxmlzn commented 10 months ago

Hi Roufaida,

Thank you for your question! It is similar to the issue of https://github.com/microsoft/DPSDA/issues/2. The default parameter was run on 16 32GB V100 GPUs. Reducing the batch size, e.g., at this line https://github.com/microsoft/DPSDA/blob/79f0868b1a13769c1d16bba3a32f81502690e83a/scripts/main_improved_diffusion_cifar10_conditional.sh#L32, should be able to solve the problem. You can start by setting it to a small value, say 10, which should be well under the 40GB. Please let me know if it still doesn't work, and you can paste the complete console log here so that I have more information to debug it.

RoufaidaLaidi commented 10 months ago

Hi Zinan, Thank you for the reply. We have successfully executed the code on a single GPU by reducing the batch size, however the FID scores obtained were significantly higher than those reported in your paper.

To improve performance, we attempted to run the code on multiple GPUs (8 32GB V100 GPUs). We noticed that the code utilizes torch.nn.DataParallel, and we ensured that the --use_data_parallel flag is set to True. However, despite these settings, the training process seems to be confined to a single GPU.

As a sanity check, I have tested my environment with a toy code designed to run on multiple GPUs, which executed successfully and utilized all available GPUs as expected. This leads me to believe that my environment is correctly configured for multi-GPU training.

Given this context, I would greatly appreciate any recommendations or insights you might have on how to effectively run your code on multiple GPUs. Specifically, I'm looking for guidance on:

Any specific configurations or environment settings required for your code to utilize multiple GPUs.
Suggestions on how to ensure that torch.nn.DataParallel is being effectively used in your training script.
If multi-GPU training is expected to improve the FID scores.

fjxmlzn commented 10 months ago

Hi Roufaida,

Thank you for your message.

You mentioned two issues: one is that the FID scores are not good enough; the other is that the code does not utilize multi-GPU successfully. Note that these two issues are orthogonal: the batch sizes and whether to use multi-GPU should not impact the sample quality (e.g., FID) at all, as Private Evolution algorithm does not do any actual model training and the batch size instead is the inference batch size for the diffusion model APIs. (This answers your question 3.)

Below, I discuss the two questions separately.

About FID

The information you provided is not sufficient for me to debug the root cause. Could you please provide more details about:

Which command did you run?
What exactly the FID scores did you get? There should be a fid.csv file generated by the code. Could you please share it?

From our experience, Private Evolution algorithm is very stable--running the algorithm with different random seeds gets very similar FID scores. Providing more information above will help me in figuring out the cause of this issue.

About multi-GPU

You are right that the code use torch.nn.DataParallel to support multi-GPU inference. Without any special configurations, the code should be able to utilize all GPUs automatically. Again, could you please provide more details about

Which command did you run?
Could you please share the complete console log here so that I have more information to debug it?
Could you please share the output of print(torch.cuda.device_count())?

Thank you, Zinan

RoufaidaLaidi commented 9 months ago

Dear Zinan,

Thank you for your prompt response and willingness to assist with the issues we've been encountering. I'm providing more specific details as requested:

FID Score Issue

Command Executed: ./main_improved_diffusion_camelyon17_conditional.sh
Modifications to Default Parameters:

--num_samples_schedule 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000 \ --variation_degree_schedule 0,0,1,1,2,2,3,3,4,4 \ --num_fid_samples 1000 \ --num_private_samples 1000 \ --diffusion_steps 200 \ --batch_size 1 \ --data_loading_batch_size 1 \ --feature_extractor_batch_size 1 \ --fid_batch_size 1

- Additional Context: Full script file and fid.csv attached for review. main_improved_diffusion_camelyon17_conditional.txt fid.csv

Multi-GPU Utilization

Command Executed: Similar to the FID Score issue.
Console Log Warnings and Errors:

/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /cluster/home/laidir/.local/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs warn(f"Failed to load image Python extension: {e}") BLAS : Bad memory unallocation! : 256 0x7f7a93759000 BLAS : Bad memory unallocation! : 256 0x7f7a8b759000 BLAS : Bad memory unallocation! : 256 0x7f7a83759000 BLAS : Bad memory unallocation! : 256 0x7f7a7b759000 BLAS : Bad memory unallocation! : 256 0x7f7a73759000

GPU Count Output: Output of print(torch.cuda.device_count()) was 2, 4, or 8, depending on the machine used. nvidia-smi shows execution on only one GPU.

I appreciate your support.

Best regards, Roufaida

fjxmlzn commented 9 months ago

Hi Roufaida,

About FID

As the hyper-pamarters you used here are not what we used in the paper, of course you will get different results. Especially, you only used 1000 private samples, which is much fewer than what we used. It doesn't make sense to compare FID scores computed with different number of samples. If you want to reproduce the results in our paper, please follow the hyper-parameters reported in the paper. See hyperameters for Camelyon in Section J, CIFAR10 in Section I, Cat dataset in Section K of the paper https://openreview.net/pdf?id=YEhQs8POIo

About multi-GPU

From the error it seems like there might be some libary installation issue with torchvision. It has nothing to do with DPSDA code itself, and you might want to double-check how you installed the torchvision library to fix it.

Moreover, I am not sure if this error relates to the multi-GPU issue. For debugging purposes, I would suggest running the simplest DataParallel example in https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html?highlight=dataparallel and see if it is able to use multi-GPU. If not, you might want to post issues in pytorch github repo and seek help there.

If these still do not resolve the problem, feel free to reopen the issue.

microsoft / DPSDA