Closed: simonushie closed this issue 9 months ago
Prerequisites
- [x] I have read the documentation.
- [x] I have checked other issues for similar problems.
Backend
Colab
Interface Used
UI
CLI Command
No response
UI Screenshots & Parameters
No response
Error Logs
```
INFO Namespace(version=False, revision=None, tokenizer=None, image_path='images/', class_image_path=None, prompt='photo of VM', class_prompt=None, num_class_images=100, class_labels_conditioning=None, prior_preservation=None, prior_loss_weight=1.0, resolution=1024, center_crop=None, train_text_encoder=None, sample_batch_size=4, num_steps=500, checkpointing_steps=100000, resume_from_checkpoint=None, scale_lr=None, scheduler='constant', warmup_steps=0, num_cycles=1, lr_power=1.0, dataloader_num_workers=0, use_8bit_adam=None, adam_beta1=0.9, adam_beta2=0.999, adam_weight_decay=0.01, adam_epsilon=1e-08, max_grad_norm=1.0, allow_tf32=None, prior_generation_precision=None, local_rank=-1, xformers=None, pre_compute_text_embeddings=None, tokenizer_max_length=None, text_encoder_use_attention_mask=None, rank=4, xl=None, fp16=True, bf16=None, validation_prompt=None, num_validation_images=4, validation_epochs=50, checkpoints_total_limit=None, validation_images=None, logging=None, train=None, deploy=None, inference=None, username=None, backend='local-cli', token=None, repo_id=None, push_to_hub=None, model='stabilityai/stable-diffusion-xl-base-1.0', project_name='linkedin_pfp_project', seed=42, epochs=1, gradient_accumulation=4, disable_gradient_checkpointing=None, lr=0.0001, log='none', data_path=None, train_split='train', valid_split=None, batch_size=6, func=<function run_dreambooth_command_factory at 0x785457122d40>)
INFO Running DreamBooth Training
WARNING Parameters supplied but not used: train_split, data_path, version, inference, train, log, valid_split, backend, func, deploy
INFO Dataset: linkedin_pfp_project (dreambooth)
INFO Saving concept images
INFO images/IMG_20240120_170036.jpg
INFO Saving concept images
INFO images/IMG_20240120_175829.jpg
INFO Saving concept images
INFO images/IMG_20240120_175715.jpg
INFO Saving concept images
INFO images/IMG_20240120_175717.jpg
INFO Saving concept images
INFO images/IMG_20240120_175908.jpg
INFO Saving concept images
INFO images/IMG_20240120_175752.jpg
INFO Saving concept images
INFO images/IMG_20240120_175956.jpg
INFO Saving concept images
INFO images/IMG_20240120_170011.jpg
INFO Saving concept images
INFO images/IMG_20240120_175954.jpg
INFO Saving concept images
INFO images/IMG_20240120_175735.jpg
INFO Saving concept images
INFO images/IMG_20240120_175748.jpg
INFO Starting local training...
INFO {"model":"stabilityai/stable-diffusion-xl-base-1.0","revision":null,"tokenizer":null,"image_path":"linkedin_pfp_project/autotrain-data","class_image_path":null,"prompt":"photo of VM","class_prompt":null,"num_class_images":100,"class_labels_conditioning":null,"prior_preservation":false,"prior_loss_weight":1.0,"project_name":"linkedin_pfp_project","seed":42,"resolution":1024,"center_crop":false,"train_text_encoder":false,"batch_size":6,"sample_batch_size":4,"epochs":1,"num_steps":500,"checkpointing_steps":100000,"resume_from_checkpoint":null,"gradient_accumulation":4,"disable_gradient_checkpointing":false,"lr":0.0001,"scale_lr":false,"scheduler":"constant","warmup_steps":0,"num_cycles":1,"lr_power":1.0,"dataloader_num_workers":0,"use_8bit_adam":false,"adam_beta1":0.9,"adam_beta2":0.999,"adam_weight_decay":0.01,"adam_epsilon":1e-8,"max_grad_norm":1.0,"allow_tf32":false,"prior_generation_precision":null,"local_rank":-1,"xformers":false,"pre_compute_text_embeddings":false,"tokenizer_max_length":null,"text_encoder_use_attention_mask":false,"rank":4,"xl":true,"fp16":true,"bf16":false,"token":null,"repo_id":null,"push_to_hub":false,"username":null,"validation_prompt":null,"num_validation_images":4,"validation_epochs":50,"checkpoints_total_limit":null,"validation_images":null,"logging":false}
INFO ['python', '-m', 'autotrain.trainers.dreambooth', '--training_config', 'linkedin_pfp_project/training_params.json']
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
🚀 INFO | 2024-02-11 18:04:55 | autotrain.trainers.dreambooth.utils:enable_gradient_checkpointing:298 - Enabling gradient checkpointing.
🚀 INFO | 2024-02-11 18:04:55 | autotrain.trainers.dreambooth.trainer:compute_text_embeddings:140 - Computing text embeddings for prompt: photo of VM
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:124 - Running training
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:125 - Num examples = 11
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:126 - Num batches each epoch = 2
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:127 - Num Epochs = 500
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:128 - Instantaneous batch size per device = 6
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:129 - Total train batch size (w. parallel, distributed & accumulation) = 24
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:130 - Gradient Accumulation steps = 4
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:131 - Total optimization steps = 500
🚀 INFO | 2024-02-11 18:04:56 | autotrain.trainers.dreambooth.trainer:init:132 - Training config = {'model': 'stabilityai/stable-diffusion-xl-base-1.0', 'revision': None, 'tokenizer': None, 'image_path': 'linkedin_pfp_project/autotrain-data/concept1', 'class_image_path': None, 'prompt': 'photo of VM', 'class_prompt': None, 'num_class_images': 100, 'class_labels_conditioning': None, 'prior_preservation': False, 'prior_loss_weight': 1.0, 'project_name': 'linkedin_pfp_project', 'seed': 42, 'resolution': 1024, 'center_crop': False, 'train_text_encoder': False, 'batch_size': 6, 'sample_batch_size': 4, 'epochs': 500, 'num_steps': 500, 'checkpointing_steps': 100000, 'resume_from_checkpoint': None, 'gradient_accumulation': 4, 'disable_gradient_checkpointing': False, 'lr': 0.0001, 'scale_lr': False, 'scheduler': 'constant', 'warmup_steps': 0, 'num_cycles': 1, 'lr_power': 1.0, 'dataloader_num_workers': 0, 'use_8bit_adam': False, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_weight_decay': 0.01, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'allow_tf32': False, 'prior_generation_precision': None, 'local_rank': -1, 'xformers': False, 'pre_compute_text_embeddings': False, 'tokenizer_max_length': None, 'text_encoder_use_attention_mask': False, 'rank': 4, 'xl': True, 'fp16': True, 'bf16': False, 'token': None, 'repo_id': None, 'push_to_hub': False, 'username': None, 'validation_prompt': None, 'num_validation_images': 4, 'validation_epochs': 50, 'checkpoints_total_limit': None, 'validation_images': None, 'logging': False}
Steps: 0% 0/500 [00:00<?, ?it/s]
❌ ERROR | 2024-02-11 18:04:58 | autotrain.trainers.common:wrapper:91 - train has failed due to an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/common.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/dreambooth/main.py", line 312, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/autotrain/trainers/dreambooth/trainer.py", line 383, in train
    model_input = self.vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 260, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 141, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1247, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 608, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2072, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacty of 14.75 GiB of which 1.56 GiB is free. Process 85234 has 13.19 GiB memory in use. Of the allocated memory 12.83 GiB is allocated by PyTorch, and 231.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
❌ ERROR | 2024-02-11 18:04:58 | autotrain.trainers.common:wrapper:92 - CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacty of 14.75 GiB of which 1.56 GiB is free. Process 85234 has 13.19 GiB memory in use. Of the allocated memory 12.83 GiB is allocated by PyTorch, and 231.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0% 0/500 [00:02<?, ?it/s]
```
Additional Information
I have tried reducing the batch size from the original value of 30, first to 15 and then to 6.
I have also tried running the following code to free GPU memory, but it has no effect:
```python
import torch

# Clear GPU memory
torch.cuda.empty_cache()
```
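Note that `torch.cuda.empty_cache()` only returns cached blocks that nothing references anymore to the driver, so it cannot help when the model weights and activations themselves exceed the card. The OOM message above points at the allocator instead; here is a minimal sketch of that suggestion, assuming the variable is set before the training process initializes CUDA. The 128 MiB value is only an illustrative choice, not a recommendation from the error message:

```python
import os

# Assumption: set this before CUDA is initialized, e.g. at the top of the
# Colab cell that launches the training run. max_split_size_mb stops the
# caching allocator from splitting blocks larger than this size, which can
# reduce fragmentation (per the PYTORCH_CUDA_ALLOC_CONF hint in the error).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```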
I'm not using Google Colab Pro; I'm currently on the free tier.
What other tweaks can I make to the parameters to run this training successfully?
Update: after much trial and error, a batch size of just 1 did the trick on the Google Colab free tier.
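For anyone who hits the same wall, here is a minimal sketch of applying that fix by editing the `training_params.json` that the logged launch command points at. The file path and key names are taken from the log above; the exact values are assumptions to tune for your GPU:

```python
import json

# Path taken from the launch command in the log:
#   python -m autotrain.trainers.dreambooth --training_config linkedin_pfp_project/training_params.json
path = "linkedin_pfp_project/training_params.json"

with open(path) as f:
    params = json.load(f)

# Drop the per-device batch size to 1 (what finally fit on the free-tier GPU)
# and keep gradient accumulation at 4, so the effective batch size is 1 * 4 = 4.
params["batch_size"] = 1
params["gradient_accumulation"] = 4

with open(path, "w") as f:
    json.dump(params, f, indent=2)
```

With these values only one sample's activations are resident on the GPU at a time, while gradient accumulation preserves a reasonable effective batch size.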