intfloat / SimKGC

ACL 2022, SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models

CUDA out of memory #43

Open lijun-1999 opened 6 months ago

lijun-1999 commented 6 months ago

Hello, I noticed that your README.md says a "CUDA out of memory" issue may be due to limited hardware resources. However, I am using a server with 4 V100 GPUs, so why am I still facing this problem? Moreover, reducing the batch size did not resolve it.

intfloat commented 6 months ago

Do your V100 GPUs have 32 GB or 16 GB of memory? The training part requires 32 GB V100 GPUs.

And have you made any changes to the code?
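For reference, one quick way to confirm whether the cards are 16 GB or 32 GB parts is to query them from PyTorch. This is a minimal sketch, independent of SimKGC; it only assumes `torch` is installed and the GPUs are visible to the process:

```python
import torch

# Print name and total memory for every visible GPU, so it is clear
# whether these are 16 GB or 32 GB V100s.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```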

lijun-1999 commented 6 months ago

Dear Professor, hello! I am using a server with 4 V100 GPUs (each with 32 GB of memory). When I run the SimKGC project in this environment without any modifications, I encounter the following error:

Traceback (most recent call last):
  File "main.py", line 22, in <module>
    main()
  File "main.py", line 18, in main
    trainer.train_loop()
  File "/root/autodl-tmp/SimKGC-main/trainer.py", line 76, in train_loop
    self.train_epoch(epoch)
  File "/root/autodl-tmp/SimKGC-main/trainer.py", line 151, in train_epoch
    outputs = self.model(batch_dict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/SimKGC-main/models.py", line 66, in forward
    hr_vector = self._encode(self.hr_bert,
  File "/root/autodl-tmp/SimKGC-main/models.py", line 47, in _encode
    outputs = encoder(input_ids=token_ids,
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 996, in forward
    encoder_outputs = self.encoder(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 585, in forward
    layer_outputs = layer_module(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 472, in forward
    self_attention_outputs = self.attention(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 402, in forward
    self_outputs = self.self(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 340, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling result
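For context, a `CUBLAS_STATUS_INTERNAL_ERROR` raised inside `torch.matmul` is often a secondary symptom: it can appear when GPU memory is already exhausted and cuBLAS cannot allocate its workspace, or when the installed PyTorch build does not match the machine's CUDA driver/toolkit. The sketch below is a minimal sanity check that is independent of SimKGC; it assumes `transformers` is installed and can load `bert-base-uncased`, and runs one small BERT forward pass on each GPU to help separate the two cases:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("a small test sentence", return_tensors="pt")

for i in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{i}")
    model = BertModel.from_pretrained("bert-base-uncased").to(device)
    model.eval()
    with torch.no_grad():
        out = model(**{k: v.to(device) for k, v in inputs.items()})
    # If this prints for every GPU, the PyTorch / CUDA / cuBLAS stack itself works,
    # and the training failure is more likely a genuine memory problem.
    print(f"cuda:{i} forward OK, last_hidden_state shape = {tuple(out.last_hidden_state.shape)}")
    del model
    torch.cuda.empty_cache()
```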

Previously, in an environment other than the 4 V100 server, I hit a different error and modified the code to fix it. After moving to the 4 V100 environment with those changes, I started encountering the "CUDA out of memory" error. Could you please give me some guidance on how to address this issue? Best regards!
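One way to check whether memory is actually the bottleneck is to log per-GPU usage right before the forward pass that fails. The helper below is a hypothetical addition, not part of the repo; it could be called just before `outputs = self.model(batch_dict)` in `trainer.py`:

```python
import torch

def log_gpu_memory(prefix: str = "") -> None:
    """Print allocated/reserved memory for every visible GPU, in GiB."""
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024 ** 3
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 3
        print(f"{prefix}cuda:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```

If the reported numbers stay well below 32 GiB at the moment the error fires, the cause is more likely a PyTorch/CUDA version mismatch than true memory exhaustion, and reducing the batch size further would not be expected to help.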
