patrickbryant1 / SpeedPPI

Rapid protein-protein interaction network creation from multiple sequence alignments with Deep Learning

How to determine the memory requirement? #31

Open Rohit-Satyam opened 1 day ago

Rohit-Satyam commented 1 day ago

I am running SpeedPPI on GPU nodes, but some of the jobs run out of memory even with 250 GB allocated. The error says RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes, which means it was requesting about 16.5 GB. Even if I multiply that by 10, once per recycle, that comes to roughly 165 GB, which still leaves 85 GB spare. So I don't know what's happening.

E0000 00:00:1727970189.766616 2640766 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0000 00:00:1727970528.756185 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
I0000 00:00:1727970528.816756 2640766 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1491 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:89:00.0, compute capability: 7.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727970529.000277 2640766 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-10-03 18:50:37.111627: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 4.01GiB (4304358978 bytes) by rematerialization; only reduced to 13.96GiB (14990920828 bytes), down from 13.96GiB (14990923372 bytes) originally
2024-10-03 18:51:12.979005: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.37GiB (rounded to 16508718336)requested by op 
2024-10-03 18:51:12.979733: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ***********************************************_____________________________________________________
E1003 18:51:12.980048 2640766 pjrt_stream_executor_client.cc:3067] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
Traceback (most recent call last):
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 258, in <module>
    main(num_ensemble=1,
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/run_alphafold_single.py", line 222, in main
    prediction_result = model_runner.predict(processed_feature_dict)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ibex/scratch/projects/c2077/rohit/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 16508718128 bytes.
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
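A note on the arithmetic above: the 250 GB limit is presumably system RAM, while the RESOURCE_EXHAUSTED error is about GPU memory, so the two budgets are separate. For a rough sense of scale, activations in AlphaFold-style models grow quadratically with total sequence length for the pair representation and cubically for the triangle-attention logits. The sketch below is a back-of-envelope estimate only; the 128-channel pair width and 4 attention heads are assumptions taken from the standard AlphaFold architecture, not measured from SpeedPPI:

```python
def pair_activation_gib(total_len, channels=128, bytes_per=4):
    """Rough float32 size of one (L, L, C) pair activation, in GiB."""
    return total_len ** 2 * channels * bytes_per / 2 ** 30

def triangle_logit_gib(total_len, heads=4, bytes_per=4):
    """Rough size of the (heads, L, L, L) triangle-attention logits, in GiB."""
    return heads * total_len ** 3 * bytes_per / 2 ** 30

for L in (500, 1000, 2000):
    print(f"L={L}: pair ~{pair_activation_gib(L):.2f} GiB, "
          f"triangle logits ~{triangle_logit_gib(L):.1f} GiB")
```

Under these assumptions, a complex of roughly 1,000 total residues already puts the triangle-attention logits in the ~15 GiB range, which may be the allocation the BFC allocator is failing on here — so the sensitive variable is the combined chain length, not the host RAM request.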
patrickbryant1 commented 1 day ago

Hi,

This should not be possible. Please check that the GPU RAM is available for the current process and that cache is not built up.

Best,
Patrick
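One way to rule this out along the lines Patrick suggests (a sketch, not something from the SpeedPPI docs): the log shows the device being created with only 1491 MB, far below the 11 GB an RTX 2080 Ti carries, which suggests another process already holds most of the card. The snippet below checks free GPU memory via nvidia-smi and sets the unified-memory environment variables that the AlphaFold README recommends for long targets, letting XLA spill activations into host RAM; they must be set before jax is imported:

```python
import os
import shutil
import subprocess

# Ask the driver how much GPU memory is actually free (skip if
# nvidia-smi is not on PATH, e.g. on a login node).
if shutil.which("nvidia-smi"):
    free = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    print("Free GPU memory (MiB):", free.stdout.strip())

# Allow XLA to use unified memory so activations can spill into host
# RAM, and let the client request up to 4x the physical GPU memory.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"
# Optionally stop JAX from preallocating most of the card up front.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
```

With unified memory enabled the job may still be slow on an 11 GB card for large complexes, but it should trade the hard OOM for host-memory paging.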