Closed superantichrist closed 1 year ago
Hi, this is related to your infrastructure. Some of your sequence pairs don't fit on your GPU, which causes this OOM. Try to figure out which ones these are and rerun the predictions without them, or obtain a GPU with more RAM.
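One way to screen out the sequences that are too long before rerunning is to filter the input FASTA by length. A minimal sketch (the cutoff, file names, and helper names here are placeholders, not part of SpeedPPI itself; tune the cutoff to whatever your GPU handled successfully):

```python
# Filter a FASTA file, keeping only sequences at or below a length cutoff.
# The cutoff (1000 residues) is a placeholder; pick it based on which
# pairs OOM'd in your run and how much GPU RAM you have.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def filter_fasta(in_path, out_path, max_len=1000):
    """Write sequences no longer than max_len; return the dropped headers."""
    dropped = []
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) <= max_len:
                out.write(f"{header}\n{seq}\n")
            else:
                dropped.append(header)
    return dropped
```

Then rerun the all-vs-all prediction on the filtered file. If you cannot change GPUs, lowering JAX's GPU preallocation (e.g. `XLA_PYTHON_CLIENT_PREALLOCATE=false`) may also help at the margin, though it will not rescue a tensor that simply does not fit.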
I ran the all-vs-all prediction; pred1 and pred2 completed, but after that I got a memory error.
a_stack/triangle_multiplication_outgoing/gating_linear/...cb,cd->...db/jit(_einsum)/dot_general[dimension_numbers=(((0,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/m2/SpeedPPI/src/alphafold/model/common_modules.py" source_line=76 XLA Label: custom-call Shape: f32[128,1317904]
The stack trace below excludes JAX-internal frames. The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./src/run_alphafold_all_vs_all.py", line 306, in <module>
main(num_ensemble=1,
File "./src/run_alphafold_all_vs_all.py", line 269, in main
prediction_result = model_runner.predict(processed_feature_dict)
File "/m2/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 8130147888 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.59GiB
constant allocation: 38.6KiB
maybe_live_out allocation: 373.95MiB
preallocated temp allocation: 7.57GiB
total allocation: 10.53GiB
total fragmentation: 162.30MiB (1.51%)
Peak buffers:
Buffer 1:
Size: 1.40GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1148,64]
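The reported peak-buffer size follows directly from that shape: an f32[5120,1148,64] tensor at 4 bytes per element is about 1.40 GiB, so the peak buffers grow with sequence length (the 1148 dimension here). A quick sanity check:

```python
# Sanity-check the peak buffer size reported by XLA:
# an f32 tensor of shape [5120, 1148, 64] at 4 bytes per element.
num_elements = 5120 * 1148 * 64
size_bytes = num_elements * 4       # float32 = 4 bytes
size_gib = size_bytes / 2**30       # bytes -> GiB
print(f"{size_gib:.2f} GiB")        # matches the reported 1.40 GiB
```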
mkdir: cannot create directory ‘./data/dev/all_vs_all/pred3/’: File exists
Running pred 3 out of 5
Evaluating pair 4IFD_C-4IFD_J
2023-04-20 16:17:37.870533: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6688115632 exceeds 10% of free system memory.
2023-04-20 16:17:37.941510: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6932018392 exceeds 10% of free system memory.
2023-04-20 16:17:43.933476: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6932018392 exceeds 10% of free system memory.
2023-04-20 16:17:47.914388: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6932018392 exceeds 10% of free system memory.
2023-04-20 16:17:51.906020: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6932018392 exceeds 10% of free system memory.
/m2/SpeedPPI/src/alphafold/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/m2/SpeedPPI/src/alphafold/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/m2/SpeedPPI/src/alphafold/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
2023-04-20 16:20:11.643339: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.53GiB (rounded to 10228218624) requested by op
2023-04-20 16:20:11.643664: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:492] *****___
2023-04-20 16:20:11.649408: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 10228218416 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.86GiB
constant allocation: 38.7KiB
maybe_live_out allocation: 462.23MiB
preallocated temp allocation: 9.53GiB
preallocated temp fragmentation: 784.48MiB (8.04%)
total allocation: 12.83GiB
total fragmentation: 1.17GiB (9.13%)
Peak buffers:
Buffer 1:
Size: 1.57GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1286,64]
Traceback (most recent call last):
File "./src/run_alphafold_all_vs_all.py", line 306, in <module>
main(num_ensemble=1,
File "./src/run_alphafold_all_vs_all.py", line 269, in main
prediction_result = model_runner.predict(processed_feature_dict)
File "/m2/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/api.py", line 623, in cache_miss
out_flat = call_bind_continuation(execute(args_flat))
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
out_flat = compiled.execute(in_flat)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 10228218416 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.86GiB
constant allocation: 38.7KiB
maybe_live_out allocation: 462.23MiB
preallocated temp allocation: 9.53GiB
preallocated temp fragmentation: 784.48MiB (8.04%)
total allocation: 12.83GiB
total fragmentation: 1.17GiB (9.13%)
Peak buffers:
Buffer 1:
Size: 1.57GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1286,64]
The stack trace below excludes JAX-internal frames. The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./src/run_alphafold_all_vs_all.py", line 306, in <module>
main(num_ensemble=1,
File "./src/run_alphafold_all_vs_all.py", line 269, in main
prediction_result = model_runner.predict(processed_feature_dict)
File "/m2/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 10228218416 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.86GiB
constant allocation: 38.7KiB
maybe_live_out allocation: 462.23MiB
preallocated temp allocation: 9.53GiB
preallocated temp fragmentation: 784.48MiB (8.04%)
total allocation: 12.83GiB
total fragmentation: 1.17GiB (9.13%)
Peak buffers:
Buffer 1:
Size: 1.57GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1286,64]
mkdir: cannot create directory ‘./data/dev/all_vs_all/pred4/’: File exists
Running pred 4 out of 5
Evaluating pair 4IFD_J-4IFD_A
2023-04-20 16:21:01.340203: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 9673627392 exceeds 10% of free system memory.
2023-04-20 16:21:01.406058: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 10055011200 exceeds 10% of free system memory.
2023-04-20 16:21:33.834163: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 10055011200 exceeds 10% of free system memory.
2023-04-20 16:21:39.553369: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 10055011200 exceeds 10% of free system memory.
2023-04-20 16:21:45.237304: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 10055011200 exceeds 10% of free system memory.
/m2/SpeedPPI/src/alphafold/model/mapping.py:49: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  values_tree_def = jax.tree_flatten(values)[1]
/m2/SpeedPPI/src/alphafold/model/mapping.py:53: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  return jax.tree_unflatten(values_tree_def, flat_axes)
/m2/SpeedPPI/src/alphafold/model/mapping.py:124: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_sizes = jax.tree_flatten(in_sizes)[0]
2023-04-20 16:24:19.842373: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 8.41GiB (rounded to 9031036416) requested by op
2023-04-20 16:24:19.844742: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:492] ****____
2023-04-20 16:24:19.850298: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9031036208 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.78GiB
constant allocation: 38.6KiB
maybe_live_out allocation: 436.99MiB
preallocated temp allocation: 8.41GiB
total allocation: 11.62GiB
total fragmentation: 324.96MiB (2.73%)
Peak buffers:
Buffer 1:
Size: 1.52GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1248,64]
Traceback (most recent call last):
File "./src/run_alphafold_all_vs_all.py", line 306, in <module>
main(num_ensemble=1,
File "./src/run_alphafold_all_vs_all.py", line 269, in main
prediction_result = model_runner.predict(processed_feature_dict)
File "/m2/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/api.py", line 623, in cache_miss
out_flat = call_bind_continuation(execute(args_flat))
File "/home/numu/anaconda3/envs/SpeedPPI/lib/python3.8/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
out_flat = compiled.execute(in_flat)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9031036208 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.78GiB
constant allocation: 38.6KiB
maybe_live_out allocation: 436.99MiB
preallocated temp allocation: 8.41GiB
total allocation: 11.62GiB
total fragmentation: 324.96MiB (2.73%)
Peak buffers:
Buffer 1:
Size: 1.52GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1248,64]
The stack trace below excludes JAX-internal frames. The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./src/run_alphafold_all_vs_all.py", line 306, in <module>
main(num_ensemble=1,
File "./src/run_alphafold_all_vs_all.py", line 269, in main
prediction_result = model_runner.predict(processed_feature_dict)
File "/m2/SpeedPPI/src/alphafold/model/model.py", line 133, in predict
result = self.apply(self.params, jax.random.PRNGKey(0), feat)
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9031036208 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 2.78GiB
constant allocation: 38.6KiB
maybe_live_out allocation: 436.99MiB
preallocated temp allocation: 8.41GiB
total allocation: 11.62GiB
total fragmentation: 324.96MiB (2.73%)
Peak buffers:
Buffer 1:
Size: 1.52GiB
Operator: op_name="jit(apply_fn)/jit(main)/alphafold/while/body/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/outer_product_mean/layer_norm_input/jit(_var)/reduce_sum[axes=(2,)]" source_file="/m2/SpeedPPI/src/alphafold/model/modules.py" source_line=1446
XLA Label: fusion
Shape: f32[5120,1248,64]
mkdir: cannot create directory ‘./data/dev/all_vs_all/pred5/’: File exists
Running pred 5 out of 5
Saved all PPIs before filtering on pDockQ to ./data/dev/all_vs_all/all_ppis_unfiltered.csv
Filtered PPI network on pDockQ > 0.5 resulting in 1 interactions.
Saved all PPIs after filtering on pDockQ to ./data/dev/all_vs_all/ppis_filtered.csv
mkdir: cannot create directory ‘./data/dev/all_vs_all/high_confidence_preds/’: File exists
Moved all high confidence predictions to ./data/dev/all_vs_all/high_confidence_preds/