Open georgkempf opened 1 year ago
The most likely reason is indeed out of memory: with single-precision inference, ~5000 residues is about the limit on a 40 GB card. It is recommended to use --inplace --chunk_size 1.
You may need to set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:15000 to run inference on such an extremely long sequence, or switch to bfloat16 for inference.
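For reference, a minimal sketch of applying these settings from Python rather than the shell (the 15000 value is just the suggestion above, not a tuned number, and the bf16 cast is a hypothetical illustration). The allocator only reads PYTORCH_CUDA_ALLOC_CONF when CUDA is first initialized, so both variables must be set before any CUDA work:

```python
import os

# Must be set before the first CUDA allocation, or it is silently ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:15000"
# Make kernel launches synchronous so errors surface at the faulting call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch                        # import torch only after the variables are set
# model = model.to(torch.bfloat16)    # hypothetical: cast weights to bf16
```

Equivalently, both variables can be exported in the shell before launching inference.py.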
I tried again with --inplace --chunk_size 1 and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:15000, but I still get the same error. The card has 80 GB of memory. Is my understanding correct that using 2 GPUs would speed up the job but not raise the memory limit to 160 GB for a long sequence? What would be the best way to switch to bf16?
CUDA execution is asynchronous, so you need to set CUDA_LAUNCH_BLOCKING=1 to locate the bug. Alternatively, you can send us the FASTA file so we can reproduce it.
This is the traceback with CUDA_LAUNCH_BLOCKING=1. I saw in the installation instructions that CUDA >= 11.4 is suggested for building Triton, but environment.yml installs cudatoolkit 11.3, and there doesn't seem to be a ColossalAI release for CUDA > 11.3. Could this cause any problems?
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 264, in iteration
template_embeds = self.template_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 339, in forward
pair_act = self.template_pair_embedder(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/embedders_multimer.py", line 215, in forward
query_embedding = self.query_embedding_layer_norm(query_embedding)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/fastnn/kernel/layer_norm.py", line 38, in forward
return LayerNormTritonFunc.apply(input, self.normalized_shape, self.weight, self.bias,
File ".../.../FastFold/fastfold/model/fastnn/kernel/triton/layer_norm.py", line 164, in forward
_layer_norm_fwd_fused[(M,)](
File ".../fastfold/lib/python3.8/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "<string>", line 23, in _layer_norm_fwd_fused
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
The bug has been fixed in https://github.com/hpcaitech/FastFold/pull/103 and will be merged into the main branch soon.
Great, thanks a lot! It ran for some time, but then another error occurred.
Command line args were:
--gpus 4 --inplace --chunk_size 1
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 373, in iteration
m, z, s = self.evoformer.inplace(
File ".../.../FastFold/fastfold/model/fastnn/evoformer.py", line 319, in inplace
m, z = checkpoint_blocks(
File ".../.../FastFold/fastfold/utils/checkpointing.py", line 73, in checkpoint_blocks
return exec(blocks, args)
File ".../.../FastFold/fastfold/utils/checkpointing.py", line 60, in exec
a = wrap(block(*a))
File ".../.../FastFold/fastfold/model/fastnn/evoformer.py", line 131, in inplace
z = self.communication.inplace(m[0], msa_mask, z)
File ".../.../FastFold/fastfold/model/fastnn/ops.py", line 206, in inplace
left_act = M_mask_col * left_act
RuntimeError: The size of tensor a (1282) must match the size of tensor b (5128) at non-singleton dimension 2
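As a side observation (my own reading of the numbers, not confirmed by the traceback): the two mismatched sizes differ by exactly the GPU count used here (--gpus 4), which would be consistent with a per-rank shard of the activation meeting an unsharded mask:

```python
# Sizes taken from the RuntimeError message above.
shard_len = 1282   # size of tensor a at dim 2 (per-GPU shard?)
full_len = 5128    # size of tensor b at dim 2 (full sequence width?)
num_gpus = 4       # from the command line: --gpus 4

# The mismatch factors exactly into the GPU count, suggesting the mask
# was not partitioned along the same axis as the activation.
assert full_len == shard_len * num_gpus
print(full_len // shard_len)  # 4
```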
Thanks again for the super fast fix. It now runs for some hours, but at some point it still crashes with OOM. I have already set max_split_size down to 1000 MB. Is it possible to globally change the precision to fp16?
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File ".../fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File ".../.../FastFold/inference.py", line 136, in inference_model
out = model(batch)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File ".../.../FastFold/fastfold/model/hub/alphafold.py", line 389, in iteration
outputs["sm"] = self.structure_module(
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 886, in forward
outputs = self._forward_multimer(s, z, aatype, mask)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 825, in _forward_multimer
s = s + self.ipa(s, z, rigids, mask)
File ".../fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File ".../.../FastFold/fastfold/model/nn/structure_module.py", line 397, in forward
pt_att = sum([c**2 for c in pt_att])
RuntimeError: CUDA out of memory. Tried to allocate 4.70 GiB (GPU 0; 79.21 GiB total capacity; 77.87 GiB already allocated; 103.12 MiB free; 77.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
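As a back-of-envelope check (assuming the standard AlphaFold IPA hyperparameters of 12 heads and 4 query points, which FastFold may or may not use unchanged), the failing 4.70 GiB allocation is consistent with a dense fp32 point-attention tensor over the full ~5128-residue pair dimension:

```python
n_res = 5128        # padded sequence length (assumed from the earlier shape error)
no_heads = 12       # IPA attention heads (standard AlphaFold value)
no_qk_points = 4    # query/key points per head (standard AlphaFold value)
bytes_fp32 = 4

# pt_att is roughly [*, N, N, heads, points] before the sum over points.
size_bytes = n_res * n_res * no_heads * no_qk_points * bytes_fp32
print(round(size_bytes / 2**30, 2))  # 4.7 -- matching the 4.70 GiB in the error
```

If this estimate is right, the allocation scales quadratically in sequence length, which is why the structure module is the first place a ~5000 aa multimer runs out of headroom.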
It goes out of memory during structure generation: multimer structure generation consumes much more memory than the monomer case. Sadly, we haven't optimized that part yet.
Monomer inference supports bf16, but multimer doesn't yet. We may support it in the future.
Looking forward to these optimizations. Overall great project!
How can I use bf16 or fp32 to reduce memory use during the GPU calculation? Thanks for your kind help!
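For a rough sense of what halving the precision would buy: bf16 and fp16 both use 2 bytes per element versus 4 for fp32, so any given activation shrinks by exactly half. A sketch for the pair representation (the 128-channel width is the standard AlphaFold value, assumed here):

```python
n_res = 5128   # approximate padded length of the complex
c_z = 128      # pair representation channels (standard AlphaFold value)

fp32_gib = n_res * n_res * c_z * 4 / 2**30
bf16_gib = n_res * n_res * c_z * 2 / 2**30

# Halving precision halves activation memory, but does not change the
# O(N^2) scaling in sequence length that dominates for long inputs.
print(round(fp32_gib, 2), round(bf16_gib, 2))
assert fp32_gib == 2 * bf16_gib
```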
I tried to predict a 5-subunit complex (~5000 aa in total) and got the errors above with various settings (1-4x A100 80GB, with and without --inplace, with and without --chunk_size 1-32). The errors seem to be associated with exceeding the GPU memory, and I am not sure whether this is expected at the given sequence length and available GPU memory. I installed FastFold from the recent commit 930a58a into a clean conda environment and built Triton from source. A smaller complex (~2000 aa) ran without errors.