semal opened this issue 1 year ago
What is the minimum GPU configuration required to train with fastfold?
Hi, training FastFold with DAP is still an experimental feature. If you drop dap_size,
DDP is used by default.
40 GB of GPU memory should be enough.
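For reference, launching with a DDP-only ColossalAI config looks roughly like the sketch below; the config dict matches the "Your Config" block printed later in the log, and everything else is illustrative:

```python
# sketch: launch ColossalAI with a DDP-only config (no dap_size / tensor parallelism),
# matching the "{'torch_ddp': {'static_graph': True}}" shown in the log below
import colossalai

colossalai.launch_from_torch(config=dict(torch_ddp=dict(static_graph=True)))
```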
I got another error when using only DDP:
[03/27/23 13:53:21] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:525 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
INFO colossalai - colossalai - INFO: Detecting the number of processes running on the same node..
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:90
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: hostname: ab01
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:92
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: Process group: <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x147f21372530>
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:93
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: Do dist.all_gather_object..
[03/27/23 13:53:21] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:525 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
INFO colossalai - colossalai - INFO: Detecting the number of processes running on the same node..
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:90
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: hostname: ab01
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:92
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: Process group: <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x14c1848e0d30>
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:93
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: Do dist.all_gather_object..
[03/27/23 13:53:24] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:95
detect_num_processes_on_current_node
[03/27/23 13:53:24] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:95
detect_num_processes_on_current_node
INFO colossalai - colossalai - INFO: hostname_list: ['ab01', 'ab01']
INFO colossalai - colossalai - INFO: hostname_list: ['ab01', 'ab01']
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:561 set_seed
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:561 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the
default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the
default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
warnings.warn(
[extension] Compiling or loading the JIT-built cpu_adam kernel during runtime now
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
warnings.warn(
Emitting ninja build file /data/personal/gongchaohui/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
[extension] Time to compile or load cpu_adam op: 0.5188064575195312 seconds
Loading extension module cpu_adam...
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
warnings.warn(
[extension] Compiling or loading the JIT-built fused_optim kernel during runtime now
True
Detected CUDA files, patching ldflags
Emitting ninja build file /data/personal/gongchaohui/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
[extension] Time to compile or load fused_optim op: 0.42969679832458496 seconds
[03/27/23 13:53:33] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:266 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'torch_ddp': {'static_graph': True}}
================================
INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:278 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark = False, deterministic = False
Loading extension module fused_optim...
[03/27/23 13:53:34] INFO colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:385 initialize
INFO colossalai - colossalai - INFO: Model is using torch.nn.parallel.DistributedDataParallel for Data Parallelism
INFO colossalai - colossalai - INFO: finetune_with_features_pkl.py:184 main
INFO colossalai - colossalai - INFO: Start training.
aatype torch.Size([384, 4]) torch.int64
aatype torch.Size([384, 4]) torch.int64
Traceback (most recent call last):
File "finetune_with_features_pkl.py", line 218, in <module>
main()
File "finetune_with_features_pkl.py", line 201, in main
output = engine(batch)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
return self.model(*args, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "finetune_with_features_pkl.py", line 60, in forward
outputs = super(AlphaFoldWithBinding, self).forward(batch)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/hub/alphafold.py", line 522, in forward
outputs, m_1_prev, z_prev, x_prev = self.iteration(
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/hub/alphafold.py", line 270, in iteration
template_embeds = self.template_embedder(
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/embedders_multimer.py", line 352, in forward
template_pair_embeddings = template_pair_embeddings + self.template_pair_stack(
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/template.py", line 377, in forward
t, = checkpoint_blocks(
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/utils/checkpointing.py", line 73, in checkpoint_blocks
return exec(blocks, args)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/utils/checkpointing.py", line 60, in exec
a = wrap(block(*a))
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/template.py", line 250, in forward
z = self.TriangleMultiplicationOutgoing(z, single_mask_row)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/ops.py", line 478, in forward
permute_final_dims(right_proj_act_rec, (2, 1, 0)),
UnboundLocalError: local variable 'right_proj_act_rec' referenced before assignment
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 85621 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 85622) of binary: /data/anaconda3/envs/B30/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/B30/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_with_features_pkl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-27_13:53:39
host : ab01
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 85622)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
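For reference, the UnboundLocalError at the end of the traceback is the classic pattern where a variable is only assigned on one branch (presumably a path my multimer inputs do not take) and is then used unconditionally; a minimal illustration, not FastFold's actual ops.py:

```python
# hypothetical illustration of the failure mode, not FastFold's actual code
def forward(z, mask, is_multimer=True):
    if not is_multimer:
        right_proj_act_rec = z * mask   # only assigned on the non-multimer branch
    # ... later the variable is used unconditionally:
    return right_proj_act_rec           # UnboundLocalError when is_multimer=True
```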
Hi, apologies for my misunderstanding. Multimer training is not supported yet.
After some code modifications, I found that multimer model training can run successfully with bfloat16 precision. However, I would still like to know whether any other issues may arise.
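For reference, a common way to get bfloat16 precision in PyTorch is to run the forward pass under autocast; a minimal sketch of that approach (illustrative only, not the exact modifications I made):

```python
# sketch: running the training step in bfloat16 via PyTorch autocast
# (illustrative; not the exact changes made to the FastFold code)
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = engine(batch)   # engine/batch as in finetune_with_features_pkl.py
```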
The thing is that neither multimer model training nor bf16 precision has been verified to work as expected. Usually the process does not raise any errors, but the results are wrong, e.g. the accuracy is very low.
Thank you for your response. I understand that both multimer model training and bf16 precision may not always work as expected and can result in low accuracy or other issues. I will take these factors into consideration and thoroughly evaluate the performance.
Thanks for your contribution. Wishing you a nice trip.
I used two 40 GB GPUs but still ran into an out-of-memory error. dap_size is 2 and the tensor parallel size is 2, with no other configuration. Do I need a more detailed configuration?
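For reference, my setup is roughly the sketch below; the way dap_size maps onto ColossalAI's tensor-parallel config is my assumption, and the key names are illustrative rather than FastFold's exact ones:

```python
# sketch of the 2-GPU setup (assumption: dap_size corresponds to ColossalAI's
# tensor-parallel size; names here are illustrative, not FastFold's exact config)
import colossalai

config = dict(
    parallel=dict(tensor=dict(size=2)),  # dap_size / tensor parallel size = 2
)
colossalai.launch_from_torch(config=config)
```

Launched with torchrun --nproc_per_node=2 finetune_with_features_pkl.py (arguments abridged).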