hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters

Multimer training and BF16 #168

Open semal opened 1 year ago

semal commented 1 year ago
[screenshot: out-of-memory error]

I used two 40GB GPUs but still hit an out-of-GPU-memory error. The dap_size is 2 and the tensor parallel size is 2, with no other configuration. Do I need a more detailed configuration?

semal commented 1 year ago

What is the minimum GPU configuration required to train with FastFold?

Gy-Lu commented 1 year ago

Hi, training FastFold with DAP is still an experimental feature. Drop dap_size and DDP will be used by default; 40GB of memory should be enough.
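
For reference, a minimal sketch of what a DDP-only setup looks like, assuming a ColossalAI-style config dict (the `{'torch_ddp': {'static_graph': True}}` echoed in the log below is exactly this case; everything else here is illustrative):

```python
# Minimal DDP-only config sketch (illustrative, not the exact FastFold
# training config): with no dap_size / tensor-parallel entries, plain
# torch DistributedDataParallel is used by default.
CONFIG = dict(
    torch_ddp=dict(static_graph=True),
)
```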

semal commented 1 year ago

I got another error when using only DDP:

[03/27/23 13:53:21] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:525 set_device            
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                   
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:109 launch                              
                    INFO     colossalai - colossalai - INFO: Detecting the number of processes running on the same node..                                                          
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:90                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: hostname: ab01                                                                                                        
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:92                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: Process group: <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x147f21372530>                                 
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:93                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: Do dist.all_gather_object..                                                                                           
[03/27/23 13:53:21] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:525 set_device            
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                   
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:109 launch                              
                    INFO     colossalai - colossalai - INFO: Detecting the number of processes running on the same node..                                                          
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:90                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: hostname: ab01                                                                                                        
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:92                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: Process group: <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x14c1848e0d30>                                 
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:93                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: Do dist.all_gather_object..                                                                                           
[03/27/23 13:53:24] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:95                        
                             detect_num_processes_on_current_node                                                                                                                  
[03/27/23 13:53:24] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:95                        
                             detect_num_processes_on_current_node                                                                                                                  
                    INFO     colossalai - colossalai - INFO: hostname_list: ['ab01', 'ab01']                                                                                       
                    INFO     colossalai - colossalai - INFO: hostname_list: ['ab01', 'ab01']                                                                                       
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:561 set_seed              
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/context/parallel_context.py:561 set_seed              
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the  
                             default parallel seed is ParallelMode.DATA.                                                                                                           
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the  
                             default parallel seed is ParallelMode.DATA.                                                                                                           
                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:117 launch                              
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1     
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
[extension] Compiling or loading the JIT-built cpu_adam kernel during runtime now
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
Emitting ninja build file /data/personal/gongchaohui/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
[extension] Time to compile or load cpu_adam op: 0.5188064575195312 seconds
Loading extension module cpu_adam...
True
/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py:95: UserWarning: [extension] The CUDA version on the system (11.4) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
[extension] Compiling or loading the JIT-built fused_optim kernel during runtime now
True
Detected CUDA files, patching ldflags
Emitting ninja build file /data/personal/gongchaohui/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
[extension] Time to compile or load fused_optim op: 0.42969679832458496 seconds
[03/27/23 13:53:33] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:266 initialize                          
                    INFO     colossalai - colossalai - INFO:                                                                                                                       
                             ========== Your Config ========                                                                                                                       
                             {'torch_ddp': {'static_graph': True}}                                                                                                                 
                             ================================                                                                                                                      

                    INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:278 initialize                          
                    INFO     colossalai - colossalai - INFO: cuDNN benchmark = False, deterministic = False                                                                        
Loading extension module fused_optim...
[03/27/23 13:53:34] INFO     colossalai - colossalai - INFO: /data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/initialize.py:385 initialize                          
                    INFO     colossalai - colossalai - INFO: Model is using torch.nn.parallel.DistributedDataParallel for Data Parallelism                                         
                    INFO     colossalai - colossalai - INFO: finetune_with_features_pkl.py:184 main                                                                                
                    INFO     colossalai - colossalai - INFO: Start training.                                                                                                       
aatype torch.Size([384, 4]) torch.int64
aatype torch.Size([384, 4]) torch.int64
Traceback (most recent call last):
  File "finetune_with_features_pkl.py", line 218, in <module>
    main()
  File "finetune_with_features_pkl.py", line 201, in main
    output = engine(batch)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
    return self.model(*args, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "finetune_with_features_pkl.py", line 60, in forward
    outputs = super(AlphaFoldWithBinding, self).forward(batch)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/hub/alphafold.py", line 522, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/hub/alphafold.py", line 270, in iteration
    template_embeds = self.template_embedder(
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/embedders_multimer.py", line 352, in forward
    template_pair_embeddings = template_pair_embeddings + self.template_pair_stack(
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/template.py", line 377, in forward
    t, = checkpoint_blocks(
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/utils/checkpointing.py", line 73, in checkpoint_blocks
    return exec(blocks, args)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/utils/checkpointing.py", line 60, in exec
    a = wrap(block(*a))
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/template.py", line 250, in forward
    z = self.TriangleMultiplicationOutgoing(z, single_mask_row)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/ops.py", line 478, in forward
    permute_final_dims(right_proj_act_rec, (2, 1, 0)),
UnboundLocalError: local variable 'right_proj_act_rec' referenced before assignment
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 85621 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 85622) of binary: /data/anaconda3/envs/B30/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/B30/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda3/envs/B30/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_with_features_pkl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-27_13:53:39
  host      : ab01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 85622)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Gy-Lu commented 1 year ago

Hi, I apologize for the misunderstanding. Multimer training is not supported yet.
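
For context, the `UnboundLocalError` in the traceback above is the usual symptom of a variable that is assigned only on one code path; a minimal illustrative sketch of the failure pattern (not FastFold's actual `ops.py`):

```python
# Illustrative only: a local assigned in just one branch raises
# UnboundLocalError when the other branch is taken -- the kind of gap a
# not-yet-supported (multimer) input can expose.
def forward(act, chunked: bool):
    if chunked:
        right_proj_act_rec = act + 1  # assigned only on this branch
    return right_proj_act_rec         # fails when chunked is False

try:
    forward(0.0, chunked=False)
except UnboundLocalError as e:
    print(e)  # local variable 'right_proj_act_rec' referenced before assignment
```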

semal commented 1 year ago

After some code modifications, I found that multimer model training can run successfully with bfloat16 precision. However, I would still like to know whether any other issues may arise.
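
(For reference, a minimal sketch of one common way to run a forward pass in bfloat16 via PyTorch autocast; this only illustrates the kind of change involved, not the exact modification:)

```python
import torch

model = torch.nn.Linear(8, 8).cuda()       # stand-in for the AlphaFold model
batch = torch.randn(4, 8, device="cuda")

# Forward in bf16 under autocast; parameters stay fp32, backward runs outside.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)
    loss = out.float().pow(2).mean()
loss.backward()
```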

Gy-Lu commented 1 year ago

The thing is that neither multimer model training nor bf16 precision has been proven to work as expected. Usually the process itself does not crash; instead the results come out wrong, e.g. the accuracy is very low.
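
As a rough illustration of why accuracy can degrade silently: bf16 keeps fp32's exponent range but only about 8 bits of mantissa, so small relative differences are rounded away without any error being raised:

```python
import torch

# bf16 carries roughly 3 significant decimal digits: a 1e-3 relative
# difference near 1.0 disappears entirely after the cast.
x = torch.tensor(1.001)
print(x.to(torch.bfloat16).item())      # 1.0
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125
```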

semal commented 1 year ago

Thank you for your response. I understand that both multimer model training and bf16 precision may not always work as expected and can result in low accuracy or other issues. I will take these factors into consideration and thoroughly evaluate the performance.

Gy-Lu commented 1 year ago

Thanks for your contribution. Wish you a nice trip.