jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions for a priori identified cancer driver genes.

Look into how to use DeepSpeed for inference instead of FSDP #93

Closed: jyaacoub closed this issue 2 months ago

jyaacoub commented 2 months ago

Relevant links:

The main issue here is dependency problems with mpi4py on Narval... We might need to create a container, since mpi4py requires a specific version of Open MPI that is not available on Narval.
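As a quick diagnostic (a minimal sketch, assuming mpi4py imports at all), the version of the MPI library that mpi4py was built against can be printed directly, which helps confirm an Open MPI mismatch:

```python
# Sanity check of the MPI stack behind mpi4py; useful for confirming
# an Open MPI version mismatch on a cluster like Narval.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. "Open MPI v4.x.y, ..."
print(MPI.Get_version())          # MPI standard version, e.g. (3, 1)
```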

jyaacoub commented 2 months ago

Other than the dependency (ModuleNotFoundError) errors AND the issues with AutoTP (Automatic Tensor Parallelism incorrectly wrapping biases for attention heads), the only other thing that needs to be adjusted for DeepSpeed to work is the following error:

AttributeError: module 'deepspeed.utils' has no attribute 'is_initialized'. Did you mean: 'initialize'?
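One possible stopgap (an untested sketch, assuming the calling code only wants torch.distributed-style "is the backend up?" semantics) is to alias the missing attribute before running inference:

```python
# Hypothetical monkey-patch: alias the missing deepspeed.utils.is_initialized
# to torch.distributed.is_initialized before the offending call. Assumes the
# caller only needs torch.distributed-style semantics; untested.
import deepspeed
import torch.distributed as dist

if not hasattr(deepspeed.utils, "is_initialized"):
    deepspeed.utils.is_initialized = dist.is_initialized
```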

jyaacoub commented 2 months ago

DeepSpeed AutoTP

It is not sustainable to hot-fix each instance of shape mismatching; I think I should switch gears and look at adjusting AutoTP so it properly recognizes which modules are safe to wrap and distribute across GPUs.

In the DeepSpeed code, AutoTP is invoked in deepspeed/inference/engine.py, inside InferenceEngine.__init__.
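For reference, a minimal sketch of the entry point that triggers AutoTP (exact kwargs vary across DeepSpeed versions; `load_model` is a hypothetical stand-in for however we load the model):

```python
# Minimal DeepSpeed inference init sketch. With
# replace_with_kernel_inject=False, DeepSpeed takes the AutoTP path that
# auto-wraps modules and shards them across GPUs.
import torch
import deepspeed

model = load_model()  # hypothetical loader for our protein model

engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # shard across 2 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=False, # False -> AutoTP wrapping path
)
model = engine.module
```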

replace_with_kernel_inject

Now testing with 2 V100s, each with only 16 GB of VRAM, we can run sequences up to 624 in length!

Compared to the optimal speedup (2x for two GPUs), replace_with_kernel_inject=True achieves 1.6x, which is 80% of optimal.

What is kernel injection?

Summarizing a ChatGPT answer: kernel injection replaces a model's standard PyTorch modules with DeepSpeed's fused, optimized CUDA inference kernels, reducing kernel-launch overhead and memory traffic during inference.
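Enabling it is just a flag flip on the same call (same sketch and caveats as above):

```python
# Same init as before, but with kernel injection enabled so DeepSpeed
# swaps in its fused CUDA inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # inject optimized kernels
)
```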

Peak memory usage with kernel injection on 2x A100s

Max memory for an A100: 40960 MiB

| seqLen | peakMem   |
| ------ | --------- |
| 872    | 26906 MiB |

Assuming memory scales linearly with sequence length, the maximum sequence length at which peak memory equals the A100's capacity is 872 × 40960 / 26906 ≈ 1327, and this would only be for 2 A100s.
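The extrapolation above, spelled out (linear scaling is an assumption, not a measurement):

```python
# Back-of-envelope: assume peak memory grows linearly with sequence length.
max_mem  = 40960   # MiB, A100 capacity
peak_mem = 26906   # MiB measured at seq_len = 872
seq_len  = 872

max_seq_len = seq_len * max_mem / peak_mem
print(int(max_seq_len))  # -> 1327
```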

jyaacoub commented 2 months ago

Using low-memory attention with chunk_size can help trade extra compute for lower memory. See https://github.com/aqlaboratory/openfold?tab=readme-ov-file#monomer-inference.
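Not OpenFold's actual implementation, but a minimal sketch of the chunked-attention idea behind chunk_size: process queries in chunks so the full L x L attention matrix is never materialized at once.

```python
# Chunked ("low-memory") attention sketch: iterate over query chunks,
# trading extra compute/launches for a smaller peak memory footprint.
import torch

def chunked_attention(q, k, v, chunk_size=256):
    # q, k, v: (batch, seq_len, dim)
    scale = q.shape[-1] ** -0.5
    out = []
    for i in range(0, q.shape[1], chunk_size):
        q_chunk = q[:, i:i + chunk_size]                          # (B, c, D)
        attn = torch.softmax((q_chunk @ k.transpose(1, 2)) * scale, dim=-1)
        out.append(attn @ v)                                      # (B, c, D)
    return torch.cat(out, dim=1)
```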