We have not extensively tested GPUs other than the A100 and H100. AlphaFold 3 can run on other devices, though the maximum sequence size is determined by the available GPU memory. The performance docs describe how to optimise memory usage on the A100 40GB using unified memory and the pair_transition_shard_spec setting. For GPUs with CUDA Compute Capability below 8 (e.g. the V100), flash attention must be disabled; GPUs with Compute Capability 8 or higher can use flash attention for significant speed improvements.
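For anyone trying this on a smaller card, here is a minimal sketch of what the unified-memory route looks like. The environment variables below are the standard JAX/XLA allocator settings and follow the example in the performance docs, and the --flash_attention_implementation flag name is taken from run_alphafold.py; both may differ between releases, and the paths are placeholders, so treat this as illustrative rather than authoritative:

```sh
# Illustrative only — docs/performance.md has the authoritative variable names
# and values; database/model paths and Docker volume mounts are omitted here.

# Standard JAX GPU allocator settings: do not grab ~all VRAM up front, and let
# XLA spill to host (unified) memory for allocations beyond the physical VRAM.
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export TF_FORCE_UNIFIED_MEMORY=true
export XLA_CLIENT_MEM_FRACTION=3.2

# On GPUs with CUDA Compute Capability < 8 (e.g. V100), also switch flash
# attention to the XLA fallback; flag name assumed from run_alphafold.py.
python run_alphafold.py \
    --json_path=fold_input.json \
    --output_dir=af_output \
    --flash_attention_implementation=xla
```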
I was able to run the test prediction on a 24GB 4090.
Could you please tell me how you modified the code or settings, and whether you ran into any issues with such limited VRAM?
> I was able to run the test prediction on a 24GB 4090.
Great news! Have you tested the impact of maximum sequence length on performance or results?
> Could you please tell me how you modified the code or settings, and whether you ran into any issues with such limited VRAM?
I did not modify the settings at all for that first prediction; it ran with the stock settings (not using unified memory).
> Great news! Have you tested the impact of maximum sequence length on performance or results?
It used about 23 of the 24 GB of VRAM, so the test file might be nearing the limit of what a 24 GB card can do, but I have not yet tested the impact of increasing the sequence length or using unified memory.
> Great news! Have you tested the impact of maximum sequence length on performance or results?
I can confirm successful predictions on both a 24GB 3090 and a 4090, and the --gpu_device=0 flag works fine.
I tried a trimer prediction and a hexamer prediction of the test file; both produced output without any errors.
I also did a prediction of 3 chains with around 800 amino acids total; it worked perfectly fine and produced the expected structure, essentially the same as the online alphafoldserver output.
All predictions used around 23 GB of VRAM (with maybe an increase of a few hundred MiB between them), so VRAM usage is not a good predictor of how much longer the sequence can be. These were all done with stock settings (not using unified memory). Overall it looks like it works well based on these few test cases.
I was also able to run the test prediction on a 24GB 4090. It took about 7-8 minutes using the default settings. Are there any changes I can make to the settings to accelerate the prediction process?
> I was also able to run the test prediction on a 24GB 4090. It took about 7-8 minutes using the default settings. Are there any changes I can make to the settings to accelerate the prediction process?
It took my computer about 6 mins for Jackhmmer, 3.5 mins for Hmmsearch, and about 90 seconds for inference on the 4090. It was also power limited to 300W.
> I also did a prediction of 3 chains with around 800 amino acids total; it worked perfectly fine and produced the expected structure, essentially the same as the online alphafoldserver output.
I was also able to get an output for a triplet of the ~800-amino-acid input I mentioned earlier (~2,400 amino acids total), again without errors.
It is fantastic to hear it is working and running with decent speed on GPUs with less RAM! Thanks for confirming.
> All predictions used around 23 GB of VRAM
Assuming you are running inside the docker container or following the instructions in the supplied Dockerfile, this is most likely because XLA is configured to preallocate most of the device memory at the start: https://github.com/google-deepmind/alphafold3/blob/main/docker/Dockerfile#L54
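If you want nvidia-smi to report the model's actual working set rather than XLA's up-front reservation, one option is to override that behaviour when starting the container. A sketch, assuming the image tag and invocation from the repo's Docker instructions (volume mounts and paths omitted), using the standard JAX allocator variables:

```sh
# Sketch only: disable XLA's preallocation so reported GPU memory reflects real
# usage. The on-demand "platform" allocator is slower, so use it for profiling,
# not for production runs.
docker run -it --gpus all \
    -e XLA_PYTHON_CLIENT_PREALLOCATE=false \
    -e XLA_PYTHON_CLIENT_ALLOCATOR=platform \
    alphafold3 \
    python run_alphafold.py --json_path=fold_input.json --output_dir=af_output
```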
Also runs well on an RTX 3090 (driver version 560.35.03, CUDA version 12.6). AlphaFold test protein, 2 × 298 residues (homodimer): "Running model inference for seed 1 took 99.57 seconds."
"Is it possible to deploy this on a 24G GPU? The documentation mentions a requirement for a 80G A100 GPU, but are there ways to run it on a 24G card instead? Perhaps by trading off some performance and using a pipelined approach to load weights progressively?"