kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

Taking a long time for predictions + more output files than expected #41

Closed kaymccoy closed 1 year ago

kaymccoy commented 2 years ago

I've set up alphafold_non_docker and it appears to be running properly, but my test runs have taken over 5 days without finishing, so I suspect something is going wrong - hopefully someone will have an idea! One is a multimer run of a relatively small antibody variable region (129 H residues, 108 L residues), and the other is a single-chain antigen of ~500 residues. I'm running on nodes with K80 GPUs and 125 GB of memory. As an example, I'll attach the output of the antibody run to this post rather than copying it over (since it's quite long).

The corresponding command for the antibody run is:

```
./run_alphafold.sh -d /dartfs/rc/lab/G/Grigoryanlab/library/AlphaFoldEtc/alphafold_DBs/ -o /dartfs/rc/lab/G/Grigoryanlab/home/coy/Dartmouth_PhD_Repo/antibodyTestMC4/ -f /dartfs/rc/lab/G/Grigoryanlab/library/AlphaFoldEtc/antibodyTestMC.fasta -t 2021-10-04 -m multimer
```

My best guess as to why it's taking so long is that it's calling model.py more times than it's supposed to. As far as I can tell, the AlphaFold README indicates there should be 5 models in the output (one from each seed), but my output suggests model.py has been called 23 times already, and judging by the pattern of models being produced there will probably be 25 models in total when it finishes. Here's an `ls` of my output directory for the antibody test:

```
features.pkl
msas/
ranking_debug.json
timings.json
ranked_{0..24}.pdb
result_model_{1..5}_multimer_v2_pred_{0..4}.pkl
unrelaxed_model_{1..5}_multimer_v2_pred_{0..4}.pdb
relaxed_model_{1..5}_multimer_v2_pred_{0..4}.pdb
```

As you can see, the pattern for the output files is like `relaxed_model_{1-5}_multimer_v2_pred_{0-4}.pdb`. I'm not sure what the two sets of numbers indicate; I'd assume one of them marks models that come from the same starting seed, but I'm not sure what the other set would indicate, or which of the two positions is the one that marks a shared seed. Apologies if this is documented somewhere and I've missed it! Thanks so much for any help on how to make this run faster, and on whether the output is correct.

EDIT: I've since tried running the antibody test with CPUs only (no GPUs, i.e. with the `-e false -g false` flags appended to the command above) and it takes ~16 hours. The GPU test recently finished and took 6 days in total! I was able to request 200 GB of memory for the CPU test but only 125 GB for the GPU test, which might indicate memory is the limiting factor. I also updated the `ls` of the output dir above to include the final output files.

EDIT2: Large complexes (~2000 aa) take a very long time on CPU - about a month - and even longer on GPU. Even with the same amount of memory, the GPU runs take longer than the CPU runs.

alonmillet commented 2 years ago

I can't speak to your runtimes, but the pattern you're seeing of five outputs per model parameter set comes from the five seeds the multimer system runs per model by default. See the following from the original, dockerized AlphaFold GitHub:

> By default the multimer system will run 5 seeds per model (25 total predictions); for a small drop in accuracy you may wish to run a single seed per model. This can be done via the `--num_multimer_predictions_per_model` flag, e.g. set it to `--num_multimer_predictions_per_model=1` to run a single seed per model.

> `ranked_*.pdb` – A PDB format text file containing the relaxed predicted structures, after reordering by model confidence. Here `ranked_0.pdb` should contain the prediction with the highest confidence, and `ranked_4.pdb` the prediction with the lowest confidence. To rank model confidence, we use predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details).
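If you want to trace which relaxed model each `ranked_*.pdb` came from, `ranking_debug.json` in the output directory records the ordering. A minimal sketch, assuming a multimer run where the ranking key is `"iptm+ptm"` (monomer runs use `"plddts"` instead); the dictionary below stands in for the real file and its scores are made up:

```python
# Hypothetical ranking_debug.json contents (scores invented for illustration);
# in a real run you would load it with json.load(open("ranking_debug.json")).
ranking = {
    "iptm+ptm": {
        "model_1_multimer_v2_pred_0": 0.91,
        "model_2_multimer_v2_pred_3": 0.88,
    },
    # "order" lists prediction names from highest to lowest confidence;
    # ranked_0.pdb is the relaxed structure for order[0], and so on.
    "order": ["model_1_multimer_v2_pred_0", "model_2_multimer_v2_pred_3"],
}

# Map each ranked_*.pdb back to the relaxed model it was copied from.
mapping = {
    f"ranked_{i}.pdb": f"relaxed_{name}.pdb"
    for i, name in enumerate(ranking["order"])
}
print(mapping["ranked_0.pdb"])  # relaxed_model_1_multimer_v2_pred_0.pdb
```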

You can also read more about this in the supplemental data in the AlphaFold paper in Nature.
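To make the file naming concrete, here is a short sketch (filenames constructed locally, not read from disk) of how I read the two numbers: the first is the model parameter set (1-5) and the second is the per-model prediction index, i.e. the seed (0-4), giving 5 x 5 = 25 predictions by default:

```python
import re

# Assumed interpretation of the multimer output naming: the first number is
# the model parameter set (1-5); the second is the prediction/seed index
# (0-4), whose count is set by --num_multimer_predictions_per_model.
pattern = re.compile(r"result_model_(\d)_multimer_v2_pred_(\d)\.pkl")

names = [
    f"result_model_{m}_multimer_v2_pred_{p}.pkl"
    for m in range(1, 6)   # 5 model parameter sets
    for p in range(5)      # 5 seeds per model (the default)
]

# Parse each name back into its (model, prediction) pair.
parsed = [tuple(map(int, pattern.match(n).groups())) for n in names]
print(len(parsed))  # 25 predictions: 5 models x 5 seeds
```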

sanjaysrikakulam commented 2 years ago

Hi @kaymccoy

Sorry for the delayed response. As @alonmillet commented, I hope the number-of-models question is clear to you now. The amount of time the predictions are taking is odd, though, and unfortunately I don't have a good guess as to the cause. Maybe create a ticket on the AF2 GitHub repo; the authors of the tool may be able to provide more details and help you troubleshoot the issue.

kaymccoy commented 2 years ago

That makes sense - thanks so much for the explanation on the models. I've contacted some people who do IT for our cluster and they're looking into why the GPU jobs run so slowly, so hopefully we'll get some results there! I'll update this if they find anything.