google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

ValueError: The number of positions must match the number of atoms #246

Closed aldrinlugena closed 2 years ago

aldrinlugena commented 2 years ago

Hi. Alphafold is a fantastic tool! Thank you so much for this.

I encountered this error (ValueError: The number of positions must match the number of atoms) with one specific protein when running in multimer mode. The same protein ran without problems in monomer mode, and other proteins ran without problems in multimer mode. I only get the error when this specific protein is included in a multimer run. Have you encountered this type of error? If so, what solutions can you recommend? I can share the details of my analysis if needed. Thank you.

Rampakslue commented 2 years ago

Hi! I get the same errors sometimes, just wanted to add my comment to bump the issue.

This is the traceback I get from the error:

Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 427, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 403, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 237, in predict_structure
    relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
  File "/app/alphafold/alphafold/relax/relax.py", line 73, in process
    min_pdb = utils.overwrite_pdb_coordinates(pdb_str, min_pos)
  File "/app/alphafold/alphafold/relax/utils.py", line 29, in overwrite_pdb_coordinates
    openmm_app.PDBFile.writeFile(topology, pos, f)
  File "/opt/conda/lib/python3.8/site-packages/simtk/openmm/app/pdbfile.py", line 283, in writeFile
    PDBFile.writeModel(topology, positions, file, keepIds=keepIds, extraParticleIdentifier=extraParticleIdentifier)
  File "/opt/conda/lib/python3.8/site-packages/simtk/openmm/app/pdbfile.py", line 331, in writeModel
    raise ValueError('The number of positions must match the number of atoms')
ValueError: The number of positions must match the number of atoms

I have hit this error once before, but the curious part is that the second occurrence was for a multimer run that I had already completed successfully. I was re-running it as part of a validation/testing procedure. The latest error occurred 19 hours and 23 minutes after the start of the job. To be clear: I have run this complex in multimer mode 3 times. The second run completed successfully after 26 hours, but the first and third runs failed with this error.

I have attached the input fasta file, status log, as well as the slurm file used to initiate the job: Fasta File: WCC.txt Output log: slurm-1796751.txt Slurm file: WCC_slurm.txt

I don't know whether this matters or is related to the errors, but @aldrinlugena and I are from the same university and run our jobs on the same HPRC clusters.

arashnh11 commented 2 years ago

Can you remove everything after ">protein_name" in each FASTA header and rerun? We had this issue with some multimers whose headers contained SwissProt identifiers; they failed after several hours.
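
For anyone who wants to try this, here is a minimal sketch of one way to read the suggestion above: keep only the first whitespace-delimited token of each FASTA header and drop the rest. The function name is made up for illustration and is not part of AlphaFold, and nothing here has been confirmed to fix the error.

```python
def strip_header_descriptions(fasta_text):
    """Keep only the first token of each '>' header line (hypothetical helper)."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            # Fall back to a generic name if the header is empty.
            name = tokens[0] if tokens else "chain"
            cleaned.append(">" + name)
        else:
            # Sequence lines are passed through unchanged.
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"
```

Write the result to a new FASTA file and pass that file to --fasta_paths, so the original stays untouched.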

Rampakslue commented 2 years ago

The first run was done on AlphaFold v2.1.0, and for that one I had tried >chain_0 and >chain_1 as headers. That run failed, whereas the second run, with the full FASTA identifiers, succeeded. But this third run, again with the full FASTA identifiers, failed. (This was run using AlphaFold v2.1.1.)

I also had another multimer job fail with exactly the same error. DROME_CLOCK fasta: Drome Clock - Copy.txt But a different multimer job that started at the same time succeeded. HUMAN_CLOCK fasta: Human Clock - Copy.txt

Also, I thought that AlphaFold v2.1.1 had addressed this issue?

YaoYinYing commented 2 years ago

I got the same problem and solved it by reordering the subunits in the full FASTA file. Hmm, it's really weird. Originally I put the longest one (over 1k aa) first, as chain A, and now it is the last chain. No idea why that helps.
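
A minimal sketch of the reordering described above, assuming a plain multi-record FASTA file: it moves the longest chain to the end and leaves the other chains in their original order. Both function names are hypothetical, and nobody in this thread has explained why the order would matter, so treat this as an experiment and not a fix.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def move_longest_last(text):
    """Rewrite FASTA text so the longest sequence becomes the final record."""
    records = parse_fasta(text)
    if not records:
        return ""
    # On ties, max() picks the first of the longest sequences.
    longest = max(range(len(records)), key=lambda i: len(records[i][1]))
    reordered = records[:longest] + records[longest + 1:] + [records[longest]]
    return "".join(f">{header}\n{seq}\n" for header, seq in reordered)
```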

jimkwon commented 2 years ago

I have the same error.... Any updates?

cuzy-zhi commented 2 years ago

I encountered the same problem. I use AlphaFold2-multimer to predict a multimer composed of 8 identical monomers. In my case, it failed the first time with 'the number of positions must match the number of atoms', and it failed again with the same error message even after I renamed the sequences to chain_A, chain_B, chain_C, etc.

Could you give any advice on this problem? Thanks a lot.

simone-pignotti commented 2 years ago

Same issue with v2.1.1 (docker image built on the very same AWS instance), folding a homo3mer with the following command:

python3 /data/alphafold/docker/run_docker.py --fasta_paths=test_lambda_homo3mer.fa --data_dir=/data/alphafold_data --model_preset=multimer --is_prokaryote_list=true --db_preset=full_dbs --output_dir /data/test_af/predictions_2 --max_template_date 2021-12-07 --log_dir /data/test_af/logs_2 --use_precomputed_msas --use_cprofile_for_profiling --profile_file test_lambda_homo3mer_4.prof &>test_lambda_homo3mer_4.log

I am attaching the input fasta and the logs from the run. Hope that helps! test_lambda_homo3mer_4.log test_lambda_homo3mer.fa.txt

MDuot commented 2 years ago

Hello,

I use AF2 2.1.1 with a non-Docker setup (https://github.com/kalininalab/alphafold_non_docker). After multiple tests, I can't finish the folding when the combined length of the 2 proteins is more than 1k aa, even after changing their names or their order in the fasta file. Has anyone figured out the source of this problem? Thanks.

yamule commented 2 years ago

It looks like Einit is quite large and the run had problems in the Amber minimization steps. Can you find unrelaxed_modelX.pdb in your output directory? The models may have many atom clashes. Possibly they hit the problem discussed here: https://twitter.com/sokrypton/status/1457639018141728770 and https://github.com/deepmind/alphafold/issues/236
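
To get a rough idea of how clashed an unrelaxed model is, something like the sketch below could be run on the unrelaxed PDB text. It counts pairs of heavy atoms closer than a cutoff, skipping atoms in the same or adjacent residues of one chain so that normal covalent bonds are not counted. The 2.0 Å cutoff and the function names are assumptions made for illustration (this is not AlphaFold's own clash measure), the parser assumes standard fixed-column ATOM records, and the pure-Python double loop will be slow on large complexes.

```python
import math


def read_heavy_atoms(pdb_text):
    """Return (chain_id, residue_number, (x, y, z)) for each non-hydrogen ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        element = line[76:78].strip()
        name = line[12:16].strip()
        # Skip hydrogens; fall back to the atom name if the element column is blank.
        if element == "H" or (not element and name.startswith("H")):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((line[21], int(line[22:26]), xyz))
    return atoms


def count_clashes(pdb_text, cutoff=2.0):
    """Count heavy-atom pairs closer than `cutoff` angstroms (cutoff is an arbitrary choice)."""
    atoms = read_heavy_atoms(pdb_text)
    clashes = 0
    for i in range(len(atoms)):
        chain_i, res_i, xyz_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            chain_j, res_j, xyz_j = atoms[j]
            # Same or neighbouring residues in one chain are bonded, not clashing.
            if chain_i == chain_j and abs(res_i - res_j) <= 1:
                continue
            if math.dist(xyz_i, xyz_j) < cutoff:
                clashes += 1
    return clashes
```

Comparing the count for a model that relaxed fine against one that failed would show whether the failing models really are more clashed.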

MDuot commented 2 years ago

Thanks for your reply @yamule. I also think this problem comes from the Amber minimization steps: each time, AF2 stops just after creating an unrelaxed-modelX.pdb. But I don't think it is the problem described in (https://twitter.com/sokrypton/status/1457639018141728770). If I understand correctly, his error comes from an MSA when the sequences aren't correctly padded (without a linker between the 2 sequences). But AF v2.1.1 doesn't need any linker because it creates one MSA for each protein, and I get this error on homodimers as well as heterodimers, so I don't think the problem comes from a duplication of the MSA. Do you think there is some kind of length limit for AF2 multimer? I have never encountered this error with proteins under 300 aa.

yamule commented 2 years ago

Hmm...

If I understand correctly, his error comes from an MSA when the sequences aren't correctly padded (without a linker between the 2 sequences).

As I understand it, the problem he is describing comes from the "complete covariation" of amino acids. Say you concatenate identical MSAs for homomer sequences like

AAAAAAAAAA
ATAAAAAASA
+
AAAAAAAAAA
ATAAAAAASA
->
AAAAAAAAAAAAAAAAAAAA
ATAAAAAASAATAAAAAASA

then both the 'T' and 'S' positions covary completely. (Described as OFTEN FAIL in his tweet, and I think AF2-multimer simply concatenates MSAs like the above, too.) In theory, protein structure prediction programs like AF come to treat covarying amino acids as interacting, so they may conclude that the first 'T' interacts not only with the first 'S' but also with the second 'T' and the second 'S'.

But this happens only for homomers, so if the error also occurs for heterodimers, there may be other problems.
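
The toy MSA above can be checked with a few lines of Python. This is only an illustration of the covariation argument, not how AF2-multimer builds or pairs its MSAs. Two columns are treated here as "completely covarying" when they vary and split the rows into the same groups; the function names are made up for this sketch.

```python
def concatenate_msas(msa_a, msa_b):
    """Naively join row i of msa_a with row i of msa_b."""
    return [row_a + row_b for row_a, row_b in zip(msa_a, msa_b)]


def column_pattern(column):
    """Encode a column by the order in which residues first appear, e.g. ('A', 'T') -> (0, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(res, len(first_seen)) for res in column)


def perfectly_covarying_pairs(msa):
    """Return column index pairs (i, j), i < j, that vary and share the same pattern."""
    columns = list(zip(*msa))
    patterns = [column_pattern(col) for col in columns]
    pairs = []
    for i in range(len(columns)):
        if len(set(columns[i])) < 2:
            continue  # constant columns carry no covariation signal
        for j in range(i + 1, len(columns)):
            if patterns[i] == patterns[j]:
                pairs.append((i, j))
    return pairs


msa = ["AAAAAAAAAA", "ATAAAAAASA"]
print(perfectly_covarying_pairs(msa))
print(perfectly_covarying_pairs(concatenate_msas(msa, msa)))
```

For the single MSA only the T/S pair (1, 8) covaries. After concatenation, columns 1, 8, 11 and 18 all covary with each other, which is the ambiguity described above: the first 'T' looks coupled to the first 'S', the second 'T' and the second 'S'.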

Do you think there is some kind of length limit for AF2 multimer? I have never encountered this error with proteins under 300 aa.

I have no idea, but AF2 is said to have been trained on proteins cropped to 384 aa, so it may be able to handle such short proteins correctly.

(Edit, 1/11: Sorry, I had misunderstood what MDuot meant. In summary, "the sequences aren't correctly padded" describes what the official AF2-multimer does. Sequences that are not paired are "correctly padded", but in the homomer case most sequences are paired, so they are "not correctly padded", I think.)

davidyanglee commented 2 years ago

Interesting. When I got the same error, one unrelaxed model had already finished, and when I opened it in PyMOL, it started with chain B. I tried to color chain A, but to no avail. I cannot see what happened to chain A; it is nowhere to be found.

yamule commented 2 years ago

I think AF2's output always starts from chain B, because the chain index starts at "1", not "0". https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L144

https://github.com/deepmind/alphafold/issues/251

(Update: Sorry, this is true only for the unrelaxed* files, or possibly for all files when the --norun_relax option is used.)
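
As a toy illustration of why a 1-based chain index would make the first chain come out as "B": if chain indices are mapped to letters by indexing into a string of chain IDs, index 0 is "A" and index 1 is "B". The string and function below are assumptions written for this example, not code copied from AlphaFold.

```python
import string

# Assumed lookup table of one-character chain IDs: A-Z, then a-z, then 0-9.
PDB_CHAIN_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def chain_letters(num_chains, first_index):
    """Return the chain IDs obtained when chain indices start at `first_index`."""
    return [PDB_CHAIN_IDS[i] for i in range(first_index, first_index + num_chains)]


print(chain_letters(3, first_index=0))  # indices 0, 1, 2
print(chain_letters(3, first_index=1))  # indices 1, 2, 3
```

With a starting index of 1 no chain is ever labelled "A", which would match a viewer showing chains starting at B and finding nothing under chain A.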

yamule commented 2 years ago

On the technical side, this error could be caused by disulfide bonds being newly created or removed around here: https://github.com/deepmind/alphafold/blob/9c4ac8a92125942f73813649d9f6885532c1ee97/alphafold/relax/amber_minimize.py#L484 Because the hydrogens and the CONECT record information are discarded, disulfide bonds are reassigned by "clean_protein" in L486. (This means the error is more likely to occur in predictions with many clashes.)

Hence, the number of atoms can be different after this function returns. https://github.com/deepmind/alphafold/blob/9c4ac8a92125942f73813649d9f6885532c1ee97/alphafold/relax/relax.py#L61
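
A minimal sketch of the mismatch being described, assuming a single-model PDB string: count the atom records in the PDB text and compare that with the number of minimized positions before trying to write them back. The helper names are hypothetical and this is not AlphaFold or OpenMM code; it only makes the failing condition visible earlier, with both counts in the message.

```python
def count_pdb_atoms(pdb_text):
    """Count ATOM and HETATM records in a single-model PDB string."""
    return sum(
        1 for line in pdb_text.splitlines() if line.startswith(("ATOM", "HETATM"))
    )


def check_positions_match(pdb_text, positions):
    """Raise ValueError with both counts if positions and atom records disagree."""
    n_atoms = count_pdb_atoms(pdb_text)
    if len(positions) != n_atoms:
        raise ValueError(
            f"{len(positions)} positions but {n_atoms} atom records in the PDB string"
        )
    return n_atoms
```

If the two counts differ by a small number, that would fit the idea that a hydrogen was gained or lost when a disulfide bond was reassigned.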

berjeh commented 2 years ago

Hi all, @cuzy-zhi, has anyone found a solution to this problem? I have a pentamer. Predicting the monomer, dimer and trimer works, but for the tetramer and pentamer I get this error.

Augustin-Zidek commented 2 years ago

Could you try with AlphaFold v2.2.0, which significantly reduced the number of clashes for multimer predictions (and hence makes the job easier for the relaxation stage)?

berjeh commented 2 years ago

Could you try with AlphaFold v2.2.0 which significantly reduced the number of clashes for multimer predictions (and hence makes the job easier for the relaxation stage)?

Hi Augustin, yes after the update AlphaFold works fine. At least no errors so far. Thank you.

Augustin-Zidek commented 2 years ago

Closing this now as updating to AlphaFold v2.2.0 seems to fix this issue.