marrink-lab / vermouth-martinize

Describe and apply transformation on molecular structures and topologies
Apache License 2.0

The sequence length does not match the number of residues #538

Open DoubleSheep2 opened 10 months ago

DoubleSheep2 commented 10 months ago

I'm trying to create a coarse-grained model for a virus with an atomic model consisting of 60 chains, each having around 500 amino acids. When I use the following command for coarse-graining,

martinize2 -f particle_forCMD_minimize.pdb -o particle_forCMD_minimize.top -x particle_forCMD_minimize_cg.pdb -ff martini3001 -p backbone -maxwarn 1 -mutate HSD:HIS -mutate HSP:HIH -dssp /home/emuser/miniconda2/envs/dssp/bin/mkdssp

I get an error saying, "The sequence length does not match the number of residues. The sequence has 476 elements for 477 residues." This error occurs during the dssp step. I believe this error isn't related to the input model because when I reduced the number of chains, the command worked fine. How can I resolve this issue?

pckroon commented 10 months ago

How many residues do you have in your system exactly? How many residues does DSSP find/annotate if you run it on particle_forCMD_minimize.pdb? If this doesn't shed light, try running with -v. This will preserve any intermediate files, such as the one that we feed to dssp.
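For the first question, a quick residue count per chain can be sketched as follows. This is only an illustration, assuming a standard fixed-column PDB file with a unique chain ID per chain; the function name `count_residues` is hypothetical and not part of martinize2 or DSSP.

```python
def count_residues(pdb_lines):
    """Count residues per chain ID from ATOM/HETATM records.

    Assumes fixed-column PDB records: residue name in columns 18-20,
    chain ID in column 22, residue number plus insertion code in 23-27.
    """
    counts = {}
    previous = None
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain = line[21]
        residue = (chain, line[17:20], line[22:27])
        # A new residue starts whenever the identifying fields change.
        if residue != previous:
            counts[chain] = counts.get(chain, 0) + 1
            previous = residue
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_residues.py particle_forCMD_minimize.pdb
    with open(sys.argv[1]) as handle:
        for chain, n_residues in count_residues(handle).items():
            print(chain, n_residues)
```

Comparing these counts with the number of residues DSSP annotates for the same file should show which chain the mismatch comes from.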

DoubleSheep2 commented 10 months ago

The entire virus consists of 60 identical capsid protein monomers, each chain containing 477 amino acids (residues 129-605), for a total of 28,620 amino acids. In debug mode, I inspected the last intermediate dssp PDB file (the 34th) before the error. It looks quite unusual: chains 1-33 start at position 129 and end at 605, whereas this chain (the 34th) starts at position 544, runs up to 605, then resets and starts again from position 129. The order of the chains in the PDB file does not affect the occurrence of the error when processing the 34th chain. Hence, could this be due to the large system size causing the program to encounter issues similar to stack overflow problems?

Running environment: mkdssp v3.0.0 (conda) and martinize2 v0.9.3 (conda).

pckroon commented 10 months ago

> The order of the chains in the PDB file does not affect the occurrence of the error when processing the 34th chain. Hence, could this be due to the large system size causing the program to encounter issues similar to stack overflow problems?

No, I don't think so. If it were, I think the errors would show up differently.

Does it work if you remove the afflicted/suspect chain from your input file? Does it have missing atoms in critical spots? What does the dssp output look like if you feed that specific DSSP input file to it?

DoubleSheep2 commented 10 months ago

I think I've identified the cause of the error, which seems to be related to atom serial numbers. Due to limitations of the PDB format, atom serial numbers can't go beyond 99999, and my system has a total of 230,000 atoms. When I set all atom serial numbers to 99999, dssp threw an error while processing the 7th chain. However, when I numbered the atoms cyclically from 1 to 99999, the error occurred while processing the 42nd chain. Is there a preprocessing approach that can be used for larger systems?

pckroon commented 10 months ago

Hmm, I know for sure we've seen this issue before, but I can't remember the fix/workaround. What do the atom numbers look like for the 42nd chain? It may be a reasonably quick solution to renumber the atoms when writing the PDB for dssp.

DoubleSheep2 commented 10 months ago

Thank you so much for your assistance. I tried numbering each chain's atoms starting from 1, and it resolved the issue. Even in a system with 140 chains, no errors occurred. Hopefully, this solution can help others as well.
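For anyone who wants to try the same workaround, here is a minimal sketch of the per-chain renumbering. It is an illustration rather than the exact script used: the function name `renumber_per_chain` is hypothetical, it assumes fixed-column PDB records, and it treats a change of chain ID or a TER record as the start of a new chain. TER serial numbers and CONECT records are left untouched, so any CONECT records will no longer match after renumbering.

```python
def renumber_per_chain(pdb_lines):
    """Restart atom serial numbers (columns 7-11) at 1 for every chain."""
    renumbered = []
    serial = 0
    previous_chain = None
    for line in pdb_lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            chain = line[21]
            if chain != previous_chain:
                # New chain: restart the counter.
                serial = 0
                previous_chain = chain
            serial += 1
            if serial > 99999:
                raise ValueError("chain has more atoms than the serial field can hold")
            # Replace only the 5-character serial field; keep the rest as-is.
            line = f"{line[:6]}{serial:>5d}{line[11:]}"
        elif record == "TER":
            # The next atom starts a new chain, even if the chain ID repeats.
            previous_chain = None
        renumbered.append(line)
    return renumbered


if __name__ == "__main__":
    import sys

    # Usage: python renumber.py input.pdb output.pdb
    with open(sys.argv[1]) as source:
        result = renumber_per_chain(source)
    with open(sys.argv[2], "w") as target:
        target.writelines(result)
```

The renumbered file can then be passed to martinize2 with -f in place of the original.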

pckroon commented 10 months ago

Thanks for confirming that fixes it (and I'm happy you found a workaround). I'll put it on the list to have the DSSP processor renumber atoms before writing the dssp input pdb files.