prody / ProDy

A Python Package for Protein Dynamics Analysis
http://prody.csb.pitt.edu
Other
432 stars 157 forks source link

Evol application. ProDy v 1.10.7 #772

Closed gtzotzos closed 3 years ago

gtzotzos commented 5 years ago

With reference to the instructions given in: http://prody.csb.pitt.edu/tutorials/evol_tutorial/evolapps.html

please note that the command: evol search 2W5IB results in the following error message

@> 2w5i downloaded (2w5i.pdb.gz) @> PDB download via FTP completed (1 downloaded, 0 failed). @> WARNING A UniProt ID code for PDB '2W5IB' could not be parsed. @> Retrieving Pfam search results: https://pfam.xfam.org/protein/2W5IB?output=xml @> Pfam search completed in 4.24s. Traceback (most recent call last): File "/anaconda3/lib/python3.6/site-packages/prody/apps/apptools.py", line 281, in callback func(*args, ns.dict) File "/anaconda3/lib/python3.6/site-packages/prody/apps/evol_apps/evol_search.py", line 90, in evol_search pfam_results = prody.searchPfam(query, kwargs) File "/anaconda3/lib/python3.6/site-packages/prody/database/pfam.py", line 232, in searchPfam results = dictElement(root[0], key) IndexError: child index out of range An exception occurred when executing your command. If this is not a user error, please report it to ProDy developers at: https://bitbucket.org/abakan/prody/issues Error: child index out of range

jamesmkrieger commented 5 years ago

What happened with this issue? Did it go away with upgrading to ProDy 1.10.8?

gtzotzos commented 5 years ago

Thank you James

I’m using ProDy 1.10.8

evol -v ProDy 1.10.8

A recurrent problem is that evol cannot parse PDB files.

Below I’m copying the results of one of my attempts.

A0A1U7F4W2_ANOGA corresponds to PDB structure 3N7H

evol fetch PF01395 @> Pfam MSA for PF01395 is written as PF01395_full.slx.

evol refine PF01395_full.slx -l A0A1U7F4W2_ANOGA -s 0.98 -r 0.8 @> 3084 sequence(s) with 448 residues were parsed in 0.03s. @> Label refinement reduced number of columns from 448 to 109 in 0.00s. @> Row occupancy refinement reduced number of rows from 3084 to 2554 in 0.00s. @> Sequence identity refinement reduced number of rows from 2554 to 2169 in 0.85s. @> Refined MSA is written in file: PF01395_full_refined.slx

evol conserv PF01395_full_refined.slx -S -F png @> 2169 sequence(s) with 109 residues were parsed in 0.02s. @> Cannot parse label start, end values, Setting resnums 1 to 109

evol coevol PF01395_full_refined.slx -S -F png -c apc @> 2169 sequence(s) with 109 residues were parsed in 0.02s. @> Mutual information matrix was calculated in 0.07s. @> Cannot parse label start, end values, Setting resnums 1 to 109 @> Applying 'apc' correction. @> Cannot parse label start, end values, Setting resnums 1 to 109

evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -z -n 20 -p 3N7HA.pdb @> Could not parse PDB, ignoring PDB input @> MSA or PDB not given or does not match mutinfo, using serial indexing @> Residue numbers start and end with 1-109 @> zscore normalization applied such that each column has 0 mean and standard deviation 1

Please note that I get the same result using 3N7HA.pdb or 3N7H_A.pdb, or 3N7H after the -p flag.

On 26 Jul 2019, at 00:45, James Krieger notifications@github.com wrote:

What happened with this issue? Did it go away with upgrading to ProDy 1.10.8?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/prody/ProDy/issues/772?email_source=notifications&email_token=ABXCGKMCMRS7EJ3HB567EXDQBI3IZA5CNFSM4GCVZVX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD23CRDA#issuecomment-515254412, or mute the thread https://github.com/notifications/unsubscribe-auth/ABXCGKNKNFN2WTHZEBOXLMLQBI3IZANCNFSM4GCVZVXQ.

jamesmkrieger commented 4 years ago

Again, the problem is that there are no such PDB files.

jamesmkrieger commented 4 years ago

In this case, we also have an issue with aligning that PDB structure to the MSA. I showed an example of how to sort that in my tutorial update and here's another.

$ ipython
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *
   ...: import numpy as np

In [2]: pdb = parsePDB('3n7h')
@> PDB file is found in working directory (3n7h.pdb).
@> 2459 atoms and 1 coordinate set(s) were parsed in 0.04s.

In [3]: msa = parseMSA('PF01395_full_refined.slx')
@> 2284 sequence(s) with 109 residues were parsed in 0.02s.

In [4]: aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa, label='A0A1U7F4W2_ANOGA')
   ...: showAlignment(aln)
3n7h            DTTPRRDAEYPPPELLEALKPLHDICLGKTGVTEEAIKKFSDEEIHEDEKLKCYMNCLFH
A0A1U7F4W2_ANOG --------EYPPPELLEALKPLHDICLGKTGVTEEAIKKFSDEEIHEDEKLKCYMNCLFH

3n7h            EAKVVDDNGDVHLEKLHDSLPSSMHDIAMHMGKRCLYPEGETLCDKAFWLHKCWKQSDPK
A0A1U7F4W2_ANOG EAKVVDDNGDVHLEKLHDSLPSSMHDIAMHMGKRCLYPEGETLCDKAFWLHKCWKQS---

3n7h            HYFLV
A0A1U7F4W2_ANOG -----

In [5]: u, i = np.unique(idx_2, return_index=True)
   ...: rel_i = i[1:]
   ...: rel_i
Out[5]: 
array([  8,   9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,
        21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,
        34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,
        47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,
        60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,
        73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,
        86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,
        99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
       112, 113, 114, 115, 116], dtype=int64)

In [6]: resnums_str = ' '.join([str(x) for x in idx_1[rel_i]])
   ...: resnums_str
Out[6]: '9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117'

In [7]: chA_ca = pdb.select('chain A and ca and resnum ' + resnums_str)
   ...: chA_ca.getSequence()
Out[7]: 'EYPPPELLEALKPLHDICLGKTGVTEEAIKKFSDEEIHEDEKLKCYMNCLFHEAKVVDDNGDVHLEKLHDSLPSSMHDIAMHMGKRCLYPEGETLCDKAFWLHKCWKQS'

In [8]: writePDB('3N7HA_9-117', chA_ca)
Out[8]: '3N7HA_9-117.pdb'

This then gives us the following:

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -z -n 20 -p 3N7HA_9-117.pdb
@> 109 atoms and 1 coordinate set(s) were parsed in 0.00s.
@> Residue numbers will be based on pdb: 3N7HA_9-117
@> Residue numbers start and end with 9-117
@> zscore normalization applied such that each column has 0 mean and standard deviation 1
@> use-dist not set, using sequence separation to report coevolving pairs
$ head PF01395_full_refined_mutinfo_corr_apc_rankorder.txt
Label: 3N7HA_9-117      Residue Numbers: 9-117  Sequence Separation:5
Serial  Row     Column  Zscore  Distance
1       104     53      9.495   4.48
2       61      29      3.672   11.09
3       110     71      3.292   8.95
4       89      77      3.240   8.11
5       65      59      3.233   6.21
6       93      73      3.208   10.56
7       79      63      2.982   9.05
8       61      25      2.980   8.22
jamesmkrieger commented 4 years ago

It could be good to incorporate the alignSequenceToMSA part into evol too as part of #1138

jamesmkrieger commented 4 years ago

I now have a new function trimAtomsUsingMSA, which does this:

$ ipython
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *

In [2]: pdb = parsePDB('2W5I')
@> PDB file is found in working directory (2w5i.pdb.gz).
@> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s.

In [3]: fetchPfamMSA('PF00074')
@> Pfam MSA for PF00074 is written as .\PF00074_full.sth.
Out[3]: '.\\PF00074_full.sth'

In [4]: msa = parseMSA('PF00074_full.sth')
@> 1187 sequence(s) with 316 residues were parsed in 0.06s.

In [5]: msa_refine = refineMSA(msa, label='RNAS1_BOVIN', seqid=0.98, colocc=0.7)
@> Label refinement reduced number of columns from 316 to 119 in 0.00s.
@> Sequence identity refinement reduced number of rows from 1187 to 995 in 0.13s.
@> Column occupancy refinement reduced number of columns from 119 to 106 in 0.01s.

In [6]: aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1)
@> PDB file is found in working directory (2w5i.pdb.gz).
@> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> UniProt idcode RNAS1_BOVIN for 2W5IB is found in MSA PF00074_full refined (label=RNAS1_BOVIN, seqid>=0.98, colocc>=0.7).     

In [7]: showAlignment(aln, indices=[idx_1, idx_2])

                                  20        30        40        50        60
2W5I            KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ

                                  16        23        33        43        53
RNAS1_BOVIN     --T-AAKFERQHMDSSTS-A---NYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ

                        70        80        90       100       110       120
2W5I            KNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF

                        61        71        81        91       101       106
RNAS1_BOVIN     KNVAC--GQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACE----VPVH-

2W5I            DASV

RNAS1_BOVIN     ----

In [8]: chB_trimmed = trimAtomsUsingMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1)
@> PDB file is found in working directory (2w5i.pdb.gz).
@> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> UniProt idcode RNAS1_BOVIN for 2W5IB is found in MSA PF00074_full refined (label=RNAS1_BOVIN, seqid>=0.98, colocc>=0.7).     

In [9]: chB_trimmed.ca.getResnums()
Out[9]: 
array([  3,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,  16,
        17,  18,  20,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,
        34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,
        47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,
        60,  61,  62,  63,  64,  65,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 116, 117,
       118, 119])

In [10]: chB_trimmed.ca.getSequence()
Out[10]: 'TAAKFERQHMDSSTSANYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEVPVH'
jamesmkrieger commented 4 years ago

I have now fixed it completely I think so I'm closing the issue.

gtzotzos commented 4 years ago

James,

Below is a repeat of your process. You’ll note a warning in ln [1] and and error in ln [6]. The argument chain is not recognised.

Last login: Wed Sep 9 13:48:36 on ttys001 /Users/george/.zshrc:1: command not found: $ (base) george@Georges-MacBook-Pro-2 ~ % ipython Python 3.7.6 (default, Jan 8 2020, 13:42:34) Type 'copyright', 'credits' or 'license' for more information IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *
/opt/anaconda3/lib/python3.7/site-packages/Bio/SubsMat/init.py:131: BiopythonDeprecationWarning: Bio.SubsMat has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.substitution_matrices as a replacement, and contact the Biopython developers if you still need the Bio.SubsMat module. BiopythonDeprecationWarning,

In [2]: pdb = parsePDB('2W5I')
@> Connecting wwPDB FTP server RCSB PDB (USA). @> 2w5i downloaded (2w5i.pdb.gz) @> PDB download via FTP completed (1 downloaded, 0 failed). @> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s.

In [3]: fetchPfamMSA('PF00074')
@> Pfam MSA for PF00074 is written as PF00074_full.sth. Out[3]: 'PF00074_full.sth'

In [4]: msa = parseMSA('PF00074_full.sth')
@> 1187 sequence(s) with 316 residues were parsed in 0.01s.

In [5]: msa_refine = refineMSA(msa, label='RNAS1_BOVIN', seqid=0.98, colocc=0.7)
@> Label refinement reduced number of columns from 316 to 119 in 0.00s. @> Sequence identity refinement reduced number of rows from 1187 to 995 in 0.17s. @> Column occupancy refinement reduced number of columns from 119 to 106 in 0.00s.

In [6]: aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1)

TypeError Traceback (most recent call last)

in ----> 1 aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1) TypeError: alignSequenceToMSA() got an unexpected keyword argument 'chain' In [7]: > On 3 Sep 2020, at 17:35, James Krieger wrote: > > > I now have a new function trimAtomsUsingMSA, which does this: > > $ ipython > Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)] > Type 'copyright', 'credits' or 'license' for more information > IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help. > > In [1]: from prody import * > > In [2]: pdb = parsePDB('2W5I') > @> PDB file is found in working directory (2w5i.pdb.gz). > @> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s. > > In [3]: fetchPfamMSA('PF00074') > @> Pfam MSA for PF00074 is written as .\PF00074_full.sth. > Out[3]: '.\\PF00074_full.sth' > > In [4]: msa = parseMSA('PF00074_full.sth') > @> 1187 sequence(s) with 316 residues were parsed in 0.06s. > > In [5]: msa_refine = refineMSA(msa, label='RNAS1_BOVIN', seqid=0.98, colocc=0.7) > @> Label refinement reduced number of columns from 316 to 119 in 0.00s. > @> Sequence identity refinement reduced number of rows from 1187 to 995 in 0.13s. > @> Column occupancy refinement reduced number of columns from 119 to 106 in 0.01s. > > In [6]: aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1) > @> PDB file is found in working directory (2w5i.pdb.gz). > @> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s. > @> UniProt idcode RNAS1_BOVIN for 2W5IB is found in MSA PF00074_full refined (label=RNAS1_BOVIN, seqid>=0.98, colocc>=0.7). > > In [7]: showAlignment(aln, indices=[idx_1, idx_2]) > > 20 30 40 50 60 > 2W5I KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ > > 16 23 33 43 53 > RNAS1_BOVIN --T-AAKFERQHMDSSTS-A---NYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ > > > 70 80 90 100 110 120 > 2W5I KNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF > > 61 71 81 91 101 106 > RNAS1_BOVIN KNVAC--GQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACE----VPVH- > > > > 2W5I DASV > > > RNAS1_BOVIN ---- > > > In [8]: chB_trimmed = trimAtomsUsingMSA(pdb, msa_refine, chain='B', mismatch=-2, gap_opening=-1) > @> PDB file is found in working directory (2w5i.pdb.gz). > @> 1921 atoms and 1 coordinate set(s) were parsed in 0.02s. > @> UniProt idcode RNAS1_BOVIN for 2W5IB is found in MSA PF00074_full refined (label=RNAS1_BOVIN, seqid>=0.98, colocc>=0.7). > > In [9]: chB_trimmed.ca.getResnums() > Out[9]: > array([ 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, > 17, 18, 20, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, > 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, > 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, > 60, 61, 62, 63, 64, 65, 68, 69, 70, 71, 72, 73, 74, > 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, > 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, > 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 116, 117, > 118, 119]) > > In [10]: chB_trimmed.ca.getSequence() > Out[10]: 'TAAKFERQHMDSSTSANYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEVPVH' > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub , or unsubscribe . >
jamesmkrieger commented 4 years ago

The warning in line 1 is not a problem for now. It will be a problem for later biopython versions, but I have made a fix for that on the github version of ProDy by removing that import. We are planning to make a new release soon, which will then not have this error.

For line 6, you can't use the chain keyword with the released version of ProDy. I added it recently on GitHub too. The solution if you want to use your current ProDy would be to add chain='B' to line 2 instead.

I'd recommend you to use the latest development version from GitHub where I have made all these fixes and improvements, but there are things you can do in the released version from PyPI/pip too. Basically, the answer is to look at the residue numbers in the AtomGroup and see which residues are missing in the MSA using alignSequenceToMSA and showAlignment or just printing out the two sequences. For 2W5I chain B, the AtomGroup residues numbers start at 1 and the first two residues are missing so we select starting at 3. Then the last three are missing too, ending the sequence at residue 121 for selection. You can see a better explanation on http://prody.csb.pitt.edu/test_prody/_build/html/tutorials/evol_tutorial/comparison.html - there were some problems with building it for some reason but it's still readable I think.

Best wishes James

gtzotzos commented 4 years ago

Hi James,

Many thanks for the clarifications and fixes. I’ve tried the suggested procedure(s) on different PDB structures. The procedure worked in a number of case but failed in others. In the following detailed description you will see where things go wrong.

Best regards

George

25 Sept 2020

PDB = 3L4A
PFAM alignment in PF01395_full.sth

=GS Q7PGA3_ANOGA/17-131 DR PDB; 3L4A A; 25-131;

Generate pdb file for input in the evol app
See: http://prody.csb.pitt.edu/test_prody/_build/html/tutorials/evol_tutorial/comparison.html
See: https://github.com/prody/ProDy/pull/1138

(newEnv) george@Georges-MacBook-Pro-2 3l4a % ipython Python 2.7.15 (default, Jan 12 2019, 21:07:57)

In [1]: from prody import *

In [2]: from matplotlib.pylab import *

In [3]: import numpy as np

In [4]: msa = parseMSA('PF01395_full.slx') @> 3223 sequence(s) with 451 residues were parsed in 0.04s.

In [5]: msa_refine = refineMSA(msa, label='Q7PGA3_ANOGA', seqid=0.98) @> Label refinement reduced number of columns from 451 to 115 in 0.00s. @> Sequence identity refinement reduced number of rows from 3223 to 2790 in 3.04s.

In [6]: pdb = parsePDB('3l4a', chain='A') @> Connecting wwPDB FTP server RCSB PDB (USA). @> 3l4a downloaded (3l4a.pdb.gz) @> PDB download via FTP completed (1 downloaded, 0 failed). @> 1120 atoms and 1 coordinate set(s) were parsed in 0.03s.

In [7]: aln, idx_1, idx_2 = alignSequenceToMSA(pdb, msa_refine, label='Q7PGA3_ANOG')

In [8]: showAlignment(aln, indices=[idx_1, idx_2])

                              12        22        32        42        52

3l4aA --------NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDK

                              20        30        40        50        60

Q7PGA3_ANOG QFVTAADNNESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDK

                    62        72        82        92       102       112

3l4aA YGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLREDYEMYQ

                    70        80        90       100       110       115

Q7PGA3_ANOG YGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED-----

3l4aA NNNELWSHP

Q7PGA3_ANOG ---------

In [9]: u, i = np.unique(idx_2, return_index=True)

In [10]: rel_i = i[1:]

In [11]: rel_i Out[11]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114])

In [12]: resnums_str = ' '.join([str(x) for x in idx_1[rel_i]])

In [13]: resnums_str Out[13]: '0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107'

107 residues aligned. The pdb output has 83 residues

In [14]: chA_ca = pdb.select('chain A and ca and resnum ' + resnums_str)

In [15]: chA_ca.getSequence() Out[15]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLE'

In [16]: len(chA_ca) Out[16]: 83

In [16]: writePDB('3L4A_1-83', chA_ca) Out[16]: '3L4A_1-83.pdb'

Using the Evol app.

% evol refine PF01395_full.slx -l Q7PGA3_ANOGA -s 0.98 -r 0.8 @> 3223 sequence(s) with 451 residues were parsed in 0.01s. @> Label refinement reduced number of columns from 451 to 115 in 0.00s. @> Row occupancy refinement reduced number of rows from 3223 to 2694 in 0.00s. @> Sequence identity refinement reduced number of rows from 2694 to 2299 in 0.66s. @> Refined MSA is written in file: PF01395_full_refined.slx

% evol conserv PF01395_full_refined.slx -S @> 2299 sequence(s) with 115 residues were parsed in 0.01s. @> Cannot parse label start, end values, Setting resnums 1 to 115

% evol coevol PF01395_full_refined.slx -S -F png -c apc --cmin 0.0 @> 2299 sequence(s) with 115 residues were parsed in 0.01s. @> Mutual information matrix was calculated in 0.04s. @> Cannot parse label start, end values, Setting resnums 1 to 115 @> Applying 'apc' correction. @> Cannot parse label start, end values, Setting resnums 1 to 115

% evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -p 3L4A_1-83.pdb @> 83 atoms and 1 coordinate set(s) were parsed in 0.00s. @> Number of residues in PDB does not match mutinfo matrix, ignoring PDB input @> MSA or PDB not given or does not match mutinfo, using serial indexing @> Residue numbers start and end with 1-115

On 24 Sep 2020, at 18:17, James Krieger notifications@github.com wrote:

The warning in line 1 is not a problem for now. It will be a problem for later biopython versions, but I have made a fix for that on the github version of ProDy by removing that import. We are planning to make a new release soon, which will then not have this error.

For line 6, you can't use the chain keyword with the released version of ProDy. I added it recently on GitHub too. The solution if you want to use your current ProDy would be to add chain='B' to line 2 instead.

I'd recommend you to use the latest development version from GitHub where I have made all these fixes and improvements, but there are things you can do in the released version from PyPI/pip too. Basically, the answer is to look at the residue numbers in the AtomGroup and see which residues are missing in the MSA using alignSequenceToMSA and showAlignment or just printing out the two sequences. For 2W5I chain B, the AtomGroup residues numbers start at 1 and the first two residues are missing so we select starting at 3. Then the last three are missing too, ending the sequence at residue 121 for selection. You can see a better explanation on http://prody.csb.pitt.edu/test_prody/_build/html/tutorials/evol_tutorial/comparison.html http://prody.csb.pitt.edu/test_prody/_build/html/tutorials/evol_tutorial/comparison.html - there were some problems with building it for some reason but it's still readable I think.

Best wishes James

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/prody/ProDy/issues/772#issuecomment-698476490, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXCGKICPU4GV7JVENY4V5LSHN5JFANCNFSM4GCVZVXQ.

jamesmkrieger commented 4 years ago

Hi George,

Firstly, I'll address the command line part where it looks like the rel_i part didn't work as planned. I'll make another comment on the application part afterwards.

If you look at the output of showAlignment, the common sequence between the MSA and PDB is the following

NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED

and you came up with the following

NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLE

I think the problem here is the way you're indexing, which doesn't account for having gaps in both the PDB and MSA.

Here's how I'd do it instead:

First, we need to generate a set of relevant indices rel_i that takes account of the gaps at the beginning of the PDB sequence, which means we need to get the indices for the unique entries in idx_1 and skip the first index, which corresponds to the repeated empty position 24:

In [9]: idx_1
Out[9]: 
array([ 24,  24,  24,  24,  24,  24,  24,  24,  25,  26,  27,  28,  29,
        30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,
        69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
        82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
        95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107,
       108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
       121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145])

In [10]: u1, i1 = np.unique(idx_1, return_index=True)

In [11]: u1, i1
Out[11]: 
(array([ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,
         37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,
         50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
         63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,
         76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,
         89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101,
        102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
        115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
        128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
        141, 142, 143, 144, 145]),
 array([  0,   8,   9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,
         20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32,
         33,  34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,
         46,  47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,
         59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,
         72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,
         85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
        111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
        124, 125, 126, 127, 128], dtype=int64))

In [12]: rel_i1 = i1[1:]

In [13]: rel_i1
Out[13]: 
array([  8,   9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,
        21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,
        34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,
        47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,
        60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,
        73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,
        86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,
        99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
       112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
       125, 126, 127, 128], dtype=int64)

In [14]: resnums_str11 = ' '.join([str(x) for x in idx_1[rel_i1]])

In [15]: resnums_str11
Out[15]: '25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145'

Then, we need to clip off the residues at the end, which are gapped in the MSA, which is where idx_2 comes in. We see that it has a load of position 115 at the end, which we need to crop off by taking the relevant unique indices from there:

In [16]: idx_2
Out[16]: 
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 115, 115,
       115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115])

In [17]: u2, i2 = np.unique(idx_2, return_index=True)

In [18]: u2, i2
Out[18]: 
(array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
         27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
         40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
         53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
         79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
         92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
        105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115]),
 array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
         26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
         39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
         52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
         65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
         91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114], dtype=int64))

I think we way to apply this is to apply it to indx_1 before calculating the final rel_i:

In [56]: idx_1_cut = idx_1[i2]

In [57]: idx_1
Out[57]: 
array([ 24,  24,  24,  24,  24,  24,  24,  24,  25,  26,  27,  28,  29,
        30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,
        69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
        82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
        95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107,
       108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
       121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145])

In [58]: idx_1_cut
Out[58]: 
array([ 24,  24,  24,  24,  24,  24,  24,  24,  25,  26,  27,  28,  29,
        30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,
        69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
        82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
        95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107,
       108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
       121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131])

At this point, we see that idx_1_cut ends at 131 like we expect from DR PDB; 3L4A A; 25-131. The last step to get it to start at 25 is to index it with i1[1:] but taking into account the stopping point we generated:

In [66]: rel_i = i1[1:][np.where(i1[1:] < len(idx_1_cut))]

In [67]: rel_i
Out[67]: 
array([  8,   9,  10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,
        21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,
        34,  35,  36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,
        47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,
        60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,
        73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,
        86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,
        99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
       112, 113, 114], dtype=int64)

In [68]: resnums_str = ' '.join([str(x) for x in idx_1[rel_i]])

In [69]: resnums_str
Out[69]: '25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131'

In [70]: chA_ca = pdb.select('chain A and ca and resnum ' + resnums_str)

In [71]: chA_ca.getSequence()
Out[71]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED'

If you could test this approach on another system that you tested before and see that it works there too, I'd really appreciate that.

Best wishes James

jamesmkrieger commented 4 years ago

As for the evol app part, there are two reasons why that didn't work.

Firstly, the PDB was wrong as the approach above was wrong. You could have got the right answer from DR PDB; 3L4A A; 25-131 as I mentioned above anyway.

I think that still wouldn't probably work though as you have a part of the MSA that is lacking from the PDB. The way to deal with that would be to refine the MSA with the PDB ID, which then makes it have the same sequence as the sliced PDB from ealier as follows:

In [73]: msa_refine2 = refineMSA(msa, label='3l4aA', seqid=0.98)
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in chain MSA PF01395_full.
@> Label refinement reduced number of columns from 451 to 115 in 0.02s.
@> Structure refinement reduced number of columns from 115 to 107 in 0.02s.
@> Sequence identity refinement reduced number of rows from 3223 to 2775 in 0.82s.

In [74]: msa_refine2.getIndex('Q7PGA3_ANOG')
Out[74]: 1453

In [75]: msaseq = msa_refine2[1453]

In [76]: str(msaseq)
Out[76]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED'

In [77]: chA_ca.getSequence()
Out[77]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED'

We can therefore, write this PDB file and be happy with it.

In [78]: writePDB('3l4aA_25-131', chA_ca)
Out[78]: '3l4aA_25-131.pdb

We can then use this with evol as follows:

$ evol refine PF01395_full.sth -l 1l4aA -s 0.98 -r 0.8
@> 3223 sequence(s) with 451 residues were parsed in 0.02s.
@> PDB file is found in working directory (1l4a.pdb.gz).
@> 2381 atoms and 1 coordinate set(s) were parsed in 0.02s.
@> Secondary structures were assigned to 283 residues.
Traceback (most recent call last):
  File "C:\Users\james\.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\apps\apptools.py", line 281, in callback
    func(*args, **ns.__dict__)
  File "C:\Users\james\.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\apps\evol_apps\evol_refine.py", line 95, in evol_refine    
    writeMSA(outname, refineMSA(parseMSA(msa), **kwargs), **kwargs)
  File "C:\Users\james\.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\sequence\msa.py", line 528, in refineMSA
    raise ValueError('label is not in msa, or msa is not indexed')
ValueError: label is not in msa, or msa is not indexed
An exception occurred when executing your command.  If this is not a user error, please report it to ProDy developers at: https://github.com/prody/ProDy/issues
Error: label is not in msa, or msa is not indexed

I have no clue what's wrong here, but I'll work it out another time. For now, we can execute the equivalent commands from Python:

Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *

In [2]: from os.path import splitext

In [3]: msa = 'PF01395_full.sth'

In [4]: outname, ext = splitext(msa)

In [5]: outname += '_refined' + ext

In [6]: kwargs = {'label':'3l4aA', 'seqid':0.98, 'rowocc':0.8}

In [7]: writeMSA(outname, refineMSA(parseMSA(msa), **kwargs), **kwargs)
@> 3223 sequence(s) with 451 residues were parsed in 0.02s.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in chain MSA PF01395_full.
@> Label refinement reduced number of columns from 451 to 115 in 0.04s.
@> Structure refinement reduced number of columns from 115 to 107 in 0.01s.
@> Row occupancy refinement reduced number of rows from 3223 to 2729 in 0.00s.
@> Sequence identity refinement reduced number of rows from 2729 to 2324 in 0.64s.
Out[7]: 'PF01395_full_refined.sth'

Then we continue as before:

$ evol conserv PF01395_full_refined.slx -S
@> 2324 sequence(s) with 107 residues were parsed in 0.02s.
@> Label start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107

We find that the resnum setting doesn't work because the refine step didn't update the residue numbers in the label. It still expects 115 residues based on the numbers in the label rather than 107 because it doesn't account for the 8 we removed that were lacking in the PDB.

Looking at the function refineMSA to see if I can account for that, I see that the PDB refinement is running along the same way as what we're trying to do with alignSequenceToMSA but on the MSA side:

            if chain is not None and not kwargs.get('keep', False):
                before = arr.shape[1]
                LOGGER.timeit('_refine')

                from Bio import pairwise2
                from prody.utilities import MATCH_SCORE, MISMATCH_SCORE
                from prody.utilities import GAP_PENALTY, GAP_EXT_PENALTY, ALIGNMENT_METHOD

                chseq = chain.getSequence()
                algn = pairwise2.align.localms(pystr(arr[index].tostring().upper()), pystr(chseq),
                                         MATCH_SCORE, MISMATCH_SCORE,
                                         GAP_PENALTY, GAP_EXT_PENALTY,
                                         one_alignment_only=1)
                torf = []
                for s, c in zip(*algn[0][:2]):
                    if s == '-':
                        continue
                    elif c != '-':
                        torf.append(True)
                    else:
                        torf.append(False)
                torf = array(torf)
                tsum = torf.sum()
                assert tsum <= before, 'problem in mapping sequence to structure'
                if tsum < before:
                    arr = arr.take(torf.nonzero()[0], 1)
                    LOGGER.report('Structure refinement reduced number of '
                                  'columns from {0} to {1} in %.2fs.'
                                  .format(before, arr.shape[1]), '_refine')
                else:
                    LOGGER.debug('All residues in the sequence are contained in '
                                 'PDB structure {0}.'.format(label))

I now have a potential fix for that at #1184.

After that, it then gives me the following:

$ evol conserv PF01395_full_refined.slx -S
@> 2324 sequence(s) with 107 residues were parsed in 0.01s.
@> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107
@> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132

$ evol coevol PF01395_full_refined.slx -S -F png -c apc --cmin 0.0
@> 2324 sequence(s) with 107 residues were parsed in 0.01s.
@> Mutual information matrix was calculated in 0.03s.
@> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107
@> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132
@> Applying 'apc' correction.
@> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107
@> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -p 3l4aA_25-131.pdb
@> 107 atoms and 1 coordinate set(s) were parsed in 0.00s.
@> Secondary structures were assigned to 86 residues.
@> Residue numbers will be based on pdb: 3l4aA_25-131
@> Residue numbers start and end with 25-131
@> use-dist not set, using sequence separation to report coevolving pairs

How's that?

Best wishes James

jamesmkrieger commented 4 years ago

So, there are two things left to work out:

  1. How to automate the selection of the PDB residues, which is obviously suboptimal. The idea would be to rearrange together refineMSA, alignSequenceToMSA and trimSequenceUsingMSA. The function alignSequenceToMSA would become the backend of both of the other two and incorporate the key parts of refineMSA, which it has already started doing.

  2. Why refineMSA can accept a PDB ID + chain ID but evol refine doesn't and how to fix that.

gtzotzos commented 4 years ago

Hi James,

I’m referring to the potential fix for that at #1184 https://github.com/prody/ProDy/pull/1184

Could you please be more explicit about this fix? I failed to see it.

Many thanks

Cheers

George

On 2 Oct 2020, at 17:41, James Krieger notifications@github.com wrote:

As for the evol app part, there are two reasons why that didn't work.

Firstly, the PDB was wrong as the approach above was wrong. You could have got the right answer from DR PDB; 3L4A A; 25-131 as I mentioned above anyway.

I think that still wouldn't probably work though as you have a part of the MSA that is lacking from the PDB. The way to deal with that would be to refine the MSA with the PDB ID, which then makes it have the same sequence as the sliced PDB from ealier as follows:

In [73]: msa_refine2 = refineMSA(msa, label='3l4aA', seqid=0.98) @> PDB file is found in working directory (3l4a.pdb.gz). @> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s. @> Secondary structures were assigned to 94 residues. @> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in chain MSA PF01395_full. @> Label refinement reduced number of columns from 451 to 115 in 0.02s. @> Structure refinement reduced number of columns from 115 to 107 in 0.02s. @> Sequence identity refinement reduced number of rows from 3223 to 2775 in 0.82s.

In [74]: msa_refine2.getIndex('Q7PGA3_ANOG') Out[74]: 1453

In [75]: msaseq = msa_refine2[1453]

In [76]: str(msaseq) Out[76]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED'

In [77]: chA_ca.getSequence() Out[77]: 'NESVIESCSNAVQGAANDELKVHYRANEFPDDPVTHCFVRCIGLELNLYDDKYGVDLQANWENLGNSDDADEEFVAKHRACLEAKNLETIEDLCERAYSAFQCLRED' We can therefore, write this PDB file and be happy with it.

In [78]: writePDB('3l4aA_25-131', chA_ca) Out[78]: '3l4aA_25-131.pdb We can then use this with evol as follows:

$ evol refine PF01395_full.sth -l 1l4aA -s 0.98 -r 0.8 @> 3223 sequence(s) with 451 residues were parsed in 0.02s. @> PDB file is found in working directory (1l4a.pdb.gz). @> 2381 atoms and 1 coordinate set(s) were parsed in 0.02s. @> Secondary structures were assigned to 283 residues. Traceback (most recent call last): File "C:\Users\james.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\apps\apptools.py", line 281, in callback func(*args, ns.dict) File "C:\Users\james.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\apps\evol_apps\evol_refine.py", line 95, in evol_refine
writeMSA(outname, refineMSA(parseMSA(msa),
kwargs), **kwargs) File "C:\Users\james.conda\envs\prody\lib\site-packages\prody-1.11-py3.7-win-amd64.egg\prody\sequence\msa.py", line 528, in refineMSA raise ValueError('label is not in msa, or msa is not indexed') ValueError: label is not in msa, or msa is not indexed An exception occurred when executing your command. If this is not a user error, please report it to ProDy developers at: https://github.com/prody/ProDy/issues Error: label is not in msa, or msa is not indexed I have no clue what's wrong here, but I'll work it out another time. For now, we can execute the equivalent commands from Python:

Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *

In [2]: from os.path import splitext

In [3]: msa = 'PF01395_full.sth'

In [4]: outname, ext = splitext(msa)

In [5]: outname += '_refined' + ext

In [6]: kwargs = {'label':'3l4aA', 'seqid':0.98, 'rowocc':0.8}

In [7]: writeMSA(outname, refineMSA(parseMSA(msa), kwargs), kwargs) @> 3223 sequence(s) with 451 residues were parsed in 0.02s. @> PDB file is found in working directory (3l4a.pdb.gz). @> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s. @> Secondary structures were assigned to 94 residues. @> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in chain MSA PF01395_full. @> Label refinement reduced number of columns from 451 to 115 in 0.04s. @> Structure refinement reduced number of columns from 115 to 107 in 0.01s. @> Row occupancy refinement reduced number of rows from 3223 to 2729 in 0.00s. @> Sequence identity refinement reduced number of rows from 2729 to 2324 in 0.64s. Out[7]: 'PF01395_full_refined.sth' Then we continue as before:

$ evol conserv PF01395_full_refined.slx -S @> 2324 sequence(s) with 107 residues were parsed in 0.02s. @> Label start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107 We find that the resnum setting doesn't work because the refine step didn't update the residue numbers in the label. It still expects 115 residues based on the numbers in the label rather than 107 because it doesn't account for the 8 we removed that were lacking in the PDB.

Looking at the function refineMSA to see if I can account for that, I see that the PDB refinement is running along the same way as what we're trying to do with alignSequenceToMSA but on the MSA side:

        if chain is not None and not kwargs.get('keep', False):
            before = arr.shape[1]
            LOGGER.timeit('_refine')

            from Bio import pairwise2
            from prody.utilities import MATCH_SCORE, MISMATCH_SCORE
            from prody.utilities import GAP_PENALTY, GAP_EXT_PENALTY, ALIGNMENT_METHOD

            chseq = chain.getSequence()
            algn = pairwise2.align.localms(pystr(arr[index].tostring().upper()), pystr(chseq),
                                     MATCH_SCORE, MISMATCH_SCORE,
                                     GAP_PENALTY, GAP_EXT_PENALTY,
                                     one_alignment_only=1)
            torf = []
            for s, c in zip(*algn[0][:2]):
                if s == '-':
                    continue
                elif c != '-':
                    torf.append(True)
                else:
                    torf.append(False)
            torf = array(torf)
            tsum = torf.sum()
            assert tsum <= before, 'problem in mapping sequence to structure'
            if tsum < before:
                arr = arr.take(torf.nonzero()[0], 1)
                LOGGER.report('Structure refinement reduced number of '
                              'columns from {0} to {1} in %.2fs.'
                              .format(before, arr.shape[1]), '_refine')
            else:
                LOGGER.debug('All residues in the sequence are contained in '
                             'PDB structure {0}.'.format(label))

I now have a potential fix for that at #1184 https://github.com/prody/ProDy/pull/1184.

After that, it then gives me the following:

$ evol conserv PF01395_full_refined.slx -S @> 2324 sequence(s) with 107 residues were parsed in 0.01s. @> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107 @> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132

$ evol coevol PF01395_full_refined.slx -S -F png -c apc --cmin 0.0 @> 2324 sequence(s) with 107 residues were parsed in 0.01s. @> Mutual information matrix was calculated in 0.03s. @> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107 @> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132 @> Applying 'apc' correction. @> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107 @> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -p 3l4aA_25-131.pdb @> 107 atoms and 1 coordinate set(s) were parsed in 0.00s. @> Secondary structures were assigned to 86 residues. @> Residue numbers will be based on pdb: 3l4aA_25-131 @> Residue numbers start and end with 25-131 @> use-dist not set, using sequence separation to report coevolving pairs How's that?

Best wishes James

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/prody/ProDy/issues/772#issuecomment-702837340, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABXCGKM2JNDN63T4IBJZJ6LSIX7FBANCNFSM4GCVZVXQ.

jamesmkrieger commented 4 years ago

Hi George,

Evol uses a function pickSequence to pick a sequence from the MSA that has the right number of residues with no gaps as would be obtained by label refinement with the name of a sequence from the MSA. When we provide a label containing a PDB ID and chain, then it performs structure refinement and removes extra residues that are missing from the PDB structure, but it previously wouldn't update the sequence label to account for that, so the start and stop residue numbers in the label would be wrong and evol would instead reset the numbers from 1 to the number of residues. The change in the first commit is therefore to read out the residue numbers from the PDB structure and selects the ones from them that match the alignment and then uses those to generate a sequence label.

However, even then pickSequence may find multiple rows in the alignment with the right number of residues without gaps and previously it would just pick the first one it came across and complain that the label doesn't have the right residue numbers. The next commit therefore modifies pickSequence and findResnums to have a score for matching the label residue numbers so that it can optionally try to pick the one with a match. If no match is found then the last sequence with the right number of residues is returned rather than the first one, which probably doesn't make much difference.

I then also made it so that showShannonEntropy for evol conserv and showMutInfoMatrix for evol coevol use that new option of looking for the matching one rather than taking the first one and reporting no match.

This then gives the following output that I showed before:

@> Label A0A182HX68_ANOAR start-end entry does not match length of ungapped sequence. Setting resnums 1 to 107
@> Label Q7PGA3_ANOGA start-end entry matches length of ungapped sequence. Setting resnums 26 to 132
jamesmkrieger commented 4 years ago

Oh, I also haven't merged that fix yet as I wanted to sort out the remaining issues. However, seeing as I'm not going to sort them out any time that soon, I've now merged it.

Best wishes James

jamesmkrieger commented 3 years ago
2. Why `refineMSA` can accept a PDB ID + chain ID but `evol refine` doesn't and how to fix that.

Actually refineMSA has the same problem with that too so that should be easier to work out:

$ evol refine PF01395_full.sth -l 3N7HA -s 0.98 -r 0.8
@> 3223 sequence(s) with 451 residues were parsed in 0.02s.
@> PDB file is found in working directory (3n7h.pdb.gz).
@> 2459 atoms and 1 coordinate set(s) were parsed in 0.03s.
@> Secondary structures were assigned to 170 residues.
Traceback (most recent call last):
  File "C:\Users\james\code\ProDy\prody\apps\apptools.py", line 281, in callback
    func(*args, **ns.__dict__)
  File "C:\Users\james\code\ProDy\prody\apps\evol_apps\evol_refine.py", line 95, in evol_refine
    writeMSA(outname, refineMSA(parseMSA(msa), **kwargs), **kwargs)
  File "C:\Users\james\code\ProDy\prody\sequence\msa.py", line 529, in refineMSA
    raise ValueError('label is not in msa, or msa is not indexed')
ValueError: label is not in msa, or msa is not indexed
An exception occurred when executing your command.  If this is not a user error, please report it to ProDy developers at: https://github.com/prody/ProDy/issues 
Error: label is not in msa, or msa is not indexed
$ ipython
Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from prody import *

In [2]: msa = parseMSA('PF01395_full.sth')
@> 3223 sequence(s) with 451 residues were parsed in 0.03s.

In [3]: msa_ref = refineMSA(msa, label='3N7HA', seqid=0.98, rowocc=0.8)
@> PDB file is found in working directory (3n7h.pdb.gz).
@> 2459 atoms and 1 coordinate set(s) were parsed in 0.04s.
@> Secondary structures were assigned to 170 residues.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-5786de7c0d64> in <module>
----> 1 msa_ref = refineMSA(msa, label='3N7HA', seqid=0.98, rowocc=0.8)

~\code\ProDy\prody\sequence\msa.py in refineMSA(msa, index, label, rowocc, seqid, colocc, **kwargs)
    527
    528             if index is None:
--> 529                 raise ValueError('label is not in msa, or msa is not indexed')
    530             try:
    531                 len(index)

ValueError: label is not in msa, or msa is not indexed

We also still have the issue with evol rankorder not accepting the PDB.

jamesmkrieger commented 3 years ago

The problem appears to be that the DBREF entry in the PDB file only includes Q8T6S0 and Q8T6S0_ANOGA, which are not in the MSA labels, even though https://www.rcsb.org/structure/3n7h lists A0A1U7F4W2, which is. In this case, this is not a problem then.

jamesmkrieger commented 3 years ago

The refinement does work when the PDB file maps to the labels correctly as is the case for 3l4aA:

$ evol refine PF01395_full.sth -l 3l4aA -s 0.98 -r 0.8
@> 3223 sequence(s) with 451 residues were parsed in 0.02s.
@> Connecting wwPDB FTP server RCSB PDB (USA).
@> 3l4a downloaded (C:\Users\james\...\3l4a.pdb.gz)
@> PDB download via FTP completed (1 downloaded, 0 failed).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in chain MSA PF01395_full.
@> Label refinement reduced number of columns from 451 to 115 in 2.13s.
@> Structure refinement reduced number of columns from 115 to 107 in 0.01s.
@> Row occupancy refinement reduced number of rows from 3223 to 2729 in 0.02s.
@> Sequence identity refinement reduced number of rows from 2729 to 2324 in 0.65s.
@> Refined MSA is written in file: PF01395_full_refined.sth

The issue with the rankorder command is still unsolved though.

jamesmkrieger commented 3 years ago

rankorder does work if given a label with -l:

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -z -n 20 -p 3l4aA -m PF01395_full_refined.sth -l Q7PGA3_ANOGA
@> 2324 sequence(s) with 107 residues were parsed in 0.01s.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in MSA PF01395_full_refined.
@> Residue numbers will be based on MSA and label: Q7PGA3_ANOGA
@> Residue numbers start and end with 26-132
@> zscore normalization applied such that each column has 0 mean and standard deviation 1

We just need to work out why it can't get it itself.

Without giving it a label it picks the wrong one and has a problem:

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -z -n 20 -p 3l4aA -m PF01395_full_refined.sth
@> 2324 sequence(s) with 107 residues were parsed in 0.01s.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.00s.
@> Secondary structures were assigned to 94 residues.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in MSA PF01395_full_refined.
@> Label: A0A182HX68_ANOAR/15-135 and mutinfo do not have similar no of residues, using serial indexing
@> Residue numbers start and end with 1-107
@> zscore normalization applied such that each column has 0 mean and standard deviation 1
jamesmkrieger commented 3 years ago

Ok, I found the problem was with trimAtomsUsingMSA and I fixed it in #1286, which will be merged soon.

Now rankorder works too:

$ evol rankorder PF01395_full_refined_mutinfo_corr_apc.txt -q 5 -z -n 20 -p 3l4aA -m PF01395_full_refined.sth
@> 2324 sequence(s) with 107 residues were parsed in 0.01s.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> PDB file is found in working directory (3l4a.pdb.gz).
@> 1120 atoms and 1 coordinate set(s) were parsed in 0.01s.
@> Secondary structures were assigned to 94 residues.
@> UniProt idcode Q7PGA3_ANOGA for 3l4aA is found in MSA PF01395_full_refined.
@> Residue numbers will be based on pdb: 3l4aA
@> Residue numbers start and end with 25-131
@> zscore normalization applied such that each column has 0 mean and standard deviation 1
@> use-dist not set, using sequence separation to report coevolving pairs
jamesmkrieger commented 3 years ago

I think there should be no more problems so I'm closing this. Please feel free to reopen if there are.