jbloomlab / phydms

phylogenetic analyses informed by deep mutational scanning data
GNU General Public License v3.0
14 stars 4 forks source link

Sites in ExpCM preferences file must start with 1? #3

Closed mbdoud closed 8 years ago

mbdoud commented 8 years ago

I was wondering if there was any particular reason for the checks in phydms that the preferences file be numbered starting with "1" (line 333 of phydms):

if not (min(sites) == 1 and max(sites) - min(sites) == len(sites) - 1): raise ValueError("Sites in preferences file must start at 1 and be consecutive")

Would it be sufficient to just make sure that the number of sites in the preferences file matches the length of the alignment provided? I think it would be much more flexible if the preferences file didn't have to start at site "1", but I can't immediately tell how much work (if any) it would take in the phydms code to do this.

jbloom commented 8 years ago

phydms indexes your alignment sequentially (1, 2, 3, ..., L). Therefore, it needs a set of preferences for each site. If your preferences don't start until amino-acid 2, then phydms doesn't have experimental information to assign an ExpCM for site 1. Therefore, you can't reasonably include a position in the alignment if you don't have preferences for it, as it is impossible to build and ExpCM for it. So if you don't want to include position 1 in your analysis, then it shouldn't be in your alignment either. After you remove it, your previous position 2 is now the first position in the alignment.

Therefore, renumber your preferences so that they match one-to-one with the data in your actual alignment. You can do this pre and post analysis with programs like http://jbloom.github.io/dms_tools/dms_editsites.html and http://jbloom.github.io/phydms/phydms_renumber_prog.html.

If you are using a non-sequentially numbered protein (or lack data for some sites), this means that you will have to keep track of the alternative numbering scheme yourself. However, this is the standard for virtually all phylogenetic programs. It analyzes the alignment that you give it -- if you want to remove sits from that alignment or name the sites by some other numbering scheme you are free to do so, but you do that externally from the phylogenetics program.