jbloomlab / phydms

phylogenetic analyses informed by deep mutational scanning data
GNU General Public License v3.0
14 stars 4 forks source link

shortened headers to 253 characters downstream of mafft #28

Closed skhilton closed 4 years ago

skhilton commented 4 years ago

This pull request fixes a bug caused by mafft shortening sequence names. In phydms_prepalignment function MAFFT_CDSandProtAlignments there are two dictionaries keyed by prealignment sequence names. However, mafft trims sequence names to 253 characters so there was mismatch of sequence names pre- vs. post-alignment if the original name was greater than 253 characters. The current solution was to create this dictionaries using only the first 253 characters. The script now also checks that trimming to the first 253 characters still results in unique sequence names.

There is one major downside to this solution. I could not find a reference to the shortening of sequence names in the mafft documentation. This means that if mafft changes this behavior, which seems unlikely, phydms_prepalignment will break again. A more complicated solution would be to check if the post-alignment name is a substring of the pre-alignment name but it is difficult to key a dictionary this way.

Because a mafft behavior change seems unlikely, I went with the simple fix. But I understand if @jbloom thinks a refactor is worth it for more robust behavior.

This pull request closes issue #25

jbloom commented 4 years ago

@skhilton: Can you increment version number and CHANGELOG too?