uym2 / TreeShrink

Implementation of the TreeShrink problem
https://uym2.github.io/TreeShrink/
GNU General Public License v3.0

parse species name question #7

Closed (andrewalverson closed this issue 3 years ago)

andrewalverson commented 5 years ago

Thanks for writing this software. I'm analyzing a large number of alignments that contain as many as a few hundred species. The FASTA header for each sequence in the dataset is unique, containing information about its gene or protein ID or Trinity assembly. Every sequence name/header is, however, consistently prefixed with a species name. Because these are gene trees, the same species can occur multiple times in an alignment. If I understand correctly, TreeShrink doesn't do any parsing of the sequence names to isolate the species name – it just uses the sequence name as the species ID. As I understand it, each sequence is therefore interpreted by TreeShrink as a rare species. Would it be difficult to parse the sequence name for the species ID so that datasets like the one I've described could be run in 'per-species' mode? For example, maybe the user could provide a list of the species names that prefix the sequence ID. Thanks.
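To make the request concrete, here is a minimal sketch of the user-supplied prefix list idea. It is not TreeShrink code: the function name `species_from_header` and the example headers and species names are hypothetical, and it assumes every header begins with exactly one of the listed species names. When one species name is itself a prefix of another, the longest match wins.

```python
def species_from_header(header, species_names):
    """Return the longest species name that prefixes `header`, or None."""
    matches = [sp for sp in species_names if header.startswith(sp)]
    return max(matches, key=len) if matches else None


# Hypothetical species list and FASTA headers, for illustration only.
species = ["Homo_sapiens", "Homo_sapiens_neanderthalensis", "Mus_musculus"]
headers = [
    "Homo_sapiens_TRINITY_DN1_c0_g1",
    "Homo_sapiens_neanderthalensis_prot42",
    "Mus_musculus_ENSP0001",
    "unlabelled_sequence_9",
]

for h in headers:
    # Headers with no listed prefix map to None and can be reported to the user.
    print(h, "->", species_from_header(h, species))
```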

smirarab commented 5 years ago

Andrew, that is a good suggestion and we will look into it in future releases.

In the meantime, if you are comfortable with Python, the lines that need to change are (I think): https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L125 and https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L156

Map the string s to whatever name you want in these two lines. In each case, the variable s holds the name from the gene tree. If you map it to your desired name (however you prefer to do that in Python, e.g., with a regex or a mapping dictionary) before saving it in the dictionaries occ and mapping, I think it would all work. At the least, it will tell you which species to cut from which genes.

To fix the actual removal from the tree (which may fail), I think this line also needs to be changed: https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L233 (and perhaps others).
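A rough sketch of the kind of mapping described above, combining a regex with an optional override dictionary. It is only an illustration under stated assumptions: `to_species`, `group_by_species`, the `Genus_species` prefix pattern, and the example leaf names are all hypothetical, and the code does not reproduce the actual structure of occ or mapping in run_treeshrink.py.

```python
import re
from collections import defaultdict

# Assumption: every leaf name starts with a "Genus_species" prefix.
SPECIES_RE = re.compile(r"^([A-Za-z]+_[A-Za-z]+)")


def to_species(s, overrides=None):
    """Map a gene-tree leaf name `s` to a species name.

    An explicit dictionary entry wins; otherwise the regex prefix is used;
    if neither applies, the name is returned unchanged.
    """
    if overrides and s in overrides:
        return overrides[s]
    m = SPECIES_RE.match(s)
    return m.group(1) if m else s


def group_by_species(leaf_names, overrides=None):
    """Collect the leaf names of one gene tree under their species name."""
    groups = defaultdict(list)
    for s in leaf_names:
        groups[to_species(s, overrides)].append(s)
    return dict(groups)


# Hypothetical leaf names from a single gene tree.
leaves = [
    "Homo_sapiens_TRINITY_DN1_c0_g1",
    "Homo_sapiens_TRINITY_DN7_c2_g1",
    "Mus_musculus_ENSP0001",
    "odd-header-17",
]
print(group_by_species(leaves, overrides={"odd-header-17": "Danio_rerio"}))
```

Keeping the original leaf names alongside each species matters for the removal step: the species name decides what to cut, but the tree still has to be pruned by its original leaf labels.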

andrewalverson commented 5 years ago

I’ll give it a go – thanks!

uym2 commented 3 years ago

In newer releases (v1.3.5 or above), you can use the -g option to map sequence names to species names.