Closed by andrewalverson 3 years ago
Andrew, that is a good suggestion and we will look into it in future releases.
In the meantime, if you are comfortable with Python, the lines that need to change are (I think): https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L125 and https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L156
Map the string `s` to whatever name you want in these two lines. In each case, the variable `s` holds the name as it appears in the gene tree. If you map it to your desired name (however you like to do that in Python, e.g., with a regex or a mapping dictionary) before it is saved in the dictionaries `occ` and `mapping`, I think it would all work.
At least it will tell you which species to cut from which genes.
To fix the actual removal from the tree (which may fail), I think this line also needs to be changed: https://github.com/uym2/TreeShrink/blob/master/run_treeshrink.py#L233 (and perhaps others).
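For illustration, the name mapping could look roughly like the sketch below. It assumes every sequence name starts with one of a known set of species prefixes. The names `SPECIES`, `species_of`, and `counts` are hypothetical and not part of TreeShrink, and the toy counter only stands in for the real bookkeeping, whose actual structure in `run_treeshrink.py` is not shown here.

```python
import re

# Hypothetical species prefixes; in practice this list would come from the user.
SPECIES = ["Homo_sapiens", "Mus_musculus"]

# Try longer prefixes first so that a longer species name is preferred
# over a shorter one that happens to be its prefix.
_PREFIX_RE = re.compile(
    "|".join(re.escape(sp) for sp in sorted(SPECIES, key=len, reverse=True))
)


def species_of(s):
    """Return the species prefix of sequence name s, or s itself if none matches."""
    m = _PREFIX_RE.match(s)
    return m.group(0) if m else s


# Toy stand-in for the bookkeeping: count occurrences per species, not per sequence.
leaf_names = [
    "Homo_sapiens_TRINITY_DN1_c0_g1",
    "Homo_sapiens_ENSP0001",
    "Mus_musculus_gene7",
    "unknown_seq",
]
counts = {}
for s in leaf_names:
    sp = species_of(s)  # map the gene-tree name before storing it
    counts[sp] = counts.get(sp, 0) + 1
```

With this kind of mapping, the two `Homo_sapiens` sequences are counted under one species instead of as two separate rare ones. Names with no matching prefix fall back to the full sequence name.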
I’ll give it a go – thanks!
In newer releases (v1.3.5 or above), you can use `-g` to map sequence names to species names.
Thanks for writing this software. I'm analyzing a large number of alignments that contain as many as a few hundred species. The FASTA header for each sequence in the dataset is unique, containing information about its gene or protein ID or Trinity assembly. Every sequence name/header is, however, consistently prefixed with a species name. Because these are gene trees, the same species can occur multiple times in an alignment. If I understand correctly, TreeShrink doesn't do any parsing of the sequence names to isolate the species name – it just uses the sequence name as the species ID. As I understand it, each sequence is therefore interpreted by TreeShrink as a rare species. Would it be difficult to parse the sequence name for the species ID so that datasets like the one I've described could be run in 'per-species' mode? For example, maybe the user could provide a list of the species names that prefix the sequence ID. Thanks.