Open SWei2333 opened 1 month ago
Dear @SWei2333,
Thank you for your inquiry and use of HyPhy.
When dealing with multi-copy orthologs from OrthoFinder:
Please refer to this thread for additional dialogue on this question.
Best, Steven
Thank you for your reply; I'll give it a try!
Dear Steven,
I tested an orthogroup (OG) containing 607 sequences from only 27 species. After running the remove-duplicates.bf script, 599 sequences remained, which is still far too many for 27 species. Could you offer some advice?
Here is the command I used:
/data/software/hyphy-2.5.62/hyphy /data/software/hyphy-analyses/remove-duplicates/remove-duplicates.bf --msa OG0000003.filter.fa --tree OG0000003.filter.fa.treefile --output uniques.fas ENV="DATA_FILE_PRINT_FORMAT=9"
Dear @SWei2333,
For pruning your dataset, please see this detailed comment by @spond regarding aBSREL performance and species selection.
Best, Steven
I am very sorry, I read that response, but I am not sure if we are addressing the same issue. I have 27 species, and I want to maintain this number. However, there are cases where certain species have multiple copies of a gene, leading to discrepancies between the gene tree and the species tree. Not all copies cluster on the same branch, and I am confused about which copy to retain in such cases. Do you have any suggestions? Is it possible to retain the copy that is consistent with the topology of the species tree?
Dear @SWei2333,
My personal suggestion would be to do one of two things:
1. Use all paralogs (gene copies) for every analysis and a gene tree. This approach retains all the data and won't give you headaches when genes/gene copies are not the same as the species tree. In general, unless you have a very good reason not to do that, I strongly suggest using gene trees in all cases. Making topology errors, i.e. forcing a species tree onto a gene that has a different history (for whatever reason), is going to yield biased results for that gene.
2. Use a consistent strategy not based on phylogenetic placement to pick one gene per species before you run any of the analyses, and then, if you really need to, use a species tree for all genes. If you build a tree and then use that tree to select a subset of the data for an analysis that relies on the same tree, you will be committing a cardinal statistical sin :) (cherrypicking etc).
HTH, Sergei
Dear Sergei Pond, Thank you for your advice, it's helpful. Regarding your suggestion, I have two small questions.
Best wishes Wei
Dear @SWei2333,
For (1), use all copies, build a separate tree for each gene, run HyPhy once.
For (2), I don't have a "universal" strategy to offer, because it depends on the data/system. You could pick them based on annotation/sequencing quality, e.g. transcript data, abundance, etc. You could choose to discard everything that is not species-level monophyletic (i.e. gene conversion or something like that). It really depends on the question. I would say if there's no substantive evidence that a particular paralog is functional, chuck it.
Best, Sergei
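For anyone trying option (2), here is a minimal Python sketch of one tree-independent selection rule: keep the longest ungapped copy per species. This is only an illustration, not a HyPhy feature and not a recommendation from the thread. Ungapped length is a crude stand-in for the annotation-quality criteria mentioned above, and the header layout `species|gene_id` (with `|` as separator) is an assumption; OrthoFinder headers depend on how the input proteomes were named, so `species_of` will likely need adjusting. The function names are hypothetical.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_of(header, sep="|"):
    """Assumed header layout: 'species<sep>gene_id'. Adjust to your data."""
    return header.split(sep, 1)[0]


def pick_one_per_species(records, sep="|"):
    """Keep one copy per species: the longest ungapped sequence.

    Ties are broken alphabetically by header so the choice is reproducible.
    The rule never looks at a tree, so it does not reuse phylogenetic
    placement to select the data.
    """
    by_species = defaultdict(list)
    for header, seq in records:
        by_species[species_of(header, sep)].append((header, seq))
    chosen = {}
    for species, copies in by_species.items():
        chosen[species] = min(
            copies,
            key=lambda rec: (-len(rec[1].replace("-", "")), rec[0]),
        )
    return chosen


if __name__ == "__main__":
    example = ">spA|g1\nATG---\n>spA|g2\nATGAAA\n>spB|g1\nATGCCC\n"
    for species, (header, seq) in sorted(pick_one_per_species(read_fasta(example)).items()):
        print(species, header, len(seq.replace("-", "")))
```

Whatever rule is used, it should be applied identically to every orthogroup before any selection analysis is run, and the tree supplied to HyPhy then has to contain exactly the retained sequences.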
Thank you very much for your advice! It's very helpful!
Best wishes
Hi,
I have read many related issues, but I'm still confused. From OrthoFinder, I can get a file including all OG.fasta. The single-copy orthologs are no problem, but for the multi-copy fasta files, should I remove duplicates (paralogs) using remove-duplicates.bf to get a unique sequence per species, or could I rename these paralogs and treat them as different species to analyze positive selection together?
Looking forward to your comment. Thanks very much!