veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

How to process the results files from OrthoFinder into input files for HyPhy ABSREL? #1723

Open SWei2333 opened 1 month ago

SWei2333 commented 1 month ago

Hi,

I have read many related issues, but I'm still confused. From OrthoFinder, I get a FASTA file for every orthogroup (OG). The single-copy orthologs are no problem, but for the multi-copy FASTA files, should I remove duplicates (paralogs) using remove-duplicates.bf to get a unique sequence per species, or could I rename these paralogs and treat them as different species to analyze positive selection together?

Looking forward to your comment. Thanks very much!

stevenweaver commented 1 month ago

Dear @SWei2333,

Thank you for your inquiry and use of HyPhy.

When dealing with multi-copy orthologs from OrthoFinder:

  1. For aBSREL Analysis: Paralogs only marginally affect the analysis, so you can include multiple paralogs per species without removing them, at the cost of some computational overhead. Please see this comment by @spond for the reasoning.
  2. Removing Duplicates: You can use remove-duplicates.bf to remove identical sequences. This step keeps one representative of each set of identical sequences, selected randomly.

Please refer to this thread for additional dialogue on this question.

Best, Steven

SWei2333 commented 1 month ago

Thank you for your reply, I'll try it!

SWei2333 commented 1 month ago

Dear Steven,

I tested an orthogroup (OG) containing 607 sequences from only 27 species. After running the remove-duplicates.bf script, 599 sequences remained, which is still too many for 27 species. Could you provide some advice?

Here is the command I used:

/data/software/hyphy-2.5.62/hyphy /data/software/hyphy-analyses/remove-duplicates/remove-duplicates.bf --msa OG0000003.filter.fa --tree OG0000003.filter.fa.treefile --output uniques.fas ENV="DATA_FILE_PRINT_FORMAT=9"
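Since remove-duplicates.bf is described above as removing identical sequences, paralogs that differ by even one site stay in the alignment. Tallying copies per species shows where the extra sequences come from. This is a minimal sketch, assuming each FASTA header starts with a species label followed by `|` (a hypothetical naming scheme; adjust `sep` to match your own headers). The function names are illustrative and not part of HyPhy or OrthoFinder.

```python
from collections import Counter


def species_of(header, sep="|"):
    """Return the species label from a FASTA header.

    Assumes headers look like '>species|gene_id' (an assumption for this sketch).
    """
    return header.lstrip(">").strip().split(sep, 1)[0]


def copies_per_species(fasta_lines, sep="|"):
    """Count how many sequences each species contributes to an alignment."""
    return Counter(
        species_of(line, sep) for line in fasta_lines if line.startswith(">")
    )


# Toy input; for a real file, pass an open file handle instead of this list.
example = [">spA|g1", "ATGAAA", ">spA|g2", "ATGAAG", ">spB|g1", "ATGAAA"]
counts = copies_per_species(example)
for species, n in sorted(counts.items()):
    print(species, n)
```

Species with a count above 1 are the ones carrying paralogs in that orthogroup.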

stevenweaver commented 1 month ago

Dear @SWei2333,

For pruning your dataset, please see this detailed comment by @spond regarding aBSREL performance and species selection.

Best, Steven

SWei2333 commented 1 month ago

I am very sorry, I read that response, but I am not sure if we are addressing the same issue. I have 27 species, and I want to maintain this number. However, there are cases where certain species have multiple copies of a gene, leading to discrepancies between the gene tree and the species tree. Not all copies cluster on the same branch, and I am confused about which copy to retain in such cases. Do you have any suggestions? Is it possible to retain the copy that is consistent with the topology of the species tree?

spond commented 1 month ago

Dear @SWei2333,

My personal suggestion would be to do one of two things:

  1. Use all paralogs (gene copies) for every analysis and a gene tree. This approach retains all the data and won't give you headaches when genes/gene copies are not the same as the species tree. In general, unless you have a very good reason not to do that, I strongly suggest using gene trees in all cases. Making topology errors, i.e. forcing a species tree onto a gene that has a different history (for whatever reason), is going to yield biased results for that gene.

  2. Use a consistent strategy not based on phylogenetic placement to pick one gene per species before you run any of the analyses, and then, if you really need to, use a species tree for all genes. If you build a tree and then use that tree to select a subset of the data for an analysis that relies on the same tree, you will be committing a cardinal statistical sin :) (cherrypicking etc).

HTH, Sergei

SWei2333 commented 1 month ago

Dear Sergei Pond, Thank you for your advice, it's helpful. Regarding your suggestion, I have two small questions.

  1. The first suggestion: If I retain all the gene copies and use the gene tree for analysis, should I realign and rebuild the tree with different copies and other orthologous genes, running HyPhy multiple times, or should I rename them, put them in the same gene set, and run HyPhy once?
  2. The second suggestion: You mentioned "a consistent strategy not based on phylogenetic placement." Could you provide an example?

Best wishes, Wei


spond commented 1 month ago

Dear @SWei2333,

For (1), use all copies, build a separate tree for each gene, run HyPhy once.

For (2), I don't have a "universal" strategy to offer, because it depends on the data/system. You could pick them based on annotation/sequencing quality, e.g. transcript data, abundance, etc. You could choose to discard everything that is not species-level monophyletic (i.e. gene conversion or something like that). It really depends on the question. I would say if there's no substantive evidence that a particular paralog is functional, chuck it.

Best, Sergei
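As an illustration of option (2), here is a minimal sketch of one selection rule that never looks at the tree: keep, for each species, the copy with the most non-gap characters. This criterion is only a stand-in for the quality-based choices mentioned above (annotation, transcript support, and so on), not a recommendation from the thread. The `species|gene_id` header format and both function names are assumptions made for the example.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pick_one_per_species(records, sep="|"):
    """Keep, per species, the record with the most non-gap characters.

    Ties keep the record seen first, so the result is deterministic.
    Assumes headers look like 'species|gene_id' (hypothetical format).
    """
    best = {}
    for header, seq in records:
        species = header.split(sep, 1)[0]
        score = len(seq.replace("-", ""))
        if species not in best or score > best[species][0]:
            best[species] = (score, header, seq)
    return [(h, s) for _, h, s in best.values()]


# Toy input; for a real file, pass an open file handle to read_fasta.
example = [">spA|g1", "ATG---", ">spA|g2", "ATGAAG", ">spB|g1", "ATGAAA"]
chosen = pick_one_per_species(read_fasta(example))
for header, seq in chosen:
    print(">" + header)
    print(seq)
```

Because the rule is applied before any tree is built, it avoids selecting data with the same tree that the analysis later relies on. Any tree used afterwards has to be built from, or pruned to, the retained sequences.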

SWei2333 commented 1 month ago

Thank you very much for your advice! It's very helpful!

Best wishes
