veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

Applying partitions from GARD to only some genes in alignment? #1671

Closed katherinermartin closed 4 months ago

katherinermartin commented 6 months ago

I have a gene that has been translocated across chromosomes. I would like to see if the gene copies on the different chromosomes (let's call them A and B) have experienced different rates of non-synonymous substitution (e.g., contrast FEL, aBSREL). However, I don't expect that A and B have evolved in the same way, so I ran GARD separately on alleles from copy A and those from copy B. I find that there is no significant evidence for recombination for A, but there is a recombination breakpoint for B (at 117 bp).

I understand how to input the partitioned tree file into the HyPhy tests to account for recombination in B ("screen_data" files from GARD). But I would like to run tests on HyPhy that include both A and B in the same phylogeny. Is there a file type or format to provide HyPhy with a tree file that includes both A and B, and also partitions only B? And even if this is possible, is it violating an assumption of the models run by HyPhy?

For further context: I did indeed try running GARD on a combined dataset of both A and B alleles, and GARD found a recombination breakpoint at 185 bp. So I am trying to avoid using the combined dataset, since I don't feel the recombination breakpoint at 185 bp reflects an actual recombination event, as it is not consistent with the results of GARD when considering A and B separately (no breakpoints, and one breakpoint at 117 bp, respectively)

Thanks for any insight you can provide!

spond commented 6 months ago

Dear @katherinermartin,

How strong are GARD results (what's the ΔAIC, and do you get different trees if you infer them from different partitions)? The short fragment (117bp) gives me pause; I'd encourage you to make sure the result is robust. Did you run GARD with site-to-site rate variation enabled?

As far as the combined analysis goes, the easiest way to do it is to use the combined tree. A and B should be reciprocally monophyletic (based on your description, but I could be quite wrong as well), so all you are really doing is stitching the trees together. It would be a good test of how robust the results are to infer separate trees from a joint A+B alignment using the GARD breakpoint at ~117bp and see if

1). The A part is stable, and separate from B. 2). The B part still looks like there's evidence of recombination.

Best, Sergei

PS How many sequences are you analyzing, and, roughly, what is the divergence level?

katherinermartin commented 6 months ago

Hi Sergei,

Thanks so much for the prompt response!

How strong are GARD results (what's the ΔAIC, and do you get different trees if you infer them from different partitions)? The short fragment (117bp) gives me pause; I'd encourage you to make sure the result is robust. Did you run GARD with site-to-site rate variation enabled?

For each gene copy, I ran 2- and 3-rate class and evaluated with no site to site variation, site-to-site with a general discrete distribution, and site-to-site with a beta gamma distribution. The results were consistent per gene copy across these different parameters (summarized in table below). Let me know if you'd like to see any of the .json files for these results.

Screen Shot 2023-12-12 at 12 33 20 PM

PS How many sequences are you analyzing, and, roughly, what is the divergence level?

For gene copy B, I originally had 294 alleles but used tn93-cluster to reduce it to a number that GARD would run (in this case, -t = 0.17 to drop it to 22 alleles). I did not have to reduce the number for gene copy A and ran with all 14 alleles I recovered. Of note: gene copy A is 289 bp and gene copy B is 167 (will have to be shortened to 288 bp and 165 bp respectively, for HyPhy analyses). The two copies have good homology and alleles from either copy have percent identity between 50-70%. These are immune genes, so they are fairly rapidly evolving in some cases.

As far as the combined analysis goes, the easiest way to do it is to use the combined tree. A and B should be reciprocally monophyletic (based on your description, but I could be quite wrong as well), so all you are really doing is stitching the trees together.

You are correct, they are reciprocally monophyletic (just to be clear that I'm understanding: they produce their own, well-supported monophyletic clades when I run a bayesian phylogeny)

It would be a good test of how robust the results are to infer separate trees from a joint A+B alignment using the GARD breakpoint at ~117bp and see if

1). The A part is stable, and separate from B. 2). The B part still looks like there's evidence of recombination.

I might be misunderstanding; should I run two separate trees, one based on the first partition, and the other based on the second partition after breakpoint, and compare the tree topologies? Or do you mean running GARD with A and B combined? I reran it just now with 11 alleles from gene copy A and 11 from gene copy B (I find if I run more than ~20 alleles, GARD throws an error and so I use tn 93 to reduce the number of alleles).

I found three different possible breakpoints, summarized below (none of which are the ~117 bp breakpoint I found from just running gene copy B by itself). I can send along the .json files, but for all, the A and B gene copies still form monophyletic groups, but within them, the relationships are not stable: for gene copy A, it's hard to tell since the branches are so short or it could be a polytomy, but it's clear for gene copy B that the trees are not stable between the fragments.

Screen Shot 2023-12-13 at 12 31 15 PM

Edited to update the final table for clarity

spond commented 6 months ago

Dear @katherinermartin,

May I ask why copy B on its own has 167 sites but the joint alignment has 291? The location of the breakpoint may have moved with the alignment, since it looks like copy A may be longer? With that in mind, the breakpoints could actually be consistent.

I'd say you do have pretty good evidence for recombination; my one concern for downstream analyses is that these segments are so short (only ~50 bp in copy B alignment).

For your downstream analyses the easiest thing to do is just take the trees from the A+B joint analyses (assuming the recombination breakpoint is "stable" after MSA coordinate corrections).

Best, Sergei

PS I noticed you used tn93 compression -- very nice!

katherinermartin commented 6 months ago

May I ask why copy B on its own has 167 sites but the joint alignment has 291? The location of the breakpoint may have moved with the alignment, since it looks like copy A may be longer?

Soo that's where this project gets fun 😅 They're the same gene, but the primers for copy B amplify a shorter fragment of the gene, whereas copy A primers target a larger fragment of 289 bp. The alignment is 291 bp (and not 289 bp) because of a 3 bp indel in copy A.

To add to the fun, copy A appears to only target a single gene copy, since we only recover a maximum of two alleles per individual (across nearly 400 individuals), whereas what I'm calling gene copy B is likely multiple gene copies (the gene is a highly duplicated gene family)

With that in mind, the breakpoints could actually be consistent.

I just checked and you're right! The 183 bp breakpoint identified in the joint A+B analysis under the 3 rate beta gamma dist and 2 rate general discrete corresponds to original base 114 bp in copy B. When I run GARD on copy B on its own, I got a consistent breakpoint at 107 bp; do you think that 107 and 114 seem close enough to be considered a stable breakpoint?

I'd say you do have pretty good evidence for recombination; my one concern for downstream analyses is that these segments are so short (only ~50 bp in copy B alignment).

I hear you. The way we generally approach selection analyses is to perform several site-based analyses, and then only consider the sites identified in at least two out of four tests we run (FEL, FUBAR, MEME, SLAC) as having evidence of positive selection (and I'll likely run contrast FEL on this joint alignment too to see if rates differ between gene copies). Assuming that the short fragment will affect or violate the assumptions of each test in a different way, perhaps I can still be confident in any site identified across tests?

For your downstream analyses the easiest thing to do is just take the trees from the A+B joint analyses (assuming the recombination breakpoint is "stable" after MSA coordinate corrections).

Will do; thank you so so much Sergei, you've been a huge help!

PS I noticed you used tn93 compression -- very nice!

Thank you--you suggested it to me about a year and a half ago, and it's been a life saver with these gene families with high allelic polymorphism!! 😄

github-actions[bot] commented 4 months ago

Stale issue message