veg / hyphy-analyses

HyPhy standalone analyses
MIT License
37 stars 17 forks source link

dN, dS values #13

Open kavithamallayaa opened 3 years ago

kavithamallayaa commented 3 years ago

Hi HyPhy team,

I have been using HyPhy analysis( RELAX, BUSTEC, BUSTED ) in my research to study molecular evolution for the past two years. I have some issues in comprehending my dN/dS values now, as I suspect there are more of synonymous substitutions that are not neutral. I think dN and dS values by itself could provide more information.

I would like to estimate the dN, dS, dN/dS with site-to-site synonymous rates implemented for four subsets across a phylogeny. If I can obtain these values from your program, it will provide a better understanding of molecular evolution across my phylogeny. I am happy to get into further discussion on this.

HyPhy team have always been supportive and I hope to obtain the same support in this issue also.

with regards Kavitha

spond commented 3 years ago

Dear @kavithamallayaa,

Can you be a bit more specific? Are you interested in site-level estimates, branch-level estimates, etc? For example, you can use RELAX in group mode to fit different dN/dS distributions to branch groups already, and a dS distribution across sites that is shared by all groups.

Best, Sergei

kavithamallayaa commented 3 years ago

Dear Sergei,

I am looking for branch-level estimates.

I think as dS is not neutral, different dS distribution to different branch groups will explain the data better.

Regards Kavitha

On Fri, 20 Nov 2020 at 01.15 Sergei Pond notifications@github.com wrote:

Dear @kavithamallayaa https://github.com/kavithamallayaa,

Can you be a bit more specific? Are you interested in site-level estimates, branch-level estimates, etc? For example, you can use RELAX in group mode to fit different dN/dS distributions to branch groups already, and a dS distribution across sites that is shared by all groups.

Best, Sergei

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/veg/hyphy-analyses/issues/13#issuecomment-730403445, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMANSDXCSLXCSPGLPFWP3KLSQUSAHANCNFSM4T24KNUQ .

spond commented 3 years ago

Dear @kavithamallayaa,

There are several things you could try.

  1. You could simply estimate the dS tree (synonymous divergence); this will give you a branch-level average effect (FitMG94 local option)

  2. You could fit a branch-site model of dS (using BUSTED with --srv Branch-site option), to allow dS to vary from branch to branch in an unrestricted fashion. This doesn't use group information, but rather evaluates how much variation there is. You could also then look where in the tree rates change (this is currently not implemented, but easy to add).

  3. You could consider a RELAX type test on synonymous rates to directly compare distributions of dS between branch groups. This does not currently exist as an implemented analysis, and would require a modification to RELAX.

  4. You could do a FEL type analysis to test which specific sites may have different synonymous rates across branch groups (like contrast-FEL but for synonymous rates). Also not implemented, would be a modification of Contrast-FEL.

Let me know if this makes sense.

Best, Sergei

kavithamallayaa commented 3 years ago

Dear Sergei,

I have some knowledge on your options 2 and 3. 1 and 4, I have not tried so I need to look into it.

Option 2: I have tried SRV from BUSTED and the results were promising. The issues are: I need this for 4 different branch groups but BUSTED allows only one sub-group and background. If I can get SRV for 3 groups and background( or include the unclassified option like RELAX), then it can help I think. I also need the dN values for this to support my explanation on BUSTED-dN/dS.

Option 3: RELAX based on dS sounds great and can add evidence to my expectations. I am excited to try that, if possible for 4 groups simultaneously. Previously I used the unclassified option to study 4 groups. Please include the unclassified option if 4 groups cannot be done simultaneously (as RELAX-mod did not feature this option).

I will study the other options( 1 and 4 ) and update you.

with regards Kavitha

On Fri, 20 Nov 2020 at 06.20 Sergei Pond notifications@github.com wrote:

Dear @kavithamallayaa https://github.com/kavithamallayaa,

There are several things you could try.

1.

You could simply estimate the dS tree (synonymous divergence); this will give you a branch-level average effect (FitMG94 local option) 2.

You could fit a branch-site model of dS (using BUSTED with --srv Branch-site option), to allow dS to vary from branch to branch in an unrestricted fashion. This doesn't use group information, but rather evaluates how much variation there is. You could also then look where in the tree rates change (this is currently not implemented, but easy to add). 3.

You could consider a RELAX type test on synonymous rates to directly compare distributions of dS between branch groups. This does not currently exist as an implemented analysis, and would require a modification to RELAX. 4.

You could do a FEL type analysis to test which specific sites may have different synonymous rates across branch groups (like contrast-FEL but for synonymous rates). Also not implemented, would be a modification of Contrast-FEL.

Let me know if this makes sense.

Best, Sergei

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/veg/hyphy-analyses/issues/13#issuecomment-730583936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMANSDRNDKLM5RYWGDCR4HDSQVVXDANCNFSM4T24KNUQ .

spond commented 3 years ago

Dear @kavithamallayaa,

Can you tell me just a little bit more about what type of test you are most interested in?

  1. Do you want to simply examine whether dS distributions of branch/site rates are different between groups?
  2. Do you wish to identify specific branches that have a particular dS and dN properties>
  3. Are you at all interested in patterns at specific sites?

BUSTED is not really designed to compare branches, but rather test whether or not branches have selection (ω>1 etc). RELAX is the most suitable tool for comparing selection among sets of branches. Presently it enforces a shared distribution of dS across all branches, but this could be modified to examine separate distributions of dS and dN. I am always interested in biological problems that motivate these types of analyses. I'll be happy to modify RELAX to do this type of testing, but I'd like to know a bit more to make sure I am doing the right thing.

Best, Sergei

kavithamallayaa commented 3 years ago

Dear Sergei,

I need to estimate the dS, dN and dN/dS for different groups across a phylogeny. I was using RELAX analysis, it worked well for my previous dataset, but for my current data it is difficult to comprehend the results. My thought was that dS sites where the issues as they were not neutral. I read about your site-to-site synonymous variation and tried it through BUSTED. The results were promising but cannot provide evidence as these tests are modelled for a different purpose.

I think RELAX with separate distributions of dN and dS is an awesome way and can support the biology behind the data.

Implementing site-to-site variation is great stuff and could provide better understanding of classic evolutionary theories of molecular evolution.

I think it is more appropriate to have more detailed discussion through emails.

Looking forward to further discussion.

regards Kavitha

On Sat, Nov 21, 2020 at 1:09 AM Sergei Pond notifications@github.com wrote:

Dear @kavithamallayaa https://github.com/kavithamallayaa,

Can you tell me just a little bit more about what type of test you are most interested in?

  1. Do you want to simply examine whether dS distributions of branch/site rates are different between groups?
  2. Do you wish to identify specific branches that have a particular dS and dN properties>
  3. Are you at all interested in patterns at specific sites?

BUSTED is not really designed to compare branches, but rather test whether or not branches have selection (ω>1 etc). RELAX is the most suitable tool for comparing selection among sets of branches. Presently it enforces a shared distribution of dS across all branches, but this could be modified to examine separate distributions of dS and dN. I am always interested in biological problems that motivate these types of analyses. I'll be happy to modify RELAX to do this type of testing, but I'd like to know a bit more to make sure I am doing the right thing.

Best, Sergei

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/veg/hyphy-analyses/issues/13#issuecomment-731191407, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMANSDX3X3ENBDJISQG7ROTSQZ2BTANCNFSM4T24KNUQ .

spond commented 3 years ago

Dear @kavithamallayaa,

We can definitely continue by e-mail. In the meantime, I suggest you try RELAX (with the recent release, 2.5.23) which includes SRV. See https://github.com/veg/hyphy-analyses/tree/master/SRV

If you suspect heterotachy, see if --srv Branch-site modifies RELAX results.

Best, Sergei

kavithamallayaa commented 3 years ago

I will definitely try that.

Can you please provide me your email address for further discussion.

Regards Kavitha

On Sat, 21 Nov 2020 at 07.52 Sergei Pond notifications@github.com wrote:

Dear @kavithamallayaa https://github.com/kavithamallayaa,

We can definitely continue by e-mail. In the meantime, I suggest you try RELAX (with the recent release, 2.5.23) which includes SRV. See https://github.com/veg/hyphy-analyses/tree/master/SRV

If you suspect heterotachy, see if --srv Branch-site modifies RELAX results.

Best, Sergei

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/veg/hyphy-analyses/issues/13#issuecomment-731401311, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMANSDWEM5KG2KJTFL6F5OLSQ3JIDANCNFSM4T24KNUQ .

spond commented 3 years ago

Dear @kavithamallayaa,

spond at temple dot edu

Best, Sergei

gwct commented 2 years ago

Hello, I have a similar request to the OP, but with a key difference: I would like to know the dN and dS values computed or averaged over all branches in my tree, per locus, rather than per branch (so one value per locus). I thought to use FitMG94, but I'm recalling now (and #17 confirms) that the basic model fit approximates dN/dS for a locus without directly computing dN and dS.

I think I could achieve what I'm after by using BUSTED with --srv Yes, setting all branches as test branches, and parsing the dN and dS values from the "Rate distributions" in the "fits" field. But I'm not sure this is a correct or the best solution to getting these values. Another option I can think of is to simply take the dN and dS values that are calculated per branch using FitMG94 with the --type local option.

Any feedback is appreciated. Thanks! -Gregg

spond commented 2 years ago

Dear @gwct,

By locus here, do you mean "gene" (or alignment)? Do you simply want to "unpack" dN/dS, i.e. get the global dN and global dS estimates?

Best, Sergei

gwct commented 2 years ago

Hi @spond, Yes, by locus in this case I mean an alignment for a single gene. And yes I basically just want to unpack the dN/dS estimate (e.g. from FitMG94) to a single dN and a single dS estimate per gene. I just wasn't sure if this is already computed somewhere or if the best thing to do would be to just average across all branches per gene. Thanks! -Gregg

spond commented 2 years ago

Dear @gwct,

Those would be best approximated by summing dS and dN estimated for each branch in the tree in FitMG94.bf output json. You can do with the a simple Python script

import json, sys

with open (sys.argv[1], 'r') as fh:
    results = json.load (fh)
    ES = 0
    EN = 0
    for b, bl in results["branch attributes"]["0"].items():
        ES += bl ["dS"]
        EN += bl ["dN"]

    print ("dS = %g, dN = %g" % ((ES, EN)))

or with a (jq)[https://stedolan.github.io/jq/] one liners

% jq '.["branch attributes"]["0"] | [.[]."dS"] | add' CD2.nex.FITTER.json
2.1055388373159993
 % jq '.["branch attributes"]["0"] | [.[]."dN"] | add' CD2.nex.FITTER.json
1.620885081098434

Best, Sergei

gwct commented 2 years ago

Ok, got it. I thought I could do that but just wanted to be sure it wasn't already computed somewhere.

I notice you say sum the values. Should they not also be divided by the number of branches to get an average, or am I missing something? Thanks! -Gregg

spond commented 2 years ago

Dear @gwct,

There's no average to compute here. The quantity is aggregate over the entire tree, defined as

dS = expected number of synonymous substitutions / expected synonymous sites
dN = expected number of non-synonymous substitutions / expected non-synonymous sites

The numerators are totals over the tree. The denominators do not depend on branches (they are a function of base composition).

Best, Sergei

gwct commented 2 years ago

Right, that makes sense of course, since all the branches have the same denominator. So from the FitMG94.bf output, for dN and dS I'm taking the sum of dN and dS for each branch like you said, and for dN/dS I'm just using the value estimated in the json["fits"]["Standard MG94"]["Rate Distributions"]["non-synonymous/synonymous rate ratio"] field. Thanks again! -Gregg

spond commented 2 years ago

Dear @gwct,

Correct; but the two dN/dS values will not be the same. The directly estimated one (form the json) is the one to use.

Best, Sergei