veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

Visualization of aBSREL results #1676

Closed WRran-Hydrogen closed 5 months ago

WRran-Hydrogen commented 6 months ago

Dear @spond ,

Thanks for your previous help, I got the results for the aBSREL model. but strangely it does not have the select pressure results for the site, probably because of the larger dataset. After that, I want to visualize the results with ggtree in R. I would like to ask hyphy if they have a function to convert NEXUS files from JSON? (For convenience I have attached my results file.🫡)

Best, Yao out.json

spond commented 6 months ago

Dear @WRran-Hydrogen,

Last month I updated aBSREL to version 2.5, which includes a lot of detailed information (see http://vision.hyphy.org/absrel), including at site-level, e.g.

image

Regarding JSON, I generally recommend using jq, which is a superb utility for extracting and processing these types of files. Can you be more specific about the desired export content/format?

Best, Sergei

WRran-Hydrogen commented 6 months ago

Dear @spond ,

OK, I just want to get this tree file (newick file). image After that I can use tqor JSON'spython module to extract (thanks to your recommendation of tq) the corresponding information for the tree annotation. One question is that I found the tree file in the jsonfile, but it doesn't have the information about the branch lengths, I'm curious how you mapped the tree file to the Nucleotide GTR. image

Another small question is that 703 alignments containing 1309 codonsite took a month to run hyphy on 32 core compute node, this is too high time consuming, do you have any good idea?

spond commented 6 months ago

Dear @WRran-Hydrogen,

The JSON files for aBSREL store branch length information in a separate object, since there's more than one model which can be used to determine branch lengths.

Specifically, you will find them in the "branch attributes" object with keys corresponding to the names in the tree topology.

$jq '.["branch attributes"]' tests/data/bglobin.nex.ABSREL.json

{
  "0": {
    "BUSHBABY": {
      "Baseline MG94xREV": 0.05190629582419842,
      "Baseline MG94xREV omega ratio": 0.4276372944939755,
      "Corrected P-value": 1,
      "Full adaptive model": 0.06425072518686313,
      "Full adaptive model (non-synonymous subs/site)": 0.03732139152551418,
      "Full adaptive model (synonymous subs/site)": 0.02692933366134916,
      "LRT": 0.5206747157799327,
      "Nucleotide GTR": 0.04517814106887804,
      "Rate Distributions": [
        [
          0.1757084634912426,
          0.9788521349090655
        ],
        [
          11.52862779940965,
          0.02114786509093447
        ]
      ],
      "Rate classes": 2,
      "Uncorrected P-value": 0.3253482546661789,
      "original name": "BUSHBABY",
      "posterior": [
        [
          0.9924805556484324,
          0.9958656268001885,
          0.7578008011181845,
          0.8240155208190809,

or

 jq -r '.["branch attributes"]["0"] | to_entries| map ([.key,.value["Nucleotide GTR"]]) [] | @csv  ' tests/data/bglobin.nex.ABSREL.json  

"BUSHBABY",0.04517814106887804
"CHICKEN",0.02523869181803026
"COW",0.02091027373492859
"DUCK",0.0438131515651915
"ELEPHSEAL",0.06849401632550524
"HAMSTER",0.09971426134210891
"HARE",0.005990205406934171
"HUMAN",0.0552201187847943
"MARSUPIAL",0.1948510592760009
"MOUSE",0.05851217515600016
"Node1",0.03043291644580744
"Node10",0.01072910735191889
"Node12",0.05315275138034392
"Node15",0.01631764913672979
"Node18",0.07053120745486645
"Node2",0.01641731439369391
"Node20",0.03005650409217756
"Node23",0.07795825025286905
"Node25",0.03555698487352726
"Node26",0.1442280763306011
"Node29",0.2891510149729472
"Node4",0.01293281296355458
"Node6",0.04462090787982554
"Node9",0.01980447658515092
"PIG",0.09712620791989478
"RABBIT",0.02253173750743446
"RAT",0.04117743614339072
"SHEEP",0.03036672099624838
"TARSIER",0.06592118593336037
"XENLAEV",0.1227158754464014
"XENTROP",0.1925856260944267

Best, Sergei

spond commented 6 months ago

Dear @WRran-Hydrogen,

Regarding long run times. Are you saying that a single gene took a month to run? That's unusual. Can you please share the input data and the command you've used to run aBSREL?

Best, Sergei

WRran-Hydrogen commented 6 months ago

Dear @spond,

Command: hyphy absrel --alignment Domain.msa --tree maxcc.contree --output out.json ENV="TOLERATE_NUMERICAL_ERRORS=1;" Tree file: maxcc_contree.txt Alignment file: Domain.txt Output file: out.json Version: HYPHY 2.5.57(MP) for Linux on x86_64 Appreciate your help!

spond commented 5 months ago

Dear @WRran-Hydrogen,

Sorry for the delay in responding. The run time (on my M1 max MacBook) is ~1 day (that's 4-6 cores).

However, there seem to be some issues with the alignment/analysis, which is probably due to the level of sequence divergence.

  1. The nucleotide model returns branch lengths of >400 subs/site on the entire tree. That's effectively saturated/non-homologous, so that downstream analyses will either struggle to converge or won't make much sense.

    * 1 partition. Total tree length by partition (subs/site) 422.232
  2. During the fitting of individual branches, the longest ones are done first and many of those are effectively infitine as well.

              Branch               |  Length  |  Rates   |     Max. dN/dS     |    Log(L)     |     AIC-c     |Best AIC-c so far|
|-----------------------------------|----------|----------|--------------------|---------------|---------------|-----------------|
|IMGVR_UViG_3300024548_000656_330...|1463.01...|    2     |    0.11 (86.69%)   |  -620953.97   |  1247191.07   |   1247191.07    |
|IMGVR_UViG_3300024548_000656_330...|1463.01...|    3     |    0.67 (77.93%)   |  -620953.33   |  1247193.80   |   1247191.07    |
|IMGVR_UViG_3300005281_000426_330...|1462.71...|    2     |    0.08 (78.74%)   |  -620946.70   |  1247180.54   |   1247180.54    |
|IMGVR_UViG_3300005281_000426_330...|1462.71...|    3     |    0.08 (78.78%)   |  -620946.70   |  1247184.57   |   1247180.54    |
|IMGVR_UViG_3300002460_002005_330...|1462.69...|    2     |    0.23 (72.88%)   |  -620940.85   |  1247172.87   |   1247172.87    |
|IMGVR_UViG_3300002460_002005_330...|1462.69...|    3     |    0.25 (72.88%)   |  -620940.85   |  1247176.90   |   1247172.87    |
|IMGVR_UViG_3300003401_000205_330...|1462.68...|    2     |    0.06 (84.64%)   |  -620935.49   |  1247166.19   |   1247166.19    |
|IMGVR_UViG_3300003401_000205_330...|1462.68...|    3     |    0.06 (84.64%)   |  -620935.49   |  1247170.21   |   1247166.19    |
|              Node659              |1462.66...|    2     |    0.06 (83.35%)   |  -620929.65   |  1247158.52   |   1247158.52    |
|              Node659              |1462.66...|    3     |    0.26 (62.06%)   |  -620928.94   |  1247161.13   |   1247158.52    |
|IMGVR_UViG_3300006025_000592_330...|1462.61...|    2     |    0.07 (78.48%)   |  -620919.15   |  1247141.55   |   1247141.55    |
|IMGVR_UViG_3300006025_000592_330...|1462.61...|    3     |    0.08 (72.01%)   |  -620919.06   |  1247145.38   |   1247141.55    |
|IMGVR_UViG_3300018950_000246_330...|1462.60...|    2     |    0.06 (79.02%)   |  -620914.64   |  1247136.54   |   1247136.54    |
|IMGVR_UViG_3300018950_000246_330...|1462.60...|    3     |    0.18 (45.69%)   |  -620914.43   |  1247140.15   |   1247136.54    |
|IMGVR_UViG_GVMAG_S_3300000558_93...|1462.59...|    2     |    0.06 (80.40%)   |  -620909.06   |  1247129.42   |   1247129.42    |
|IMGVR_UViG_GVMAG_S_3300000558_93...|1462.59...|    3     |    0.06 (80.37%)   |  -620909.06   |  1247133.44   |   1247129.42    |
|IMGVR_UViG_3300017288_000012_330...|1462.58...|    2     |    0.05 (81.10%)   |  -620901.03   |  1247117.37   |   1247117.37    |
|IMGVR_UViG_3300017288_000012_330...|1462.58...|    3     |    0.07 (73.06%)   |  -620900.93   |  1247121.19   |   1247117.37    |
|IMGVR_UViG_3300009173_001818_330...|1462.56...|    2     |    1.00 (84.66%)   |  -620895.92   |  1247111.18   |   1247111.18    |
|IMGVR_UViG_3300009173_001818_330...|1462.56...|    3     |   28.04 (73.80%)   |  -620893.15   |  1247109.65   |   1247109.65    |
|IMGVR_UViG_3300009173_001818_330...|1462.56...|    4     |   21.78 (73.91%)   |  -620893.14   |  1247113.66   |   1247109.65    |
|IMGVR_UViG_GVMAG_M_3300017724_5_...|1462.54...|    2     |    0.09 (73.10%)   |  -620871.10   |  1247069.59   |   1247069.59    |

I would suggest that you take another look at the sequences, especially those branches and clades which have infinite branch lengths (see a part of the tree where such branches are apparent below)

image

Best, Sergei

WRran-Hydrogen commented 5 months ago

Dear @spond,

Sorry for the delay in my reply. Thanks for your patient assistance, I recheck my data and realized that the quality of the alignment is too poor due to the high diversity of the sequence (3+ subs./site). There does not seem to be a good way to deal with this problem at this moment, maybe I need to cluster (hclustor mmseqs2) the data before considering selection pressure or foucus only on the upstream and downstream sequences of conserved motifs. Mang thanks again for your help and keep in touch🫡!

Cheers, Yao