veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
205 stars 69 forks source link

dataset with mixed code tables #1530

Closed dongzhang0725 closed 1 year ago

dongzhang0725 commented 1 year ago

Dear developers,

I am analyzing a Bilateria mitogenome dataset with multiple generic code tables (code tables 2, 5 and 9), but it appears that Hyphy only allows a single code table. I wonder if there is a workaround this problem, i.e. if it would be possible to run hyphy with a dataset partitioned according to different coding tables?

Sincerely,

Dong

spond commented 1 year ago

Dear @dongzhang0725,

Do you have different coding tables in the same alignment, or do you have several alignments with different coding tables? In the first case, you have a situation where the state space of the evolutionary process (# of stop codons, and codon=>amino acid table) changes. This cannot be handled by standard evolutionary models, and would require developing something entirely new. For example, if you have the following clade in the tree ((A,B),(C,D)), and A,B use one coding table and C,D use another -- what do you model their ancestor as, and where along the branch does the usage switch?

Can you possibly elaborate on your research question and why does the situation with multiple codes come up?

Best, Sergei

dongzhang0725 commented 1 year ago

Dear Sergei,

Thank you for your quick reply. My dataset matches the first case. There are 20-something phyla in our bilaterian dataset (a single alignment used to reconstruct the phloygeny). However, different lineages employ different codes; for example, Platyhelminthes employ code table 9, whereas Arthropoda, Gastrotricha, Annelida employ code table 5, and vertebrates employ code table 2. We wish to compare selection pressures across different phyla. We tried to resolve the problem by splitting dataset along the code lines into three subdatasets. According to your response, currently this is the only option that we have at disposal.
To answer your question, the demarkation lines between different codes are generally well known, so deifning the dataset in these terms should not be too difficult.

Best wishes, Dong

spond commented 1 year ago

Dear @dongzhang0725,

I would suggest you split the datasets by genetic code and do a joint analysis on them; combining them in a single tree (if the clades are fully resolved, as they should be, intuitively), creates a lot of headaches, and does not gain that much. Can you tell me which of the selection analyses you want to run, and I can give you suggestions on how to proceed. Unfortunately, there is no "easy out of the box solution"

Best, Sergei

dongzhang0725 commented 1 year ago

Dear Sergei

We want to test the hypothesis of relaxation of selection pressures in selected test branches against the reference branches. for this we aimed to use RELAX. We also wish to test the hypothesis of intensified directional (adaptive) evolution in selected test branches agianst the backdrop of reference branches. For this we aimed to use aBSREL.

Sincerely,

Dong

dongzhang0725 commented 1 year ago

Dear Sergei

We want to test the hypothesis of relaxation of selection pressures in selected test branches against the reference branches. for this we aimed to use RELAX. We also wish to test the hypothesis of intensified directional (adaptive) evolution in selected test branches agianst the backdrop of reference branches. For this we aimed to use aBSREL.

Sincerely,

Dong

spond commented 1 year ago

Dear @dongzhang0725,

Great, this makes sense. Would you be able to send me the picture of a tree showing your Test/Reference branch sets and also which of the taxa use different codon tables? I want to better understand the data dependancies, so that I can think of the simplest way to craft a RELAX modification.

For aBSREL you should be able to run it on each of the sets of sequences (all using the same genetic code) separately.

Best, Sergei

dongzhang0725 commented 1 year ago

Dear Sergei,

We are studying the correlation between parasitism and molecular evolution rates. Parasitism has multiple independent origins within the bilateria. In our dataset, Acanthocephala are 100% parasitic, whereas Nematoda, Arthropoda and Platyhelminthes have interspersed parasitic and free-living lineages. We are testing the hypotheses that purifying selection pressures are relaxed in parasitic lineages vs. free-living lineages. The second hypothesis is that adaptive selection levels are higher in parasitic lineages. As we cannot test them on the overall dataset due to it comprising different code tables, we are trying to test it on several different subdatasets. 2G%6`JEURA8W5A3RQ)@MOFQ See above figure for details. As the dataset contains too many species (around 11000), I only upload phylum level tree, with the leaf name in the format of "phylum|code table|life history". In life history, F is free-living, EndoP is endoparasitic, EctoP is ectoparasitic, and MP is micropredatory. The number near life history is the corresponding number of species.

Sincerely,

Dong

spond commented 1 year ago

Dear @dongzhang0725,

Quite an interesting problem. Since you have a very deep deep (11,000!), I would suggest you do subtree analysis. Here's why.

  1. Computational performance: both aBSREL and RELAX runtimes will grow (roughly) non-linearly, so say a 500-sequence data set will run ~100 slower than a 50-sequence dataset.
  2. Statistical performance: both aBSREL and RELAX will have degraded performance over very large trees -- aBSREL uses a per-branch multiple testing correction, so running over >10,000 branches will likely result in near complete loss of power -- RELAX will probably over smooth (a single distribution of dN/dS over large tree will be an underfit).
  3. aBSREL for a given branch is not going to draw much information from very distant branches in the tree.

I personally would do the following at first

  1. Pick a clade where you have phenotype variation where there is a SINGLE codon table (e.g. Nematoda)
  2. Annotate the phenotypes using something like LabelTrees
  3. Run the following analyses -- aBSREL and RELAXscan to explore selection profiles. This will give you counts of branches by phenotype that are selected -- Run a group mode RELAX (since you have THREE groups here). This will tell you if you have relaxation, intensification of one group relative to another -- Run BUSTED-PH to see if there is an effect of phenotype on positive selection -- Run Contrast-FEL to identify individual sites, where there's discordant selective pressure by phenotype

Best, Sergei

dongzhang0725 commented 1 year ago

Dear Sergei,

Thank you for your useful suggestions, currently we are running a subset dataset (around 250 species) which was selected to only contain all parasites and their closed-related free living species under a single codon table (5). However, we encounterred an error in one of our RELAX runs, both using the old (2.5.14(MP)) and the new version (2.5.42(MPI)) of hyphy:

### Fitting the alternative model to test K != 1
Error:

Master node received an error:Internal error 

See attachment log file for details. relax_code5_EctoP_F_represent.log Can you infer what causes the error in this analysis?

Sincerely,

Dong

dongzhang0725 commented 1 year ago

Dear @spond ,

We are trying to identify signatures of positive selection in TEST vs. BACKGROUND branches using aBSREL method. However, the output does not seem to provide test vs. background results, but merely reports branches and nodes for which there is evidence of a significant positive selection. Is there a way to test the hypothesis that the signal for positive selection is significantly higher in the foreground (test) group branches vs. the background branches, comparative to the one that RELAX analysis allows? The running code of aBSREL is: mpirun -n 30 HYPHYMPI aBSREL --alignment concatenation.fas --code Invertebrate-mtDNA --tree tree.tre --test Foreground

Sincerely, Dong

spond commented 1 year ago

Dear @dongzhang0725,

If you are interested in comparing selective pressures between groups of branches at the level of the entire alignment, consider BUSTED-PH.

If you are interested in comparing selective pressures between groups of branches at the level of individual sites, consider Contrast-FEL.

Best, Sergei

dongzhang0725 commented 1 year ago

Dear @spond ,

Thank you for your reply.

I encounterred the same error that I posted before when using a new dataset.

Error:

Master node received an error:Internal error 
Internal error in ComputeBranchCache (branch xOmDalqP.tree_id_0.Node90, eval #1066534 ) reversible model cached likelihood =   -559234.7103695555, directly computed likelihood =   -559233.3601133478. This is most likely because a non-reversible model was incorrectly auto-detected (or specified by the model file in environment variables; for smaller errors, this could be due to numerical instability of calculations for larger alignments).

Function call stack
1 :  [namespace = xOmDalqP] Optimize(mles, likelihoodFunction);
-------
2 :  relax.alternative_model.fit=estimators.FitLF(relax.filter_names,relax.trees,{"0":relax.model_map},relax.general_descriptive.fit,relax.model_object_map,{terms.run_options.retain_lf_object:TRUE,terms.run_options.optimization_log:relax.optimization_log_file("MainALT-log.json")});
-------
3 :  relax.FitMainTestPair();
-------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[coevo:234303] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[coevo:234303] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

relax_code9_ectop_F_all_flatworm_group.log

Can you infer what causes the error in this analysis?

Best wishes,

Dong

spond commented 1 year ago

Dear @dongzhang0725,

Unfortunately, these internal errors are impossible to diagnose without running the data and recreating the fault condition. Would you mind sharing the file with me (privately by e-mail spond at temple dot edu)?

Whenever a user comes across an internal error, I really like to get to the bottom of the issue and add a patch to prevent it from happening.

Best, Sergei

spond commented 1 year ago

Dear @dongzhang0725,

There is indeed an issue with this dataset (some calculations were making larger than expected numerical errors), but you should be able to run it to completion (I was) using the following additional command line argument ENV="TOLERATE_NUMERICAL_ERRORS=1;"

hyphy relax --alignment concatenation.fas --tree tree_marked.tre --starting-points 5 --code Echinoderm-mtDNA  --test Foreground ENV="TOLERATE_NUMERICAL_ERRORS=1;" 

I should also remark that relax does not benefit from MPI (only multi-core), so you need not run it in an MPI environment (use hyphy CPU=N to ask HyPhy to use N cores instead; by default it will try to use all available cores.

Best, Sergei

dongzhang0725 commented 1 year ago

Dear Sergei,

Thank you for your suggestion, I will try it. By the way, does BUSTED method benifit from MPI?

Sincerely, Dong

dongzhang0725 commented 1 year ago

Dear @spond ,

I tried your code (hyphy CPU=30 RELAX --alignment concatenation.fas --code Echinoderm-mtDNA --tree tree_marked.tre --test Foreground --starting-points 5 ENV="TOLERATE_NUMERICAL_ERRORS=1;") on the "relax_run1_internal_error" dataset that I sent to you via email before, but I still got an internal error. Abeit, this time with a different error content::

### Fitting the alternative model to test K != 1
Error:
Internal error 
Internal error in _LikelihoodFunction::ConjugateGradientDescent. The function evaluated at current parameter values [-553437.0796143888] does not match the last recorded LF maximum [-553433.5260891961]

Function call stack

1 :  [namespace = fBxfPpkg] Optimize(mles, likelihoodFunction);
-------
2 :  relax.alternative_model.fit=estimators.FitLF(relax.filter_names,relax.trees,{"0":relax.model_map},relax.general_descriptive.fit,relax.model_object_map,{terms.run_options.retain_lf_object:TRUE,terms.run_options.optimization_log:relax.optimization_log_file("MainALT-log.json")});
-------
3 :  relax.FitMainTestPair();
-------

Check errors.log for execution error details.

relax_code9_ectop_f_all_flatworm.log

I successfully managed to conduct this run on a smaller dataset, comprising a proportion of the species inclucded. So we suspect that it is either the size of the dataset or outlier species that may have caused it. We would appreciate greatly if you could help us infer what causes this error. Thank you!

Sincerely, Dong

github-actions[bot] commented 1 year ago

Stale issue message