veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Is BGM missing in new version of Hyphy 2.2.4? #322

Closed by DavidNickle 8 years ago

spond commented 9 years ago

Hi David,

It's there. Art updated the syntax for BGM usage. Do you have a specific example where something breaks?

Sergei

DavidNickle commented 9 years ago

Hi Sergei,

No, it is just not in the location where it used to be. How do I access it through the GUI app?

David

ArtPoon commented 9 years ago

Hi David, Sorry for the delay. The BGM script was never designed to be accessible from the Standard Analyses menu in the HyPhy GUI. The only exception is when you select the BGM co-evolution option under the QuickSelectionDetection workflow, which is under the Positive Selection menu. I ran through this workflow using the command-line build of a recent version of HyPhy and it seems to work ok. Is this the functionality you're looking for?
As for the demo script for BGMs, you can run TestBGM.bf which is at tests/hbltests/BayesianGraphicalModels. There is also a BGM.bf under the TemplateBatchFiles directory that should be up to date. I hope that helps, please let me know if you're still running into problems. Best,

JaneZXJ commented 8 years ago

Dear @ArtPoon :

How long would an analysis generally take for a dataset of 19 sequences of 438 bp when the dN/dS bias parameter option is set to Estimate+CI in HyPhy? Also, is it possible for BGM to detect co-evolution between sites in different subunits of a heterotetramer, such as α-globin and β-globin?

Is there also a detailed guide to the relevant HyPhy commands, as there is for the other models by Sergei?

Thanks so much! Jane

ArtPoon commented 8 years ago

Hi Jane,

Without knowing anything about your hardware or model specification, I am willing to speculate that it should not take longer than half an hour to analyze an alignment of 19 sequences, 438 bases in length.

Estimating the confidence interval may add a bit to computing time, but as I recall this is based on profile likelihood and not too expensive.

Regarding BGM, again you'll have to give me more specifics before I can give you a meaningful estimate. Since BGM is run as a Markov chain Monte Carlo sampler, you can run it for as short or as long a time as you like, but the quality of your results will scale accordingly.

In general, you don't want to be analyzing many more sites with BGM than your sample size (number of sequences). So if you're referring to the same alignment as before, you want to restrict the BGM analysis to fewer than 20 sites; preferably the most variable. A BGM analysis on this number of variables should not take very long to run, say less than an hour (again complete speculation, YMMV).
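
The site-selection advice above can be sketched in a few lines of Python. This is only an illustration and not part of HyPhy: the function name `most_variable_sites` is hypothetical, and scoring a column by the number of sequences that differ from its most common character is an assumed measure of variability. Gap characters are treated like any other character here.

```python
from collections import Counter


def most_variable_sites(sequences, max_sites=None):
    """Return 0-based indices of the most variable alignment columns.

    Variability is scored as the number of sequences that differ from the
    column's most common character (an illustrative choice, not HyPhy's).
    By default, no more sites are kept than there are sequences.
    """
    if not sequences:
        return []
    n_seqs = len(sequences)
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same aligned length")
    if max_sites is None:
        max_sites = n_seqs

    scored = []
    for col in range(length):
        counts = Counter(seq[col] for seq in sequences)
        minority = n_seqs - counts.most_common(1)[0][1]
        if minority > 0:  # skip invariant columns entirely
            scored.append((minority, col))

    # Highest score first; ties broken by column position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(col for _, col in scored[:max_sites])


alignment = ["ACGT", "ACGA", "ATGA", "ATCA"]
print(most_variable_sites(alignment))               # [1, 2, 3]
print(most_variable_sites(alignment, max_sites=2))  # [1, 2]
```

For the 19-sequence alignment discussed here, the default cap would keep at most 19 columns, which matches the "fewer than 20 sites" rule of thumb.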

You'll have to give me some specifics if you want references to detailed information on HyPhy commands -- is there something you're looking for that's not already in the Commands PDF file?

JaneZXJ commented 8 years ago

Hi Art, Thanks for your reply. But I couldn't find a Commands PDF file for BGM, or any BGM content in HyPhy.PDF. Did I miss it?

  I also selected fewer than 10 sites on the Spidermonkey web server 12 hours ago, and the jobs are still running:
  http://www.datamonkey.org/cgi-bin/datamonkey/mpiJobStatus.pl?fileName=upload.12537961363546.1&task=bgm
  http://www.datamonkey.org/cgi-bin/datamonkey/mpiJobStatus.pl?fileName=upload.50936074323458.1&task=bgm
  http://www.datamonkey.org/cgi-bin/datamonkey/mpiJobStatus.pl?fileName=upload.313698912117567.1&task=bgm

  Thanks again for more suggestions!

Jane

Xiaojia Zhu Ph.D student Ornithological research group Key Laboratory of Zoological Systematics and Evolutionary Center 5 Institute of Zoology, CAS 1 Bei Chen West road Chao Yang District Beijing,100101 xiaojia0402@hotmail.com zhuxiaojia@ioz.ac.cn Office: 010-64807188 Phone: 86-13581827838

ArtPoon commented 8 years ago

http://www.hyphy.org/pubs/Commands.pdf

I think this document is superseded by the wiki at http://hyphy.org/w/index.php/HyPhy_Batch_Language but is a useful resource nonetheless.

And it sounds like the BGM pipeline on Datamonkey is stuck again. I'd do a local run.

jrotieno commented 7 years ago

Sorry to take you guys back, but was this really answered?

I have installed HyPhy and can run it from the command line by calling HYPHYMPI. However, I can't find the BGM analysis. Where is it?

Thanks.

James

spond commented 7 years ago

Dear @jrotieno,

It is there, just not in the most obvious place; I'll post a solution for you (or @ArtPoon will) in a little while. Are you looking for the "stock" BGM implementation for codon data?

Best, Sergei

ArtPoon commented 7 years ago

I have some example scripts and data for using the BGM code under hyphy/tests/hbltests/BayesianGraphicalModels. Give me a second and I'll write a more thorough response; I'm on a conference call right now!

DavidNickle commented 7 years ago

I don’t remember exactly how but I was able to get BGM to work and it gave the results I expected.

This is what I recall.

I had to generate a Nexus file with the Newick tree description in the tree block. Then I loaded the Nexus file, and when HyPhy asked whether I wanted to use the detected tree, I replied yes.

Sorry I could not be of more help.

David

jrotieno commented 7 years ago

I have an AA alignment that I would like to use for this.

spond commented 7 years ago

Dear @jrotieno,

I have a reasonably self-contained script for the branch currently under development (v2.3-dev) that does it for nucleotide data; I'll modify it to handle amino-acid data as well. As an added benefit, you can ask to test for coevolution only on a specific subset of tree branches (e.g. a clade, or all internal branches).

You should be able to tweak it without too much difficulty, I hope.

Best, Sergei

ArtPoon commented 7 years ago

This has come up before, see issue #367

jrotieno commented 7 years ago

@spond and @ArtPoon I'm just realizing that this isn't as simple as "select option, upload file" like I thought. Apologies, I have never used HyPhy before, though I have used Spidermonkey a lot.

If I understand correctly, I have to pass some sort of config file, template, or script to HYPHYMPI?

Thanks

ArtPoon commented 7 years ago

  1. Map amino acid substitution events to branches in your phylogeny. This involves fitting a substitution model to your AA alignment and then performing ancestral reconstruction by maximum likelihood. I have provided a couple of batch files as GitHub gists to do this:
  2. Analyze the distribution of AA substitutions with Bayesian networks (BGMs) to detect correlated substitutions (sets of AA sites that tend to have substitutions co-occurring on the same branches). The gist for this step is here.
  3. Interpret the results. The BGM can output a list of graph edges with marginal posterior probabilities. I've also implemented a function that will output a GraphViz DOT format file that can be rendered as a graph with the right software, but I don't think I've committed those changes yet.
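
For step 3, here is a rough Python sketch of turning an edge list into GraphViz DOT text. It is not the uncommitted HyPhy function mentioned above, and it does not parse any real BGM output file: the `(node_a, node_b, posterior)` tuples, the `edges_to_dot` name, and the 0.9 cutoff are all hypothetical, chosen only to illustrate the idea. Edges are drawn as undirected for simplicity.

```python
def edges_to_dot(edges, cutoff=0.9, graph_name="bgm"):
    """Render edges with posterior >= cutoff as an undirected DOT graph.

    `edges` is an iterable of (node_a, node_b, posterior) tuples; this
    in-memory layout is an assumption, not the BGM output format.
    """
    lines = ["graph %s {" % graph_name]
    for node_a, node_b, posterior in edges:
        if posterior >= cutoff:
            # Label each retained edge with its marginal posterior probability.
            lines.append('  "%s" -- "%s" [label="%.2f"];' % (node_a, node_b, posterior))
    lines.append("}")
    return "\n".join(lines)


example_edges = [("site12", "site45", 0.97), ("site12", "site80", 0.31)]
print(edges_to_dot(example_edges))
```

The resulting text can be saved to a `.dot` file and rendered with the GraphViz `dot` program.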

So the overall workflow should look something like this (note the filename arguments are entirely arbitrary):

> HYPHYMPI AnalyzeNucProtData.bf
Please specify a nucleotide or amino-acid data file: your_alignment.fasta
______________READ THE FOLLOWING DATA______________

Then you will be asked for a tree. HyPhy will optimize the AA substitution model and then:

Select a file to save likelihood function: myAminoFit.lf

Next,

> HYPHYMPI MapAAMutationsToTree.bf
Choose a file containing an exported likelihood function: myAminoFit.lf
Retrieved data partition [something]
Retrieved tree [something else]
Ancestor reconstruction option: 1
Output option: 1
Provde a filename to write output(s) to: foobar.csv

And finally:

> HYPHYMPI bgm_demo.bf
Select file containing binary data matrix: foobar.csv

The results will be written to edgelist.out.

jrotieno commented 7 years ago

@ArtPoon

Awesome!!! Thanks so very much for the instructions. I am getting this done ASAP.

Quick one (though you may be asleep now, given the time zone difference): how much does the supplied tree affect the inferred co-evolving sites? For example, what if I gave HyPhy an ML tree versus a simple NJ tree?

jrotieno commented 7 years ago

@ArtPoon

I finally ran the BGM analysis. The earlier steps went through fine, but I got the following error at the HYPHYMPI bgm_demo.bf stage.

What could be the problem?

Attached my input csv file Mut_to_tree_SH_G_F_L_AA.txt

Thanks

ArtPoon commented 7 years ago

Oh, right. I forgot that the map ancestors batch script still outputs a tab-delimited file, not a CSV. Second, you need to reduce the number of columns in this binary character matrix. BGM is not going to be happy when some columns are entirely zeroes. Here's a little R to do this filtering step:

> map <- read.table("/Users/art/Downloads/Mut_to_tree_SH_G_F_L_AA.txt", sep="\t", header=FALSE)
> dim(map)
[1]  157 3128
> # tabulate the column sums (number of substitutions per column)
> table(colSums(map))
   0    1    2    3    5 
3059   65    2    1    1 
# most of the columns are empty -- there isn't much signal in these data
> # keep only the columns with at least one substitution
> map.2 <- map[ , colSums(map) > 0]
> write.csv(map.2, file="toBGM.csv", quote=FALSE, row.names=FALSE)

This will give you a file that you can run with bgm_demo.bf. Caveat: it looks to me like there are not enough AA substitution events in the evolutionary history of your data to detect correlated substitutions.
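
If R is not at hand, the same filtering idea can be sketched in Python with only the standard library. This is an illustration under assumptions: `filter_empty_columns` is a hypothetical helper, it works on text already read into memory, and it writes no header row. Whether bgm_demo.bf expects a header line is not something this sketch settles, so adjust accordingly.

```python
import csv
import io


def filter_empty_columns(tsv_text):
    """Drop all-zero columns from a tab-delimited 0/1 matrix.

    Returns (csv_text, kept_column_indices). The output has no header row,
    which is an assumption about what the downstream script expects.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [[int(value) for value in line] for line in reader if line]
    if not rows:
        return "", []
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("rows have differing numbers of columns")

    # A column is kept only if at least one branch carries a substitution.
    col_sums = [sum(row[j] for row in rows) for j in range(n_cols)]
    keep = [j for j, total in enumerate(col_sums) if total > 0]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[j] for j in keep])
    return out.getvalue(), keep


text = "0\t1\t0\n0\t0\t0\n0\t1\t1\n"
filtered, kept = filter_empty_columns(text)
print(kept)      # [1, 2]
print(filtered)  # three comma-separated rows: 1,0 / 0,0 / 1,1
```

Keeping the list of retained column indices makes it possible to map BGM nodes back to the original alignment positions afterwards.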

jrotieno commented 7 years ago

@ArtPoon,

The output was actually csv. I just uploaded txt as GitHub rejected the csv.

I think the error could be the result of too few substitution events.

However, for the same dataset, Spidermonkey did find some correlated substitutions. I only resorted to a local analysis because Spidermonkey is limited to 75 sites, and my alignment has about 89 variable sites.

ArtPoon commented 7 years ago

@jrotieno, could you please post the error you're seeing? I think you might have forgotten to include it with your previous post.

jrotieno commented 7 years ago

@ArtPoon, this was the error:

Read 156 cases from file.
Detected 3128 variables.
Error:
Master node received an error:ERROR: Number of levels in data (1) for discrete node 0 is not compatible with node setting (2). Check your data or reset the BayesianGraphicalModel.

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to data
2 : ExecuteCommands in string "SetParameter ("+_bgm+", BGM_DATA_MATRIX, data);" using basepath /usr/local/lib/hyphy/TemplateBatchFiles/.
3 : attach_data("bgm",mat,0,0,0)

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------

ArtPoon commented 7 years ago

The BGM script does not automatically detect the number of levels for each discrete variable (factor). It also assumes that these levels are progressively numbered with integer values starting from zero. So a binary character should be encoded by 0's and 1's. This error tells me that there is at least one column in your data matrix that contains exclusively 0's (or 1's, but that's less likely).
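
A quick pre-flight check along these lines can catch the problem before HyPhy does. The Python sketch below is hypothetical and independent of HyPhy: `check_levels` simply reports every column whose observed values are not exactly 0 through `expected_levels - 1`, which is my reading of the coding requirement described above.

```python
def check_levels(matrix, expected_levels=2):
    """Report columns whose observed values are not exactly 0..expected_levels-1.

    `matrix` is a list of equal-length rows of integers. Returns a dict mapping
    each offending column index to the sorted list of values seen in it.
    """
    problems = {}
    n_cols = len(matrix[0]) if matrix else 0
    wanted = set(range(expected_levels))
    for j in range(n_cols):
        observed = {row[j] for row in matrix}
        if observed != wanted:
            problems[j] = sorted(observed)
    return problems


data = [
    [0, 1, 0, 2],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
print(check_levels(data))  # {0: [0], 1: [1], 3: [0, 1, 2]}
```

In this toy matrix, column 0 is all zeroes and column 1 is all ones (one level each), while column 3 has three levels, so only column 2 would suit a binary node.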

jrotieno commented 7 years ago

@ArtPoon I did the filtering and then ran the new csv with bgm_demo.bf

Got the error below

Read 157 cases from file.
Detected 1490 variables.
[MacBook-Pro:20992] *** Process received signal ***
[MacBook-Pro:20992] Signal: Bus error: 10 (10)
[MacBook-Pro:20992] Signal code: Non-existant physical address (2)
[MacBook-Pro:20992] Failing at address: 0x7fff9a73ec28
[MacBook-Pro:20992] [ 0] 0   libsystem_platform.dylib            0x00007fff91d66bba _sigtramp + 26
[MacBook-Pro:20992] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[MacBook-Pro:20992] [ 2] 0   HYPHYMPI                            0x000000010413487e _ZN23_BayesianGraphicalModel13SetDataMatrixEP7_Matrix + 1502
[MacBook-Pro:20992] [ 3] 0   HYPHYMPI                            0x0000000103fd3999 _ZN18_ElementaryCommand18HandleSetParameterER14_ExecutionList + 3497
[MacBook-Pro:20992] [ 4] 0   HYPHYMPI                            0x0000000103fc320b _ZN18_ElementaryCommand7ExecuteER14_ExecutionList + 5163
[MacBook-Pro:20992] [ 5] 0   HYPHYMPI                            0x0000000103fc435b _ZN14_ExecutionList7ExecuteEv + 635
[MacBook-Pro:20992] *** End of error message ***
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

HYPHYMPI terminated.
Error:
HyPhy killed by signal 15

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to bgmData
-------

--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node MacBook-Pro exited on signal 10 (Bus error: 10).
--------------------------------------------------------------------------
[MacBook-Pro.local:20988] 6 more processes have sent help message help-mpi-api.txt / mpi-abort
[MacBook-Pro.local:20988] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

ArtPoon commented 7 years ago

The excessive error messaging is due to an exception being thrown while running in an MPI environment. There is still a problem with your data matrix, or at least with how the BGM script is interpreting it. To clarify the error, you may want to run in a non-MPI instance, e.g., with HYPHYMP. Can you please upload the filtered matrix to GitHub, or alternatively tell me what filters you applied to the previous binary matrix?