Spidermonkey/BGM batch file template for amino acid sequences?

aavilaherrera commented 9 years ago

Hi all — I was wondering if there was an example or template batch file for running Spidermonkey/BGM on protein alignments locally.

I found the the example for nucleotides in res/TemplateBatchFiles/QuickSelectionDetection.bf but I'm having trouble adapting it for amino acid sequences to match the Spidermonkey webserver. I'm not sure exactly which variables I need to set in order to do the amino acid analysis. I am guessing these are the lines where the magic happens: _bgm_data = obtainBGMParameters ("lf"); and ml_bgm_results = runBGM (_bgm_data); . I assume I need something from the model fitting (AnalyzeNucProtData.bf) and ancestral mapping (Utility/AncestralMapper.bf) batch files, but I am getting lost navigating through the #includes, ExecuteAFile()s, and LoadFunctionLibrary()s.

I would greatly appreciate any pointers to code I could adapt or for which batch files to modify.

Thanks, a

ArtPoon commented 9 years ago

You can find better example code for using BGM in standalone HyPhy in tests/hbltests/BayesianGraphicalModels/TestBGM.bf

The QuickSelectionDetection.bf code is a bit of a kludge, I would not use it as a template. Assuming that you want to analyze the amino acid substitution process using BGMs (and not the observed distribution of amino acids), you will need some ancestral reconstruction of AAs in the tree. Also assuming that you are working with a protein sequence alignment and not codon sequences, you can fit a model of AA evolution with AnalyzeNucProtData.bf (good guess!). Next you need to map AA substitution events to your tree. I will dig around for how to do this and my carpal is acting up so please be patient :)

other a

aavilaherrera commented 9 years ago

Thanks for the quick reply! I am interested in the amino acid substitution process and I am working with protein (not codon) sequences.

Assuming that you want to analyze the amino acid substitution process using BGMs (and not the observed distribution of amino acids)

If I understand the method correctly, I do want the to analyze the substitution process. The idea is to obtain a matrix of substitution events indexed by branches (rows) and sequence alignment columns (columns). And then learn the conditional probabilities among the columns by learning the structure of a BGM where the nodes are the columns. Right?

If one were to analyze the observed distributions of amino acids, would the matrix then have 20*alignment length columns and number of sequences rows, similar to the matrices in PSICOV? Perhaps the amino acid counts could be corrected similarly to attempt to correct for phylogeny, but I'm interested in evaluating the first approach first.

Thanks again and best wishes, a

ArtPoon commented 9 years ago

Ok, you need to grab these files from my git repo: https://gist.github.com/ArtPoon/865da574e13517955fb0 https://gist.github.com/ArtPoon/1838da20c383e540076c

The first file (AnalyzeNucProtData.bf) is a slight modification of the template batch file so that you can save your AA model fit to a file.
The second file (MapAAMutationsToTree.bf) is a more extensive modification of a batch script that we use to map substitution events to a tree. Save as "List" to look over the substitution map and compare it to the alignment - this is a good sanity check. Then run it again and save as "Binary" to get a CSV matrix of sequences (rows) and amino acids (columns). This is the raw material for BGM inference.

ArtPoon commented 9 years ago

I don't think HyPhy has PSICOV but it is not difficult to specify your own matrix by selecting CUSTOM. There is also a broad range of template matrices available in AnalyzeNucProtData.bf.

To continue, once you've generated the binary state matrix (where 1 indicates the occurrence of an amino acid substitution on the respective branch and site), you will want to winnow the matrix down a bit using something like R to select columns with the most action.

Then grab this file: https://gist.github.com/ArtPoon/a7a9a5168c8aeb7513eb

and run it on the binary state matrix. Make sure to specify whether your CSV file has a header row or not. (It is nice to have column labels that correspond to the amino acids of your alignment coordinates). For some reason my copy of HyPhy is exiting before it gets to the last bit (outputting results) and I haven't figured out why yet :-(

aavilaherrera commented 9 years ago

Thanks very much, I'll try these ASAP.

ArtPoon commented 9 years ago

Any luck? If so then I can close this issue :-)

aavilaherrera commented 9 years ago

Thanks for the batch files, I think I was able to run one analysis successfully. Here are a few obstacles I ran into.

1. `AnalyzeNucProtData.bf`

AnalyzeNucProtData.bf worked fine, but I had to put all the batch files in HYPHY's template batch file directory to run them (e.g. ~/lib/hyphy/TemplateBatchFiles).

2. `MapAAMutationsToTree.bf`

In running MapAAMutationsToTree.bf, if I select List output, it segfaults with the following error:

Error:
Invalid Matrix Index [0][190] in a 1 by 20 matrix.

Function call stack
1 : fprintf(writeToFile,rep,"\t",bn,"\t",site+1,"\t",acids[aa2],"-->",acids[aa1],"\n")
-------
2 : reconstructAncestors(0)
-------
Segmentation fault (core dumped)

But, selecting the binary option prints out a matrix with what I think are the proper dimensions: number of rows equal to the number of branches in the tree and number of columns equal to the number of columns in the alignment.

3. `bgm_demo.bf`

After adding in my own headers and removing columns that only have ones or zeros, I ran bgm_demo.bf. I modified the max_parents to 1 in nodes[Abs(nodes)] = add_discrete_node(names[i], 1, 0, 2) for undirected edges.

Leaving in columns with all ones or zeros in the matrix causes errors like:

Read 7 cases from file.
Detected 204 variables.
Error:
ERROR: Number of levels in data (1) for discrete node 10 is not compatible with node setting (2).  Check your data or reset the BayesianGraphicalModel.

Function call stack
1 : Set parameter BGM_DATA_MATRIX of bgm to data
-------
2 : ExecuteCommands in string "SetParameter ("+_bgm+", BGM_DATA_MATRIX, data);" using basepath /home/aram/lib/hyphy/TemplateBatchFiles/.
-------
3 : attach_data("bgm",mat,0,0,0)
-------
Segmentation fault (core dumped)

Column 10 in the input matrix was all zeros. At least I think that is what nlevels is about (number of states the node can take?).

Success

In all, after filtering like you suggested with R and adding the headers, I did get some output, first printed to stdout, and then it saved in a file edgelist.out in HYPHY's template batch files directory (where I put the batch files in order to run them (e.g. ~/lib/hyphy/TemplateBatchFiles)).

The output looks like a CSV with columns first column, second column, score. Where I think the score is the posterior probability for the edge given the amino acid substitution mapping onto the tree?

Speculative questions

Do you think HYPHY is likely to explode during the AnalyzeNucProtData and MapAAMutationsToTree steps if I give it an alignment and tree with ~5000 sequences and leaves? I'm assuming the number of alignment columns is the main hurdle for the BGM step and the order MCMC being O(N)? suggests I'll be able to run on 1000's of columns? Assuming I don't run out of RAM?

unsolicited feedback

Is there a recommended solution to dealing with long complicated file paths non-interactively? I have some unwieldy path names that are hard to input correctly by hand. Can HYPHY search for input files in the directory it was started from? Or should I automatically generate another bf file with the paths hardcoded in for each run?

ArtPoon commented 9 years ago

Thanks for the extensive feedback!

Re: segfault in MapAAMutationsToTree.bf, I suspect that is a gap character state. I will revise the Gist.

Yes nlevels corresponds to the number of values (levels) that a variable (factor) takes in the corresponding column. So for a column that is all zeroes, the number of levels is 1. You don't want to bother with columns like this because there will be no information for fitting the BGM.

The score column is indeed a posterior probability - it is a marginal edge posterior probability that is essentially telling you the fraction of sampled BGMs that contained the corresponding edge. If your output is undirected then it is the fraction of BGMs that contained the edge A->B or A<-B. Otherwise the edge is directed as first column -> second column.

HyPhy should be able to handle trees with 5,000 sequences or more. You will want to constrain the branch lengths to estimates from a phylogenetic reconstruction program like RAxML or FastTree2. I'll post another script for doing this.

I think the most convenient with dealing with long file paths is to make the user navigate to the target directory and extract the path from the variable LAST_FILE_PATH, and prefix every subsequent file path with this info.

Cheers,

Art

aavilaherrera commented 9 years ago

Thanks! I appreciate your help very much.

ArtPoon commented 9 years ago

Here's a script for fitting an AA model to a tree while constraining the branch lengths to scale by some global factor: https://gist.github.com/ArtPoon/197ecf53efe079999cdc

ArtPoon commented 9 years ago

One other thing: when there is a number of files in one or more directories that I want to iterate over, I usually create a text file with the file paths and then read these paths into HyPhy, and then iterate over the resulting vector:

SetDialogPrompt ("Select file containing paths:");
fscanf (PROMPT_FOR_FILE, "Lines", paths);
for (f = 0; f < Columns(paths); f = f + 1) {
    fprintf (stdout, "processing ", paths[f], "\n");
}

aavilaherrera commented 9 years ago

That last script worked great, I just added in my own template model here

ExecuteAFile (HYPHY_LIB_DIRECTORY+"TemplateBatchFiles"+DIRECTORY_SEPARATOR+"TemplateModels"+DIRECTORY_SEPARATOR+"Custom_AA.mdl");

Thanks!

ArtPoon commented 9 years ago

Hey great, glad it worked out for you :-) Caveat: I hacked those scripts together from existing stuff. While I did some sanity checks, you should check some outputs (such as inspecting the list of reconstructed substitutions and making sure they make sense given your alignment, tree, and ancestral sequence estimates).

aavilaherrera commented 9 years ago

I spot checked a few columns. And I believe they look okay (though I may have missed something, I've been staring at the computer too long). The extant amino acids match the alignment, and the substitutions along the branches match the tree.

For example,

0   Node6   6   L-->A
0   5503_Esc    6   L-->I
0   3733_Bac    6   L-->M
0   338_Anae    6   A-->S

(5503_Esc:1.0326,((338_Anae:1.22346,854_Geob:1.02921):1.09141,(8694_Alk:2.55265,3733_Bac:2.28814):0.0786):0.25751);

shows an L at the 6th column getting substituted as it "rolls" down the branches. The two leaves not shown keep the ancestral L and an A from Node6. Of the 7 branches, 4 had substitutions and there are 4 1s in the corresponding column of the binary matrix.

Thanks again for your help! Hope this post helpful to someone down the line, too!

veg / hyphy