sriramlab / OrientAGraph

GNU General Public License v3.0

jobs failing based on random seed #8

Open sjfleck opened 1 year ago

sjfleck commented 1 year ago

Thank you for creating this tool. I'm having some trouble getting all of my jobs to run. I'm running up to 20 migrations (overkill for what OptM determined as the optimum number of migrations) and 30 iterations for each run. When I did this with Treemix, everything finished just fine. Now that I'm trying OrientAGraph, some always seem to fail. For example, after the last run, this many OrientAGraph jobs failed:

m=0 - 0/30 failed
m=1 - 0/30 failed
...
m=4 - 0/30 failed
m=5 - 1/30 failed
m=7 - 2/30 failed
m=8 - 8/30 failed
m=9 - 5/30 failed
m=10 - 13/30 failed
...
m=19 - 24/30 failed
m=20 - 26/30 failed

The ones that do fail seem to fail for 3 reasons:

  1. Performing exhaustive search to add migration edge to base tree Migration edge no.3 added ERROR: Calling remove_mig_edge on a non-migrations edge

  2. Performing exhaustive search to add migration edge to base tree orientagraph: /usr/include/boost/graph/detail/adjacency_list.hpp:1202: void boost::bidirectional_graph_helper_with_property::remove_edge(typename Config::edge_descriptor) [with Config = boost::detail::adj_list_gen<boost::adjacency_list<boost::listS, boost::listS, boost::bidirectionalS, Node, Dist>, boost::listS, boost::listS, boost::bidirectionalS, Node, Dist, boost::no_property, boost::listS>::config; typename Config::edge_descriptor = boost::detail::edge_desc_impl<boost::bidirectional_tag, void*>]: Assertion `rng.first != rng.second' failed. Migration edge no.3 added

  3. reaching the maximum run time limit of 72 hours (least common)

The only difference between each run with the same m is the random seed. Here's an example of one of the commands that I submit to my cluster (I have a script that automatically creates scripts and submits jobs for each of the 30 iterations of each value of m, so 630 separate jobs):

s=$RANDOM
orientagraph -i $VCF.treemix.frq.gz -m 20 -o $VCF.30.20 -root $OG -bootstrap -k 500 -noss -seed $s -allmigs -mlno 1,2 > ${VCF}.30.20.log 2>&1

Any insight as to why only some of these jobs for each m are failing would be greatly appreciated. Thank you
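For anyone triaging a similar batch, the first two failure modes listed above can be tallied from the log files with a small shell helper. This is only a sketch: the two search strings are copied from the error messages quoted in this comment, while the function names and the labels they print are made up for illustration.

```shell
#!/bin/sh
# classify_log FILE: print one label describing how a run ended.
# The search strings come from the messages quoted above; the labels
# ("remove_mig_edge", "boost_assertion", "no_known_error") are hypothetical.
classify_log() {
    if grep -qF 'Calling remove_mig_edge on a non-migrations edge' "$1"; then
        echo "remove_mig_edge"
    elif grep -qF 'rng.first != rng.second' "$1"; then
        echo "boost_assertion"
    else
        echo "no_known_error"
    fi
}

# tally_logs FILE...: count how many logs fall under each label.
tally_logs() {
    for f in "$@"; do
        classify_log "$f"
    done | sort | uniq -c
}
```

Calling `tally_logs *.log` in the output directory prints one count per label. A job killed by the 72-hour limit leaves neither message behind, so it would be counted under `no_known_error` and has to be identified from the scheduler's records instead.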

ekmolloy commented 1 year ago

My generic answer to your question is that TreeMix will stop adding migration edges when it does not reduce the residuals anymore. This means that even if you run TreeMix with -m 20, it might only add 5 migration edges and then exit (this is likely to occur when the number of migrations is close to or exceeding the number of populations). In this example, OrientAGraph would fail when -m is set to more than 5. I am going to add in a conditional statement so it exits gracefully (like TreeMix). Based on your results though, it seems like something additional is going on here, but I think it will be hard to de-bug without looking at the input file. If you feel comfortable sharing your input file with me via email (ekmolloy@umd.edu), I am happy to take a look.

sjfleck commented 1 year ago

We have 41 populations in our analysis, so 20 migrations shouldn't be an issue. I'm comfortable sharing any files you need. I'll email the .treemix.frq.gz file for now, and if you need anything else, just ask. Thank you.

sjfleck commented 1 year ago

I just tried running OptM on the OrientAGraph runs that I thought had completed successfully, but I ended up with an error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 8 elements

I wrote to the author of OptM about this error, but I discovered something on my own since (I also shared this with Dr. Fitak since writing this comment).

I thought these jobs had successfully completed since they had starting and exiting likelihood scores and no errors in the .log files, but I found an additional 115 jobs (of a subset of 286 complete OrientAGraph runs) that had a WARNING in the log file. This warning was not present in the .log files for my original Treemix analysis. I'll give a couple examples of the error and some surrounding lines:

PROJECT.noN.LDpruned.14.4.log (seed= 7485)
Checking position of the root
WARNING: More than one migration edges are entering the same target!
root edge = 87 -> 65
admixture vertices = { 93 94 95 96 } is treebased
...placed root at outgroup.
WARNING: Re-rooting graph at outgroup changed llik from 5409.79166044 to 5409.79128639!
Final Admixture graph with root 81

PROJECT.noN.LDpruned.3.7.log (seed= 10809)
Checking position of the root
WARNING: More than one migration edges are entering the same target!
root edge = 103 -> 59
admixture vertices = { 99 100 101 102 103 104 105 }
...unable to place root at outgroup.
Final Admixture graph with root 81

These may not be a real issue, but this was the only difference I could find between the original Treemix runs and the OrientAGraph runs that I thought had finished successfully. Any insight into these warnings would be greatly appreciated. Thank you

sjfleck commented 1 year ago

Sorry for so many comments before you're able to respond. Dr. Fitak responded and informed me that the OptM issue was due to OrientAGraph adding an additional space at the end of the second line in the .llik files. I removed that space and successfully generated a new OptM plot for the output of OrientAGraph with 0-12 migrations and 22 iterations each. I guess the warnings that I discovered in my last comment weren't the issue. I'm looking forward to hearing back about the first two errors and whether this warning is something I don't need to worry about. Thank you again
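The workaround described here (removing the trailing space from the .llik files before running OptM) can be scripted. The helper below is a minimal sketch, not part of OrientAGraph or OptM; the function name is hypothetical, and it strips trailing whitespace from every line rather than only the second one, on the assumption that no line in a .llik file needs trailing whitespace.

```shell
#!/bin/sh
# strip_trailing_ws FILE: remove trailing whitespace from each line, in place.
# A temporary file is used instead of `sed -i`, whose syntax differs between
# GNU and BSD sed.
strip_trailing_ws() {
    tmp=$(mktemp) || return 1
    # Write the cleaned copy, then overwrite the original's contents so its
    # permissions are left untouched.
    sed 's/[[:space:]]*$//' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}
```

A loop such as `for f in *.llik; do strip_trailing_ws "$f"; done` would apply it to a whole output directory; working on a copy of the files first is the safer choice.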

ekmolloy commented 12 months ago

Hello,

Apologies for the delay, and thanks for sending your data. I have made some modifications to the code that may resolve your issues. The main change is that OrientAGraph now breaks out of the migration edge addition loop after failing to add an edge (because no edge addition improves the likelihood score); I also made a few other changes of this nature. However, I was unable to confirm whether these changes address your problems, as detailed below.

First, I tried to re-analyze your data with TreeMix version 1.13 using several different seeds (15399, 15819, 16998, 22340, 28689, 30016, 5671, 9955) and the following command:

treemix -seed $SEED -i $DATAFILE  -k 500  -root $OG -m 1 -noss -o $OUTFILE

For all of these runs, TreeMix produced the same starting tree. I checked the log files and found that all runs were adding populations in the same order, even though the seeds differed across runs. This suggests that there could be something going on with the random number generator that differs across GSL/BOOST versions and systems. I replicated the same issue when using OrientAGraph.
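One quick way to check on another system whether the seed has any effect is to compare the outputs of runs that differ only in their seed. The sketch below assumes each run wrote a gzip-compressed tree file (the `.treeout.gz` suffix in the usage note is an assumption about the output naming, and the function name is hypothetical). It compares decompressed contents, because gzip headers can differ even when the payload is identical.

```shell
#!/bin/sh
# count_distinct_outputs FILE.gz...: print how many distinct decompressed
# contents appear among the given gzip files.
count_distinct_outputs() {
    for f in "$@"; do
        # cksum on stdin prints only "checksum size", so identical
        # contents yield identical lines regardless of file name.
        gzip -dc "$f" | cksum
    done | sort -u | wc -l | tr -d ' '
}
```

For example, `count_distinct_outputs seed*.treeout.gz` printing 1 would mean every run produced the same file, which matches the behavior described above; a larger number would mean the seed does change the result on that system.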

This means that it's not possible for me to replicate your runs using seeds alone. In addition, you mentioned that the runtime is quite long. To aid in debugging, I added checkpointing to OrientAGraph.

The checkpointing feature (and the other code updates) are currently available on the debug branch. You can access the code with the following commands:

git clone https://github.com/sriramlab/OrientAGraph.git
cd OrientAGraph
git checkout debug

Then, you can build the code with the following commands:

export INCLUDE_PATH=/opt/homebrew/Cellar/gsl/2.7.1/include
export LIBRARY_PATH=/opt/homebrew/Cellar/gsl/2.7.1/lib
./configure CPPFLAGS=-I${INCLUDE_PATH} LDFLAGS=-L${LIBRARY_PATH} --with-boost="/opt/homebrew/Cellar/boost/1.82.0_1"
make

Replace the paths to GSL and BOOST above with ones suitable for your system. I also compiled TreeMix version 1.13 in a similar fashion.

Now if you specify the flag -checkpoint, OrientAGraph will write out networks found during the search (called checkpoints). Currently, OrientAGraph will checkpoint the starting graph (either the starting tree or the graph given as input with the -tf or -gf flags), the network found after each migration edge is added, and that same network after reorientation (but only if a better orientation is found). If you run OrientAGraph and send me the checkpoint files, I will be able to restart OrientAGraph from the last checkpoint before the failure, which will make debugging much easier (if my current fixes didn't address your issue).

While implementing the checkpointing feature, I noticed a few interesting things about your data. First, the starting tree computed has negative branch lengths. This occurs when running TreeMix and it also happens in OrientAGraph (which makes sense because OrientAGraph doesn't change how the starting tree is computed). I am concerned that this is causing some problems and maybe these negative branch lengths need to be forced to 0 to avoid future problems. This impacts the score of the initial tree, which then impacts the hillclimbing heuristic and what moves are accepted. I will send you the data and follow up via email.
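To check whether a given tree has the negative branch lengths mentioned above, the Newick text can be scanned for lengths that start with a minus sign. This is a rough sketch that assumes standard Newick notation, where each branch length follows a colon; the function name is hypothetical, and it only counts negative lengths without changing them.

```shell
#!/bin/sh
# count_negative_branches: read Newick text on stdin and print the number of
# branch lengths that are negative, i.e. a ':' immediately followed by '-'
# and then a digit or decimal point.
count_negative_branches() {
    grep -oE ':-[0-9.]' | wc -l | tr -d ' '
}
```

For instance, `printf '((A:0.01,B:-0.002):0.003,C:-1e-05);\n' | count_negative_branches` counts the two negative lengths in that toy tree. For a compressed tree file, the text can be piped in with `gzip -dc`.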

Thank you! -Erin