veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

command line arguments saved to json? #1601

Closed dcaffrey closed 1 year ago

dcaffrey commented 1 year ago

Hi, are the options specified by the user (command line or interactive) saved to the JSON? I could not find them. If not, this would be a very useful feature to have, for the following reasons:

1) Saving the options would ensure that the methods in a paper accurately capture the procedure that was used.

2) I recently ran MEME and got slightly different results on the command line versus in interactive mode. If the options were saved to the JSON, I could be 100% sure that the options were the same (I can't tell whether interactive mode is setting something in the background that I don't know about) and that the differences I am seeing are due to the method (e.g. a non-deterministic algorithm).

Thanks, Daniel

spond commented 1 year ago

Dear @dcaffrey,

You raise an excellent point and make a great suggestion. It's almost embarrassing that it's not there already, really. At the moment, the only record of selected options goes to stdout and the command line itself. There is no easily accessible record of it in the JSON file. Certain things (file paths, selected branches, versions of analyses and runtime) do get stored because they are needed for visualization, and settings can sometimes be inferred from the presence or absence of fields in the JSON file.

I will add the capture of all command line keyword argument settings to the standard JSON-generating analyses, and include that in the next version of HyPhy.
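As a rough illustration of the idea (not HyPhy's implementation), the Python sketch below parses `--keyword value` pairs from an argument list and stores them alongside the results before they are written out as JSON. The function name `attach_run_settings`, the key `"run settings"`, and the flag names in the example are all assumptions made for this sketch.

```python
import json


def attach_run_settings(results, argv, key="run settings"):
    """Return a copy of `results` with the parsed arguments stored under `key`.

    Hypothetical helper: the key name and layout are assumptions,
    not the format HyPhy writes.
    """
    keyword_args = {}
    other_args = []
    i = 0
    while i < len(argv):
        token = argv[i]
        # Treat "--name value" as a keyword argument; anything else
        # (including a trailing "--name" with no value) is kept as-is.
        if token.startswith("--") and i + 1 < len(argv):
            keyword_args[token[2:]] = argv[i + 1]
            i += 2
        else:
            other_args.append(token)
            i += 1
    annotated = dict(results)
    annotated[key] = {
        "keyword arguments": keyword_args,
        "other arguments": other_args,
        "raw": list(argv),
    }
    return annotated


# Illustrative argument list; the flag names are placeholders.
example = attach_run_settings(
    {"analysis": "example"},
    ["meme", "--alignment", "aln.fas", "--pvalue", "0.1", "CPU=1"],
)
print(json.dumps(example, indent=2))
```

Keeping the raw argument list next to the parsed form means two runs can be compared with a plain diff of the JSON, even if the parsing rules change later.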

Re: MEME -- there is some degree of non-determinism, which manifests in two ways:

1). When you run HyPhy in a multithreaded environment, there is some degree of stochasticity, essentially because floating point addition is not associative: in general, (a + b) + c ≠ (c + a) + b. If different threads return a, b, c in random orders, small differences can be introduced, and they snowball as many more random orderings are realized in subsequent operations. This is common for other phylogenetic packages as well (e.g. RAxML; Alexis Stamatakis wrote about it, don't recall exactly where). You can force single-threaded execution (the CPU=1 command line argument) to remove this source of noise.

2). MEME uses some randomly generated initial values to optimize the mixture model at each site. They will differ from run to run. In most cases, the differences should be minor (within error tolerance). Sometimes, if the likelihood function is "rugged" for a specific site (e.g. multiple local optima), you may get a different result between runs. Unfortunately, systematic automatic diagnostics of such issues are not possible without major additional computation, if at all.
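The first point can be reproduced in a few lines of Python. This is a minimal sketch with made-up numbers, not HyPhy code: it shows that the grouping of a floating point sum changes the result, and that adding the same three values in every possible order yields more than one answer.

```python
import itertools
import math

# Grouping matters: the same three terms give two different doubles.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False

# Order matters too: add the same values left to right in every order,
# as if worker threads had returned them in a different sequence.
values = [1e16, 1.0, -1e16]
distinct_sums = set()
for order in itertools.permutations(values):
    total = 0.0
    for term in order:
        total += term
    distinct_sums.add(total)

print(sorted(distinct_sums))  # [0.0, 1.0]

# math.fsum tracks the rounding error and returns the exact sum.
print(math.fsum(values))  # 1.0
```

In a likelihood calculation the individual discrepancies are far smaller than in this contrived case, but they feed into an iterative optimizer, which is how they can grow into visible differences between runs.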

How different were your results?

Best, Sergei

dcaffrey commented 1 year ago

Thanks for the quick reply and glad to hear you are interested in adding the feature.

The differences between the MEME runs were small. Based on the information in STDOUT, I'm 99% confident that the options for each run were identical. There are 611 codons in my alignment, and about 28 were reported with P-values < 0.1 in one run but not the other. The differences in P-value were small. About 17 of the differing codons were at the N or C termini, where the alignment quality was poor, so I'm inclined to ignore them. Among the remaining 11 codons, which are in the "high quality" region of the alignment, there are some interesting differences: 4 of the 11 are at or near a binding site, so they are most likely true positives. The other 7 are not near a binding site, so it is harder to say whether they are true positives or not.
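A comparison like this can be sketched in Python as follows. The function `compare_calls` and the P-values are hypothetical and chosen only for illustration; the sketch assumes the per-site P-values from each run have already been extracted into two lists in site order.

```python
def compare_calls(p_run1, p_run2, threshold=0.1):
    """Return (site, p1, p2) for sites significant in exactly one run.

    Sites are numbered from 1. Hypothetical helper for illustration.
    """
    if len(p_run1) != len(p_run2):
        raise ValueError("both runs must cover the same number of sites")
    discordant = []
    for site, (p1, p2) in enumerate(zip(p_run1, p_run2), start=1):
        # A site is discordant when it passes the cutoff in one run only.
        if (p1 < threshold) != (p2 < threshold):
            discordant.append((site, p1, p2))
    return discordant


# Made-up P-values for four sites in two runs.
run1 = [0.04, 0.095, 0.50, 0.11]
run2 = [0.05, 0.105, 0.48, 0.09]

for site, p1, p2 in compare_calls(run1, run2):
    print(f"site {site}: run 1 p = {p1}, run 2 p = {p2}")
```

Listing both P-values next to each discordant site makes it easy to see which calls merely straddle the cutoff and which differ substantially.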

Daniel
