lgatto / pRoloc

A unifying bioinformatics framework for organelle proteomics
http://lgatto.github.io/pRoloc/
15 stars 14 forks source link

Document manual MCMC processing #113

Open lgatto opened 6 years ago

lgatto commented 6 years ago

Currently there's the tagmMcmcProcess function, but we need to document in details how to process the chains manually (before starting with the GUI).

I would suggest to add a section in the vignette. That section won't be reproducible, because a full MCMCParam object, with four (or more) full chains is too large to store with the package. What we can do is the put that object on a server so that people can download it manually and reproduce the results. The code chunks and the outputs will have to be copied from the console, unfortunately.

@ococrook - could you start this, please.

ococrook commented 6 years ago
LaurentGatto commented 6 years ago

I am working on helper functions (partly pushed), although not yet incorporated in the TAGM vignette. Here are some suggestions for the vignette:

They all oscillate around an average of 360 outliers.

The number of outliers should be similar for each chain, but what number of proportion is acceptable. How should my run compare to 360? Can I, for example, have 2000 for similar training parameters?

Reminder: any changes need to be done in inst/extdoc/mcmc.R.

ococrook commented 6 years ago

Yes, some clarification is needed. Thanks for adding the helper functions - though I think we should rename mcmc_thin_chains to mcmc_burn_chains, since I think this function was to remove the front portion of the chain. We should reserve mcmc_thin_chains for when we sub-sample the chain.

I'll think the Rhat statistic will need more detail. I thought about adding some more diagnostic test too. A review of most methods are given here: http://www.math.pitt.edu/~swigon/Homework/brooks97assessing.pdf

http://www.stat.columbia.edu/~gelman/research/published/brooksgelman2.pdf argue that 1.2 for the statistics is sufficient, however this is based on an heurisitic argument rather than a "proof". I think if it is less than 1.1 no question would be asked. I have personal preference for 1.05.

lgatto commented 6 years ago

Thanks - replaced thin with burn - it would be useful to review these names anyway later.

Feel free to update the mcmc.Rmd files with some clarification and link you suggest above.

ococrook commented 6 years ago

When plotting the posteriors for each protein at the end, one chain out of the four possible is used. It happens to be the last one that was initiated within to for loop that discarded some iterations. I assume it doesn't matter which one, as they all have been quality checked. This needs clarification though.

I'll write a function that pools the 4 chains into 1, so that all chains are used. This will give better quality results.

lgatto commented 6 years ago

The pooling function is not merged (see #116). Do you still need to update tagmMcmcProcess to make use of it, as it seems that there's some code that is duplicated now? Do you also plan to update the MCMC in details section to demonstrate this new function?

lgatto commented 6 years ago

By the way, it's probably worth later simplifying the names of these helper function to thin, pool, plot (which I have already done), ... (without making generics/methods out of them though). But this is something that can be ironed out later.

ococrook commented 6 years ago

Some code it duplicated, but these function might change slightly in the future so I don't want one to depend on the other and it come back a bite us later. Yes, I plan on updating the documentation. I want to write a few more helper functions so that the analysis is streamlined as possible but still flexible.

lgatto commented 6 years ago

Re code duplication, it is the point of one function using the other, to avoid that the manual processing (including pooling) is different from running tagmMcmcProcess. Otherwise, you'll have to change the code twice.

ococrook commented 6 years ago

I've added the helper functions I think we need, including access to more diagnostics help. Do I re-write tagmMcmcProcess to use the pool function? It would be good to do some code review and polish these functions. From here I think I can write up some more detailed documentation. More function may need to be added but they are not obvious to me at the moment.

lgatto commented 6 years ago

Thanks, I'll have a look.

In the future, if you want explicit review of the code (and I think it's generally a good thing to do), I suggest you clone the repo, send a pull request, and as for an explicit review as part of the merging process. Helps to track the review and resulting changes and updates better.

lmsimp commented 6 years ago

hi @ococrook I see it's already on the list above to add to the vignette but I second @lgatto's comment about needing more information about the Rhat statistic's. Using TAGM for the first time and looking at results from Claire's data german.diag gives

> gelman.diag(out)
Potential scale reduction factors:

     Point est. Upper C.I.
[1,]       1.07        1.2

I have no idea if this seems okay, according to you reference in the above comment, if the point estimate is < 1.2, this is okay? In the vignette you just say around 1.

Thanks!

ococrook commented 6 years ago

Hi @lmsimp ! The honest answer is assessing convergence is quite challenging and we should be mildly paranoid about convergence. I'd produce some plots of the iterations just to check they are rapidly oscilating around there mean. The Gelman diagnostic is a good tool and if you quote that the statistic was less than 1.2, then everything is probably good as suggest in the paper (though we can never say for certain!).

lgatto commented 6 years ago

@ococrook - could you update the roxygen documentation for freq in mcmc_thin_chains (here), specifying that it should be a numeric (between 0 and 1, I assume), ....

Do you think you could also provide defaults for the extra helper function parameters?