popgenmethods / SINGER

Sampling and inference of genealogies with recombination
MIT License
22 stars 4 forks source link

`-resume` flag fails because `ARG.bin_size` is never set #13

Open nspope opened 5 months ago

nspope commented 5 months ago

Currently using -resume triggers the assertion: https://github.com/popgenmethods/SINGER/blob/23469d38b7a7a1e431e65c1d018023c67e096420/SINGER/SINGER/BSP.cpp#L121

AFAICT the underlying reason is that when using the -resume flag, the array ARG.rhos contains only zeros. This is because ARG::compute_rhos_thetas multiplies the recombination rate by ARG.bin_size here: https://github.com/popgenmethods/SINGER/blob/23469d38b7a7a1e431e65c1d018023c67e096420/SINGER/SINGER/ARG.cpp#L59

... and ARG.bin_size is initialized to 0 but never set during: https://github.com/popgenmethods/SINGER/blob/23469d38b7a7a1e431e65c1d018023c67e096420/SINGER/SINGER/Sampler.cpp#L766

(whereas, when running from the top, Sampler::build_singleton_arg sets ARG.bin_size here).

It'd be awesome to fix this; currently the inability to resume MCMC after hitting a numerical issue really slows things down.

YunDeng98 commented 5 months ago

Hi @nspope , sorry about this. I will fix this right away. I am pulling together a new 0.1.8 version to incorporate other fixes of some bugs. But I will work on this right away.

nspope commented 5 months ago

No worries, thanks for the quick response!

nspope commented 5 months ago

Also if I could make a quick suggestion -- maybe in singer_master, there could be a flag --auto-resume or something similar, that adds the -resume flag to: https://github.com/popgenmethods/SINGER/blob/23469d38b7a7a1e431e65c1d018023c67e096420/SINGER/SINGER/singer_master#L27

so that the restarts can pick up from the last iteration rather than from the top?

YunDeng98 commented 5 months ago

@nspope try now the 0.1.8 version and I think the "empty bin"problem should be gone. Version 0.1.8 isn't finished on all the fixes yet, but this specific problem should be resolved.

nspope commented 5 months ago

Thanks Yun! So use the main branch from github.com/yundeng98/SINGER in the interim?

YunDeng98 commented 5 months ago

The suggestion I have for now is that, as you can tell, singer_master is all about resuming when things failed, and make that into an automatic workflow. However, sometimes it behaved outside my imagination. In this project of yours, you probably have more ideas than me about what is needed. So maybe go ahead and build your own workflow, so that my slow fixes won't delay you. I will let you know when 0.1.8 is completely finished and by then the control logic should be improved, including the resuming from last point logic. Just look for the binaries in releases/singer-0.1.8xxxxxx, and the main branch it is.

nspope commented 4 months ago

I'm sure you've got your own preferred way to do this ... but FWIW the most stable approach I've found is to restart from the second-to-last MCMC iteration, as in this Snakemake workflow.

The reason to not restart at the current MCMC iteration is that there are (rarely) instances where the ARG from the last iteration is invalid (seems like this might be floating point error wherein child age > parent age?). So, it works to delete the last "rethread" line of the log and restart from there.

YunDeng98 commented 4 months ago

Thanks @nspope I believe you are right, and something must be sub-optimal with the ARG re-scaling that causes the numerical issue of child/parent node age problem. I have enabled a function to set back a given number of iterations back, and it seems that I should simply set the number to be 1? I can go ahead and implement this into a new singer_master.

nspope commented 4 months ago

This has worked without a hitch for me, so seems like a good plan! Sounds like you've already got something implemented, but feel free to adapt the code snippet in that snakemake workflow if it looks useful.