nanograv / PTMCMCSampler

Parallel tempering MCMC sampler package written in Python

failure to resume from chain file #53

Closed jeremy-baier closed 5 months ago

jeremy-baier commented 8 months ago

Specifically with parallel tempering, I am getting failures to start sampling (both resuming and starting a new job) with the following error message:

    File "/home/baierj/miniconda3/envs/custom_noise/lib/python3.9/site-packages/PTMCMCSampler/PTMCMCSampler.py", line 303, in initialize
        raise Exception(
    Exception: Old chain has 21 rows, which is not the initial sample plus a multiple of isave/thin = 100

I am using the most up-to-date master version of PTMCMCSampler installed from git. Weirdly, I cannot replicate this error consistently; it happens for some jobs but not for others.

kdolum commented 8 months ago

@jeremy-baier, Do you get this error even when you set resume=False or leave it unset? It's hard to understand how this can happen, because the message is printed in a block beginning if self.resume and .... If you can reproduce the problem, could you print the value of self.resume at the beginning of this block? Thanks.

jeremy-baier commented 5 months ago

Hi Ken, I wanted to follow up on this. I have still been having this issue and cannot figure out why. I do not experience it with resume=False. Can you help me understand exactly what this code block is trying to do anyway? It seems to be making sure that the saved chain length matches the expected length given the particular values of isave and thin. But why is this important? Thanks, —jeremy

kdolum commented 5 months ago

Hi, Jeremy. If you're starting a new run, you should of course either say resume=False or start in an empty directory. Then presumably this won't happen. If you're actually resuming, you should set isave and thin to the same values as in the run you are resuming. Then this shouldn't happen either, and if it does, we will have to debug it. One thing that would be useful is to look at the number of rows in the chain files before you resume and see if it corresponds to what the error message says.

One reason you might legitimately get this error is that your previous run crashed in the middle of writing out a block of the chain file, so that block is only partly written out. The previous code tried to edit your chain file in this case, but that seemed dangerous to me, so now I raise an exception and you can edit the file yourself. But it does not seem to me like this is your problem.

The reason this code is there is that we don't know what settings were used for the previous run that we are resuming. It would be a mess, for example, to change thin before resuming: then your file would have different samples representing different amounts of the actual MCMC run. So the code checks that the old chain file is consistent with having been run with the same settings.

Is there any possibility that more than one run could be using the same directory by mistake? That would naturally cause unpredictable results.
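For anyone following along, here is a minimal sketch of the consistency check Ken describes. It is not the actual PTMCMCSampler source; the names `chain_file`, `isave`, and `thin` and the one-sample-per-row file layout are assumptions taken from this discussion.

```python
# Minimal sketch of the resume consistency check described above -- not the
# actual PTMCMCSampler source. A resumable chain should contain the initial
# sample plus some whole number of checkpoint blocks of isave/thin rows each.
def check_resume_length(chain_file, isave, thin):
    with open(chain_file) as f:
        n_rows = sum(1 for _ in f)
    rows_per_block = isave // thin
    if (n_rows - 1) % rows_per_block != 0:
        raise Exception(
            f"Old chain has {n_rows} rows, which is not the initial sample "
            f"plus a multiple of isave/thin = {rows_per_block}"
        )
    return n_rows
```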

jeremy-baier commented 5 months ago

Thanks for the reply, Ken. I can confirm that I am saving different runs to different directories, so there should not be any issues there. I have not been playing with the values of isave and thin, so I don't think that is the case either. In terms of crashing mid-run, I have just been running with parallel tempering on an HPC and using scancel to stop jobs; I am not sure if there is a nicer way to ask the sampler to stop. After a little more digging, I think this might be related to PR #54 (https://github.com/nanograv/PTMCMCSampler/pulls). I have been using hot chains and writing them every time (I can confirm that they are being output in the directory). So I am still not sure why the resume is an issue.

kdolum commented 5 months ago

OK. I don't think it's #54, because that had to do with not writing hot chains. So let's try to find the bug. Could you start a run, then cancel it as you said, then look at the number of rows in all your chain files (e.g., with "wc -l")? If the number of rows is not one plus a multiple of 100, let me know and we'll try to understand how that occurs. If every file does indeed have the form 100n+1, try resuming and see if it works. Just to check, you are asking for a total number of samples that is a multiple of 1000, right?
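One way to script the row-count check Ken describes is sketched below. This is a hypothetical helper, not part of the package, and the `chain_*.txt` glob pattern is an assumption about the output file names.

```python
# Hypothetical helper for the row-count check above: report how many rows each
# chain file has and whether that is one plus a multiple of isave/thin.
import glob
import os

def audit_chain_files(outdir, rows_per_block=100):
    for path in sorted(glob.glob(os.path.join(outdir, "chain_*.txt"))):
        with open(path) as f:
            n_rows = sum(1 for _ in f)
        ok = (n_rows - 1) % rows_per_block == 0
        print(f"{path}: {n_rows} rows ({'OK' if ok else 'unexpected'})")
```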

jeremy-baier commented 5 months ago

OK Ken, I think I have tracked down what is going on. The runs that are crashing are runs where the sampler has only the initial sample written to file. (That is, the sampler has not gotten far enough to checkpoint even once.) When the sampler tries to resume, it loads the chain file back in as a 1D array rather than a 2D array, since there is only the initial sample in the file. This gets caught in the block you added because the resume length is no longer the length of the chain; instead it incorrectly gets set to the number of parameters + 4. (And if I comment out your block, this line breaks instead: https://github.com/nanograv/PTMCMCSampler/blob/98110732aa5b031daab254f3abefc4b80fda4487/PTMCMCSampler/PTMCMCSampler.py#L475, because the indexing dimensionality is wrong once a 1D array has been loaded.)

I think this could be solved by checking the dimensionality of the chain when it gets loaded in, but let me know what you think makes the most sense for a fix. I am a bit surprised that other people have not run into this issue before. Does this mean that my models are really slow getting started? (I checked, and it was about ~40 minutes to get to the first checkpoint in some cases.) Either way, I would be happy to help put in a PR to fix this. Let me know if my explanation is coherent and sounds right to you! Thanks, —jeremy

jeremy-baier commented 5 months ago

This also explains why I was not able to consistently replicate the error. It was only happening for the jobs that were slow getting started.
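A minimal standalone demonstration of the behaviour Jeremy describes (plain NumPy, not PTMCMCSampler code): np.loadtxt collapses a single-row file to a 1D array, so its length becomes the number of columns rather than the number of samples.

```python
import numpy as np

# One sample with 7 columns vs. two samples with 7 columns.
np.savetxt("one_row.txt", np.zeros((1, 7)))
np.savetxt("two_rows.txt", np.zeros((2, 7)))

print(np.loadtxt("one_row.txt").shape)   # (7,)   -> 1D; len() gives 7 columns
print(np.loadtxt("two_rows.txt").shape)  # (2, 7) -> 2D; len() gives 2 samples
```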

kdolum commented 5 months ago

Thanks, Jeremy. Good catch! In my opinion, Python is too willing to muddle the difference between different shapes of arrays with the same data. I think you can fix this by passing ndmin=2 to np.loadtxt. Please go ahead.
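For illustration, the effect of the suggested fix on the file from the example above (the actual change is in the PR referenced below):

```python
import numpy as np

# With ndmin=2, a chain file containing only the initial sample still loads as
# a 2D array, so shape[0] is the number of samples rather than the number of
# columns.
resumechain = np.loadtxt("one_row.txt", ndmin=2)
print(resumechain.shape)  # (1, 7)
```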

jeremy-baier commented 5 months ago

That sounds like a good fix!

kdolum commented 5 months ago

Fixed by #55