*Closed by jcesardasilva 2 years ago*
@jcesardasilva sorry for the silence on this issue. I have tried to reproduce your problem with our built-in simulations and also with some experimental data and could not find the same issue. Maybe you have a particular data set which reproduces this problem and that you could share with us? Happy to have a closer look...
Generally, ML can be a bit unstable sometimes (for whatever reason), and the problem with the GPU implementation is that we are limited to 32-bit floats and 64-bit complex, which can make those (numerical) instabilities even worse. However, the code for the pre-conditioner is actually all on the CPU and basically the same in all ML engines (ML, ML_serial and ML_pycuda), so it's a bit surprising that you see a problem with ML_pycuda when the pre-conditioner is used (and everything else is the same). Anyway, I would need to have a closer look based on your example to fully understand what is happening in this case.
@daurer Thank you very much for your answer, and my apologies for the delay in replying.
In fact, I found that the `nan` error happens either when I set a value for `.illumination.photons` or when I use a previous recon file as the initial guess. And for my files, the error begins to appear after about 200 ML iterations. A normal calculation without a value for `photons` or an initial guess works fine. It does indeed seem like a small instability somewhere. I will send you the files to reproduce the calculations by the end of the day.
Hi again, @daurer
Here is the link to get the files: https://cloud.esrf.fr/s/9bQaZgsTKqHGjTD Please download them before the link expires. There you will find:

- `sample1.ptyd` -> prepared data
- `sample1recon.ptyr` -> reconstructed data, which you can also use as initial guess to see the problem
- `ptypy_NFP_prep_recons_precond_test.py` -> ptypy script to carry out the reconstruction

The error appears at iteration 330 of ML:
@jcesardasilva thanks for the test data, I have downloaded the files and I am going to have a look at this...
And my student has just found that, for far-field data, we get some artefacts in the corners of the probe image. Below you find the result with `p.engines.engine01.scale_precond = True`, and right after it the one with `p.engines.engine01.scale_precond = False`
@jcesardasilva apologies for the delay, I finally managed to have a look at your NF example and have identified the root of the problem: the power of the initial guess for the illumination. The total power of the probe in your previous reconstruction saved in `sample1previousrecon.ptyr` is on the order of 1e13 photons, but the average power of the diffraction data that you provided is on the order of 1e10 photons. These 3 orders of magnitude of mismatch between model and data seem to cause big trouble in the regularizer, eventually making one of the poly line coefficients switch to `np.inf`. My guess is that this would also happen without the preconditioner, just at a later stage of the reconstruction. It is also worth mentioning that the GPU engines mostly use single precision, which might further accelerate how soon the regularizer hits the `np.inf` value (compared to the original ML engine). But again, the root of the problem is the mismatch in power between model and data.
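As a side note on the precision point above, here is a minimal illustration (plain numpy, not ptypy code) of why single precision hits `inf` so much earlier: the largest finite float32 is about 3.4e38, while float64 reaches about 1.8e308, so a runaway coefficient overflows far sooner in single precision:

```python
import numpy as np

# A coefficient that keeps growing (standing in for a diverging regularizer term)
coeff32 = np.float32(1e10)
coeff64 = np.float64(1e10)

for _ in range(5):
    coeff32 *= np.float32(1e6)  # grows by 6 orders of magnitude per step
    coeff64 *= 1e6

# After 5 steps both are nominally 1e40:
print(np.isinf(coeff32))  # True  -- float32 overflows past ~3.4e38
print(np.isinf(coeff64))  # False -- float64 max is ~1.8e308, still finite
```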
When loading a previous reconstruction as the starting guess of another reconstruction, there are a few things to consider. As you have figured out, the easiest way to load a previous probe is the following:
```python
p.scans.ID16A.illumination = "./sample1previousrecon.ptyr"
```
The issue with this approach is that it will estimate the power of the probe model based on the provided reconstruction and this could lead to a mismatch with the data (as described above). ptypy actually gives a warning in this case:
```
WARNING ptypy - Photon count from input parameters (2.00e+13) differs from statistics (3.03e+10) by more than a magnitude
```
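The check behind that warning can be reproduced with a one-liner: compare the order of magnitude of the two photon counts (the variable names here are illustrative, not ptypy's internals):

```python
import math

photons_model = 2.00e13  # probe power taken from the input parameters
photons_data = 3.03e10   # average power estimated from the diffraction data

# "differs by more than a magnitude" == the log10 ratio exceeds 1
magnitude_gap = abs(math.log10(photons_model / photons_data))
print(f"{magnitude_gap:.1f} orders of magnitude")  # ~2.8, so ptypy warns
```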
The better option is to give a full description of the illumination parameters, which allows you to make some adjustments:
```python
p.scans.ID16A.illumination = u.Param()
p.scans.ID16A.illumination.model = "recon"
p.scans.ID16A.illumination.recon = u.Param()
p.scans.ID16A.illumination.recon.rfile = "./sample1previousrecon.ptyr"
p.scans.ID16A.illumination.photons = None  # this makes sure that the power of the model is estimated from the data
p.scans.ID16A.illumination.aperture = u.Param()
p.scans.ID16A.illumination.aperture.form = None  # this is necessary, otherwise you will end up with a circular aperture/mask on top of your initial illumination
```
With this setup, I got a decent-looking reconstruction with 500 iterations of DM_pycuda and 500 iterations of ML_pycuda.
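For intuition, the rescaling that `photons = None` effectively requests can be sketched in plain numpy. This is an illustration of the idea (match the probe's total power to the average photon count of the data), not ptypy's actual implementation, and all array names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a loaded complex probe and a stack of diffraction frames
probe = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
frames = rng.random((10, 64, 64)) * 1e6  # measured intensities (photon counts)

# Average total photon count per diffraction frame
target_power = frames.sum(axis=(1, 2)).mean()

# Rescale the probe so its total power sum(|P|^2) matches the data
current_power = (np.abs(probe) ** 2).sum()
probe *= np.sqrt(target_power / current_power)

print(np.isclose((np.abs(probe) ** 2).sum(), target_power))  # True
```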
I hope this was helpful. I will close this issue as it seems that there is nothing to be fixed in the preconditioner.
P.S.: For the unrelated far-field issue reported above, please open a separate ticket and if possible provide some example code/data...
I get `nan` after a few iterations of ML_pycuda when I set `scale_precond` to `True`. The DM_pycuda run finishes well, but the ML_pycuda run gets this error:

and this at the end:
This discussion started with issue #323, which was related to another problem.