Hi Julio, thanks for reporting this. First, it is odd that 0fb4d94 is causing you trouble, because it only added tests. At first glance it looks like the NCCL library either was not installed or was not found, so that is on us to fix. But we also no longer include a shortcut for the non-MPI case, so we may need to reimplement that.
For short-term relief, could you try removing cupy from your venv, or alternatively install NCCL? Best, Bjoern
Another option is to catch that error. Could you try adding

```python
except AttributeError:
    pass
```

after line 160 in `multigpu.py` and tell me if that fixes it for you?
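For context, a minimal sketch of what such a guard could look like, assuming the failing line accesses the NCCL bindings in cupy, which are missing when cupy was built without NCCL. This is a hypothetical illustration, not the actual code in `multigpu.py`:

```python
# Hypothetical sketch (not the actual ptypy multigpu.py): guard the NCCL-backed
# communicator setup so that a cupy build without NCCL falls back to the
# single-GPU / non-NCCL path instead of raising an error.
try:
    from cupy.cuda import nccl  # missing when cupy was built without NCCL support
    have_nccl = True
except (ImportError, AttributeError):
    nccl = None
    have_nccl = False
```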
I made a new branch `pycuda-nccl-patch` ... Could you try if that one works for you? @jcesardasilva
Hi @bjoernenders,
Thank you very much for your responsiveness and prompt assistance. Apologies that I could not test it yesterday.
I went directly to the new branch `pycuda-nccl-patch`, and it worked even without uninstalling CuPy.
I am looking forward to testing the multi-GPU approach one day. For now, the single GPU is working great again, and it is about 30% faster than the version I had before commit 0fb4d94, which is very good. Thanks.
Just for information, I now only get the following warning at the beginning, but it does no harm since everything runs properly afterwards:

```
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
```

The numpy version I am using is 1.20.2.
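As an aside, a quick way to investigate this kind of warning is to print the versions of numpy and of the compiled extension that triggers it; the sketch below assumes the extension in question is cupy, which is only a guess here:

```python
# Quick sanity check: the "API version 0xe vs 0xd" message usually means some
# compiled extension was built against a different numpy C-API than the numpy
# that is installed, so printing both versions shows which package to rebuild
# or upgrade.
import numpy
print("numpy:", numpy.__version__)

try:
    import cupy  # assumption: cupy is the extension producing the warning
    print("cupy:", cupy.__version__)
except ImportError:
    print("cupy is not installed in this environment")
```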
Hi again,
well, it was too good to be true. Now I get `nan` after a few iterations of ML. The DM engine finishes well, but ML gives this error:
```
Iteration #89 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.79e+00, Exit 0.00e+00
Time spent in gradient calculation: 0.05
.... in coefficient calculation: 0.06
Iteration #90 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.77e+00, Exit 0.00e+00
Time spent in gradient calculation: 0.05
.... in coefficient calculation: 0.06
Iteration #91 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.76e+00, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
.... in coefficient calculation: 0.06
Iteration #92 of ML_pycuda :: Time 0.131
Errors :: Fourier 0.00e+00, Photons 7.75e+00, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
.... in coefficient calculation: 0.06
Iteration #93 of ML_pycuda :: Time 0.116
Errors :: Fourier 0.00e+00, Photons nan, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
.... in coefficient calculation: 0.06
Iteration #94 of ML_pycuda :: Time 0.116
Errors :: Fourier 0.00e+00, Photons nan, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
```
and this at the end:
```
/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/base/engines/ML_serial.py:237: RuntimeWarning: divide by zero encountered in double_scalars
  self.tmin = dt(-.5 * B[1] / B[2])
/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/base/engines/ML_serial.py:237: RuntimeWarning: invalid value encountered in double_scalars
  self.tmin = dt(-.5 * B[1] / B[2])
```
The ML recon file cannot even be opened with ptypy.plot.
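Purely for illustration, not ptypy's actual fix: the warnings point at the ML line search in `ML_serial.py`, where a zero or non-finite quadratic coefficient `B[2]` produces exactly this divide-by-zero and the resulting NaN then propagates through the reconstruction. A minimal sketch of a guard, with the names `safe_tmin` and `fallback` being hypothetical:

```python
import numpy as np

def safe_tmin(B, fallback=1.0):
    """Hypothetical guard: compute the line-search step -0.5 * B[1] / B[2],
    but fall back to a default step when the coefficients are degenerate
    (zero denominator, inf, or nan) rather than letting NaN propagate."""
    B = np.asarray(B, dtype=float)
    if not np.all(np.isfinite(B)) or B[2] == 0:
        return fallback
    return -0.5 * B[1] / B[2]
```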
Hi Julio,
Well, if you look in the log, you see that the algorithm becomes unstable and creates NaNs somewhere, which causes issues further down the line.
Did ML work on the same data before? Could you post your script? And could you maybe post a plot from right after DM has finished?
Also, you don't benefit much from the GPU implementation if you don't set `numiter_contiguous` to a number greater than 1, like 10 or 20.
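For reference, a hedged sketch of how this could be set in a ptypy parameter tree; the engine block name `engine00` and the surrounding values are placeholders taken from typical ptypy scripts, not from Julio's script:

```python
import ptypy.utils as u

p = u.Param()
p.engines = u.Param()
p.engines.engine00 = u.Param()
p.engines.engine00.name = "ML_pycuda"       # GPU-accelerated ML engine
p.engines.engine00.numiter = 200            # total number of ML iterations (placeholder)
p.engines.engine00.numiter_contiguous = 10  # iterations run back-to-back without host-side interruption
```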
Hi @bjoernenders
Thank you for your prompt reply and your advice.
In fact, I have not yet seen a big time gain from setting `numiter_contiguous` larger than 1; the gain is only about 4% for me on an NVIDIA Tesla V100 GPU. I need a very large value, such as 50 or more, to see anything substantial. Am I doing something wrong?
ML was indeed working before with that same dataset. By "before" I mean early March this year, just before the GPU hackathon.
Here is my script; I put it in a zip file since GitHub does not accept .py attachments: ptypy_prep_recons_ID16A_CUDA.py.zip
And here is the plot from right after DM has finished:
Wait, I ran the code with different datasets and I only get that NaN in ML for a few of them. For the others, which are the majority, it worked properly. So I am a bit puzzled; possibly I am doing something wrong.
It could be that ML might not be stable. Could you retry the exact same script, but with both "ML_serial" and "ML" as alternative engines? I would like to know whether the implementation or the data is causing the trouble.
Ok, "ML" works flawlessly for the dataset which is problematic for "ML_pycuda". "ML_serial" seems to work well so far too, but it is very slow (20 s per iteration). A silly question, since I may not have understood all, but how should I run ML_serial engines in a fast way?
CORRECTION: "ML_serial" also gets the NaN
error. Sorry for my previous message, but I had to wait quite a bit to see the error appearing.
Therefore, only "ML" works well for this dataset.
During the GPU hackathon in March we did notice that there were small differences between ML, ML_serial and ML_pycuda when using the regularizer, but we never managed to find the root cause of the differences. @jcesardasilva could you try the same example without the regularizer?
Thanks Julio. The "<>_serial" engines are just the blueprints for GPU acceleration, so ideally they should never be used and their poor performance doesn't matter. The GPU engines should be identical in their behavior to the serial ones. That ML and ML_serial differ probably requires a bit more investigation to figure out. But I would follow @daurer's advice and see if that helps and the algorithms fall in line; it makes it easier to search for the root cause. Bjoern
Thank you @daurer, thank you @bjoernenders.
I tried what @daurer suggested, deactivating `rel_del2`, and ML still failed. However, I noticed that I had `scale_precond` set to `True` (for some reason I don't remember), which I then deactivated, and it worked.
So the situation is: I can keep the regularizer set to `True`, but I need to deactivate `scale_precond`. Then the ML calculations work for the previously problematic dataset.
Does this help to catch or better understand the problem related to the differences between `ML_pycuda` and `ML`?
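For the record, a hedged sketch of the parameter combination described above, using the attribute names as they appear in this thread (they should be checked against the ptypy parameter documentation for the exact keys):

```python
import ptypy.utils as u

ml = u.Param()
ml.name = "ML_pycuda"
ml.rel_del2 = True        # regularizer kept on (name as used in this thread)
ml.scale_precond = False  # deactivating the preconditioner scaling avoided the NaNs here
```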
@jcesardasilva thanks for this feedback. The pre-conditioner would have been my other guess. Maybe you could create a separate issue for the problems with the pre-conditioner?
Also, I believe this issue here is resolved and can be closed?
Hi @daurer,
Indeed, it can be closed and I am doing it now. I am also opening the issue related to the pre-conditioner problem. Thank you very much.
Hi,
Since commit 0fb4d94, I have had problems running ptypy on GPU. For the moment I have no plans to use several GPUs, but now I cannot even run the code on a single GPU without getting the following error:
Is it only happening to me, or is this something still receiving tweaks before deployment?
Thank you very much and apologies for the disturbance.