ptycho / ptypy

Ptypy - main repository

[PyCUDA] Problems with multi_gpu.py #323

Closed jcesardasilva closed 3 years ago

jcesardasilva commented 3 years ago

Hi,

Since commit 0fb4d94, I have been having problems running ptypy on a GPU. For the moment I have no plans to use several GPUs, but now I cannot even run the code on a single GPU without getting the following error:

  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/core/ptycho.py", line 365, in __init__
    self.run()
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/core/ptycho.py", line 695, in run
    self.run(engine=engine)
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/core/ptycho.py", line 617, in run
    engine.initialize()
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/engines/base.py", line 135, in initialize
    self.engine_initialize()
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/cuda_pycuda/engines/DM_pycuda.py", line 71, in engine_initialize
    self.multigpu = get_multi_gpu_communicator()
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/cuda_pycuda/multi_gpu.py", line 156, in get_multi_gpu_communicator
    comm = MultiGpuCommunicatorNccl()
  File "/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/cuda_pycuda/multi_gpu.py", line 119, in __init__
    self.id = nccl.get_unique_id()
AttributeError: '_UnavailableModule' object has no attribute 'get_unique_id'
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------

Is this only happening to me, or is this something still receiving tweaks before deployment?

Thank you very much and apologies for the disturbance.

bjoernenders commented 3 years ago

Hi Julio, thanks for reporting this. First, it is odd that 0fb4d94 is causing you trouble, because it only added tests. At first glance it looks like the nccl library didn't get installed or wasn't found, so that is on us to fix. But we also no longer include a shortcut for the non-MPI case, so we may need to reimplement that.

For short-term relief, could you try removing cupy from your venv? Or, alternatively, install nccl? Best, Bjoern
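
For context, the AttributeError in the traceback above is what an optional-import guard typically produces: when the NCCL bindings cannot be imported, a placeholder object is stored in place of the module, and any later attribute access on it fails. A minimal sketch of that pattern, using names similar to those in the traceback (illustrative only, not ptypy's actual code):

class _UnavailableModule:
    """Stand-in stored when an optional dependency fails to import."""
    def __init__(self, name, error):
        self.name = name
        self.error = error
    # No real module attributes, so e.g. nccl.get_unique_id() raises AttributeError.

try:
    from cupy.cuda import nccl  # NCCL bindings shipped with CuPy
except ImportError as exc:
    nccl = _UnavailableModule('nccl', exc)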

bjoernenders commented 3 years ago

Another option is to catch that error. Could you try adding

except AttributeError:
    pass

after line 160 in multi_gpu.py and tell me if that fixes it for you?
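
A rough sketch of how that except clause could sit inside get_multi_gpu_communicator in multi_gpu.py, falling back to a plain (non-NCCL) communicator when the bindings are unavailable; the fallback class name and exact structure are assumptions based on the traceback, not a verbatim copy of the file:

def get_multi_gpu_communicator(use_nccl=True):
    if use_nccl:
        try:
            # Raises AttributeError when nccl is only a placeholder module
            return MultiGpuCommunicatorNccl()
        except AttributeError:
            # NCCL bindings not available, fall through to the default
            pass
    # Fall back to the MPI/single-GPU communicator (illustrative name)
    return MultiGpuCommunicatorMpi()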

bjoernenders commented 3 years ago

I made a new branch pycuda-nccl-patch ... Could you check whether that one works for you? @jcesardasilva

jcesardasilva commented 3 years ago

Hi @bjoernenders,

Thank you very much for your responsiveness and prompt assistance. Apologies that I could not test it yesterday. I went straight to the new branch pycuda-nccl-patch and it worked, even without uninstalling CuPy.

I am looking forward to testing the multi-GPU approach one day. For now, single-GPU reconstruction is working well again, and it is about 30% faster than the version I had before commit 0fb4d94, which is very good. Thanks.

jcesardasilva commented 3 years ago

Just for your information, I now only get the following warning at the beginning, but it does no harm since everything runs properly afterwards:

RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

The numpy version I am using is 1.20.2.

jcesardasilva commented 3 years ago

Hi again,

well, it was too good to be true. Now I get NaN after a few iterations of ML. The DM finishes well, but the ML gets this error:

Iteration #89 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.79e+00, Exit 0.00e+00
Time spent in gradient calculation: 0.05
  ....  in coefficient calculation: 0.06
Iteration #90 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.77e+00, Exit 0.00e+00
Time spent in gradient calculation: 0.05
  ....  in coefficient calculation: 0.06
Iteration #91 of ML_pycuda :: Time 0.115
Errors :: Fourier 0.00e+00, Photons 7.76e+00, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
  ....  in coefficient calculation: 0.06
Iteration #92 of ML_pycuda :: Time 0.131
Errors :: Fourier 0.00e+00, Photons 7.75e+00, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
  ....  in coefficient calculation: 0.06
Iteration #93 of ML_pycuda :: Time 0.116
Errors :: Fourier 0.00e+00, Photons nan, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...
Time spent in gradient calculation: 0.05
  ....  in coefficient calculation: 0.06
Iteration #94 of ML_pycuda :: Time 0.116
Errors :: Fourier 0.00e+00, Photons nan, Exit 0.00e+00
Warning! inf or nan found! Trying to continue...

and this at the end:

/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/base/engines/ML_serial.py:237: RuntimeWarning: divide by zero encountered in double_scalars
  self.tmin = dt(-.5 * B[1] / B[2])
/data/id16a/inhouse1/sware/py3venvslurm/cudavenv/lib/python3.8/site-packages/Python_Ptychography_toolbox-0.4.1-py3.8-linux-ppc64le.egg/ptypy/accelerate/base/engines/ML_serial.py:237: RuntimeWarning: invalid value encountered in double_scalars
  self.tmin = dt(-.5 * B[1] / B[2])

The ML reconstruction file cannot even be opened with ptypy.plot.

bjoernenders commented 3 years ago

Hi Julio, well, if you look at the log you can see that the algorithm becomes unstable and creates NaNs somewhere, which causes issues further down the line. Did ML work for the same data before? Could you post your script? And could you maybe post a plot from after DM has finished? Also, you don't benefit much from the GPU implementation if you don't set numiter_contiguous to a number greater than 1, like 10 or 20.
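
For reference, a minimal sketch of setting numiter_contiguous in a ptypy parameter tree; the engine label engine00 and the other values are illustrative and should be adapted to your own script:

from ptypy import utils as u

p = u.Param()
p.engines = u.Param()
p.engines.engine00 = u.Param()
p.engines.engine00.name = 'DM_pycuda'
p.engines.engine00.numiter = 200
# Run blocks of 10 iterations without interruption to cut per-iteration overhead
p.engines.engine00.numiter_contiguous = 10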

jcesardasilva commented 3 years ago

Hi @bjoernenders

Thank you for your prompt reply and your advice.

In fact, I have not yet seen a big gain in time from setting numiter_contiguous larger than 1. The gain is about 4% for me on an NVIDIA Tesla V100 GPU. I need a very large number, such as 50 or more, to see something considerable. Am I doing something wrong?

The ML was indeed working before with that same dataset. By before, I mean at the beginning of March this year, just before the GPU hackathon.

Here is my script; I put it in a zip file since GitHub does not accept .py attachments: ptypy_prep_recons_ID16A_CUDA.py.zip

And here is the plot from after DM has finished:

[image: reconstruction plot after DM]

jcesardasilva commented 3 years ago

Wait, I ran the code with different datasets and I get the NaN in ML only for a few of them. For the others, which are the majority, it worked properly. So I am a bit puzzled. Possibly I am doing something wrong.

bjoernenders commented 3 years ago

It could be that ML is not stable here. Could you retry the exact same script, but with both "ML_serial" and "ML" as alternative engines? I would like to know whether the implementation or the data is causing the trouble.

jcesardasilva commented 3 years ago

Ok, "ML" works flawlessly for the dataset that is problematic for "ML_pycuda". "ML_serial" seems to work well so far too, but it is very slow (20 s per iteration). A silly question, since I may not have understood everything, but how should I run the ML_serial engine in a fast way?

jcesardasilva commented 3 years ago

CORRECTION: "ML_serial" also gets the NaN error. Sorry for my previous message, but I had to wait quite a while to see the error appear. Therefore, only "ML" works well for this dataset.

daurer commented 3 years ago

During the GPU hackathon in March we did notice that there were small differences between ML, ML_serial and ML_pycuda when using the regularizer, but we never managed to find the root cause of those differences. @jcesardasilva could you try the same example without the regularizer?
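
For that test, a minimal sketch of switching the regularizer off in the ML engine parameters, assuming the toggle is reg_del2 as in current ptypy ML engines (engine label and other values illustrative):

from ptypy import utils as u

p_ml = u.Param()
p_ml.name = 'ML_pycuda'   # repeat with 'ML_serial' and 'ML' for comparison
p_ml.numiter = 100
p_ml.reg_del2 = False     # disable the smoothness (Delta^2) regularizer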

bjoernenders commented 3 years ago

Thanks Julio. The "<>_serial" engines are just the blueprints for GPU acceleration, so ideally they should never be used in production and their poor performance doesn't matter. The GPU engines should be identical in behavior to the serial ones. That ML and ML_serial differ probably requires a bit more investigation. But I would follow @daurer's advice and see if that helps and the algorithms fall in line; it makes it easier to search for the root cause. Bjoern

jcesardasilva commented 3 years ago

Thank you @daurer , thank you @bjoernenders

I tried what @daurer suggested, deactivating reg_del2, and the ML still failed. However, I noticed that I had scale_precond set to True (for some reason I don't remember), which I deactivated, and then it worked.

So, the situation is: I can keep the regularizer set to True, but I need to deactivate scale_precond. Then the ML calculations work for the previously problematic dataset.
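
For reference, a minimal sketch of the combination described above, i.e. the regularizer kept on but the preconditioner scaling switched off (parameter names as in the ML engine, other values illustrative):

from ptypy import utils as u

p_ml = u.Param()
p_ml.name = 'ML_pycuda'
p_ml.numiter = 100
p_ml.reg_del2 = True        # keep the smoothness regularizer
p_ml.scale_precond = False  # disabling this avoided the NaNs for the problematic dataset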

Does this help to catch or better understand the problem related to the differences between ML_pycuda and ML?

daurer commented 3 years ago

@jcesardasilva thanks for this feedback. The pre-conditioner would have been my other guess. Maybe you could create a separate issue for the problems with the pre-conditioner?

Also, I believe this issue is now resolved and can be closed?

jcesardasilva commented 3 years ago

Hi @daurer,

Indeed, it can be closed and I am doing so now. I am also opening the issue related to the pre-conditioner problem. Thank you very much.