ptycho / ptypy

Ptypy - main repository

Large dataset loading and memory handling issue #505

Closed. ltang320 closed this issue 1 year ago.

ltang320 commented 1 year ago

For a dataset with 47,206 frames, I crop each diffraction pattern to 256x256 pixels and reconstruct with 4 modes, which produces 188,824 pods in memory. Loading the data is no problem; I use p.frames_per_block = 1000 to load it in chunks.
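For context, a minimal sketch of the relevant settings in the parameter tree (the scan label and surrounding structure are illustrative, not my full script):

```python
import ptypy.utils as u

p = u.Param()
p.frames_per_block = 1000                      # load the 47,206 frames in chunks of 1000

p.scans = u.Param()
p.scans.scan00 = u.Param()                     # "scan00" is a placeholder scan label
p.scans.scan00.data = u.Param()
p.scans.scan00.data.shape = 256                # crop each diffraction pattern to 256x256
p.scans.scan00.coherence = u.Param()
p.scans.scan00.coherence.num_probe_modes = 4   # 4 modes -> 4 pods per frame

p.engines = u.Param()
p.engines.engine00 = u.Param()
p.engines.engine00.name = 'DM_pycuda'          # GPU engine used in the run below
p.engines.engine00.numiter = 500
```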

However, when the reconstruction starts, the following error is raised, which points to a GPU memory issue:

```
==== Starting DM_pycuda-algorithm. =============================================

  Parameter set:
  * id1B0E8R2FKG           : ptypy.utils.parameters.Param(25)
    * numiter              : 500
    * numiter_contiguous   : 100
    * probe_support        : 3
    * probe_fourier_sup... : None
    * record_local_error   : False
    * position_refinement  : ptypy.utils.parameters.Param(10)
      * method             : Annealing
      * start              : None
      * stop               : None
      * interval           : 1
      * nshifts            : 4
      * amplitude          : 1e-06
      * amplitude_decay    : True
      * max_shift          : 2e-06
      * metric             : fourier
      * record             : False
    * probe_update_start   : 2
    * subpix_start         : 0
    * subpix               : linear
    * update_object_first  : True
    * overlap_converge_... : 0.05
    * overlap_max_itera... : 10
    * probe_inertia        : 1e-09
    * object_inertia       : 0.0001
    * fourier_power_bound  : None
    * fourier_relax_factor : 0.05
    * obj_smooth_std       : None
    * clip_object          : None
    * probe_center_tol     : None
    * compute_log_likel... : True
    * probe_update_cuda... : False
    * object_update_cud... : True
    * fft_lib              : reikna
    * alpha                : 1.0
    * name                 : DM_pycuda
  ================================================================================
      P.run()
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 763, in run
      self.run(engine=engine)
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 649, in run
      engine.initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/engines/base.py", line 153, in initialize
      self.engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda_stream.py", line 49, in engine_initialize
      super().engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda.py", line 72, in engine_initialize
      self.context, self.queue = get_context(new_context=True, new_queue=False)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/__init__.py", line 28, in get_context
      raise Exception('Local rank must be smaller than total device count, \
  Exception: Local rank must be smaller than total device count,                 rank=2, rank_local=2, device_count=2
  Traceback (most recent call last):
    File "/asap3/petra3/gpfs/p06/2023/data/11016011/processed/macros/rec_multi_2x2_GPU_init_TL.py", line 275, in <module>
      P.run()
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 763, in run
      self.run(engine=engine)
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 649, in run
      engine.initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/engines/base.py", line 153, in initialize
      self.engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda_stream.py", line 49, in engine_initialize
      super().engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda.py", line 95, in engine_initialize
      super().engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/base/engines/projectional_serial.py", line 137, in engine_initialize
      self._setup_kernels()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda_stream.py", line 55, in _setup_kernels
      super()._setup_kernels()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda.py", line 146, in _setup_kernels
      kern.PROP.allocate()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/kernels.py", line 64, in allocate
      self._fft1 = FFT(self._tmp, self.queue,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/fft.py", line 79, in __init__
      self._set_stream(thr)
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/fft.py", line 94, in _set_stream
      self._ftreikna = self._ftreikna_raw.compile(thr)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/core/computation.py", line 207, in compile
  Traceback (most recent call last):
    File "/asap3/petra3/gpfs/p06/2023/data/11016011/processed/macros/rec_multi_2x2_GPU_init_TL.py", line 275, in <module>
      P.run()
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 763, in run
      self.run(engine=engine)
    File "/home/litang/ptypy_p06/build/lib/ptypy/core/ptycho.py", line 649, in run
      engine.initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/engines/base.py", line 153, in initialize
      self.engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda_stream.py", line 49, in engine_initialize
      super().engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda.py", line 95, in engine_initialize
      super().engine_initialize()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/base/engines/projectional_serial.py", line 137, in engine_initialize
      self._setup_kernels()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda_stream.py", line 55, in _setup_kernels
      super()._setup_kernels()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/engines/projectional_pycuda.py", line 146, in _setup_kernels
      kern.PROP.allocate()
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/kernels.py", line 64, in allocate
      self._fft1 = FFT(self._tmp, self.queue,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/fft.py", line 79, in __init__
      self._set_stream(thr)
    File "/home/litang/ptypy_p06/build/lib/ptypy/accelerate/cuda_pycuda/fft.py", line 94, in _set_stream
      self._ftreikna = self._ftreikna_raw.compile(thr)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/core/computation.py", line 207, in compile
      self._tr_tree, translator, thread, fast_math, compiler_options, keep).finalize()
                                                                            ^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/core/computation.py", line 557, in finalize
      self._tr_tree, translator, thread, fast_math, compiler_options, keep).finalize()
                                                                            ^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/core/computation.py", line 557, in finalize
      new_buf = self._thread.temp_array(
                ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/api.py", line 403, in temp_array
      new_buf = self._thread.temp_array(
                ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/api.py", line 403, in temp_array
      return self.temp_alloc.array(
             ^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 79, in array
      return self.temp_alloc.array(
             ^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 79, in array
      self._allocate(new_id, array.nbytes, dependencies, self._pack_on_alloc)
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 181, in _allocate
      self._allocate(new_id, array.nbytes, dependencies, self._pack_on_alloc)
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 181, in _allocate
      self._fast_add(new_id, size, dep_set)
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 205, in _fast_add
      self._fast_add(new_id, size, dep_set)
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/tempalloc.py", line 205, in _fast_add
      buf = self._thr.allocate(size)
            ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 168, in allocate
      buf = self._thr.allocate(size)
            ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 168, in allocate
      return Buffer(size)
             ^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 60, in __init__
      return Buffer(size)
             ^^^^^^^^^^^^
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 60, in __init__
      self._buffer = cuda.mem_alloc(size)
                     ^^^^^^^^^^^^^^^^^^^^
  pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
      self._buffer = cuda.mem_alloc(size)
                     ^^^^^^^^^^^^^^^^^^^^
  pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
  Exception ignored in: <function Buffer.__del__ at 0x2abdc11632e0>
  Traceback (most recent call last):
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 70, in __del__
      self._buffer.free()
      ^^^^^^^^^^^^
  AttributeError: 'Buffer' object has no attribute '_buffer'
  Exception ignored in: <function Buffer.__del__ at 0x2b046c1af2e0>
  Traceback (most recent call last):
    File "/home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/reikna/cluda/cuda.py", line 70, in __del__
      self._buffer.free()
      ^^^^^^^^^^^^
  AttributeError: 'Buffer' object has no attribute '_buffer'
  -------------------------------------------------------------------
  PyCUDA ERROR: The context stack was not empty upon module cleanup.
  -------------------------------------------------------------------
  A context was still active when the context stack was being
  cleaned up. At this point in our execution, CUDA may already
  have been deinitialized, so there is no way we can finish
  cleanly. The program will be aborted now.
  Use Context.pop() to avoid this problem.
  -------------------------------------------------------------------
  [max-exflg031:25282] *** Process received signal ***
  [max-exflg031:25282] Signal: Aborted (6)
  [max-exflg031:25282] Signal code:  (-6)
  [max-exflg031:25282] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aba8f9a9630]
  [max-exflg031:25282] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aba904fd387]
  [max-exflg031:25282] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aba904fea78]
  [max-exflg031:25282] [ 3] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(+0xca7eb)[0x2abcbc3577eb]
  [max-exflg031:25282] [ 4] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(_ZN5boost19thread_specific_ptrIN6pycuda13context_stackEE15default_deleterEPS2_+0xf)[0x2abcbc3577ff]
  [max-exflg031:25282] [ 5] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/../../../libboost_thread.so.1.78.0(_ZN5boost6detail12set_tss_dataEPKvPFvPFvPvES3_ES5_S3_b+0x62)[0x2abcbc44efd2]
  [max-exflg031:25282] [ 6] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(_ZN5boost19thread_specific_ptrIN6pycuda13context_stackEED1Ev+0x16)[0x2abcbc356726]
  [max-exflg031:25282] [ 7] /lib64/libc.so.6(+0x39ce9)[0x2aba90500ce9]
  [max-exflg031:25282] [ 8] /lib64/libc.so.6(+0x39d37)[0x2aba90500d37]
  [max-exflg031:25282] [ 9] /lib64/libc.so.6(__libc_start_main+0xfc)[0x2aba904e955c]
  [max-exflg031:25282] [10] python(+0x29335d)[0x55b721fc735d]
  [max-exflg031:25282] *** End of error message ***
  -------------------------------------------------------------------
  PyCUDA ERROR: The context stack was not empty upon module cleanup.
  -------------------------------------------------------------------
  A context was still active when the context stack was being
  cleaned up. At this point in our execution, CUDA may already
  have been deinitialized, so there is no way we can finish
  cleanly. The program will be aborted now.
  Use Context.pop() to avoid this problem.
  -------------------------------------------------------------------
  [max-exflg031:25281] *** Process received signal ***
  [max-exflg031:25281] Signal: Aborted (6)
  [max-exflg031:25281] Signal code:  (-6)
  [max-exflg031:25281] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b012ce90630]
  [max-exflg031:25281] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b012d9e4387]
  [max-exflg031:25281] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b012d9e5a78]
  [max-exflg031:25281] [ 3] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(+0xca7eb)[0x2b035983c7eb]
  [max-exflg031:25281] [ 4] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(_ZN5boost19thread_specific_ptrIN6pycuda13context_stackEE15default_deleterEPS2_+0xf)[0x2b035983c7ff]
  [max-exflg031:25281] [ 5] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/../../../libboost_thread.so.1.78.0(_ZN5boost6detail12set_tss_dataEPKvPFvPFvPvES3_ES5_S3_b+0x62)[0x2b0359933fd2]
  [max-exflg031:25281] [ 6] /home/litang/.conda/envs/ptypy_p06/lib/python3.11/site-packages/pycuda/_driver.cpython-311-x86_64-linux-gnu.so(_ZN5boost19thread_specific_ptrIN6pycuda13context_stackEED1Ev+0x16)[0x2b035983b726]
  [max-exflg031:25281] [ 7] /lib64/libc.so.6(+0x39ce9)[0x2b012d9e7ce9]
  [max-exflg031:25281] [ 8] /lib64/libc.so.6(+0x39d37)[0x2b012d9e7d37]
  [max-exflg031:25281] [ 9] /lib64/libc.so.6(__libc_start_main+0xfc)[0x2b012d9d055c]
  [max-exflg031:25281] [10] python(+0x29335d)[0x55acc70e235d]
  [max-exflg031:25281] *** End of error message ***
  --------------------------------------------------------------------------
  Primary job  terminated normally, but 1 process returned
  a non-zero exit code. Per user-direction, the job has been aborted.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun detected that one or more processes exited with non-zero status, thus causing
  the job to be terminated. The first process to do so was:

    Process name: [[11660,1],2]
    Exit code:    1
  --------------------------------------------------------------------------

```

Could you help me with this problem? Thanks a lot! @daurer @bjoernenders

daurer commented 1 year ago

@ltang320 Thanks for sharing the details and sorry that we could not have a look at this during the developer meeting. Could you try a smaller frames_per_block, something like 100 or even 10? This will break the data and model up into smaller chunks and should help avoid the GPU memory issues you have reported above.

The DM_pycuda engine can handle data that does not all fit into the memory of a single GPU by using CUDA streams, but a single chunk (frames_per_block) still needs to fit into GPU memory, so reducing this parameter should help.
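As a rough illustration of why the block size matters, here is a back-of-envelope estimate of the per-block GPU footprint. It assumes complex64 exit waves (one per pod) and float32 diffraction frames, and ignores FFT scratch buffers and the object/probe arrays, so the real footprint is larger:

```python
# Hypothetical order-of-magnitude estimate, not a ptypy API call.
frames_per_block = 1000
modes = 4
npix = 256 * 256

exit_waves = frames_per_block * modes * npix * 8   # complex64 -> ~2.1 GB
diff_data = frames_per_block * npix * 4            # float32   -> ~0.26 GB

print(f"~{(exit_waves + diff_data) / 1e9:.1f} GB per block (before FFT/work buffers)")
# With frames_per_block = 100 the same estimate drops to ~0.24 GB.
```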

ltang320 commented 1 year ago

> @ltang320 Thanks for sharing the details and sorry that we could not have a look at this during the developer meeting. Could you try a smaller frames_per_block, something like 100 or even 10? This will break the data and model up into smaller chunks and should help avoid the GPU memory issues you have reported above.
>
> The DM_pycuda engine can handle data that does not all fit into the memory of a single GPU by using CUDA streams, but a single chunk (frames_per_block) still needs to fit into GPU memory, so reducing this parameter should help.

@daurer Thanks for your reply. It works with frames_per_block = 15. From my testing so far, the minimum frames_per_block should be above 10; with this setting the memory per block stays below 500 MB and the reconstruction runs properly.

Thanks a lot for the help from both of you. @bjoernenders @daurer