jaydu1 / VITAE

Joint Trajectory Inference for Single-cell Genomics Using Deep Learning with a Mixture Prior
https://jaydu1.github.io/VITAE/
MIT License

Running model.init_inference in GPU version failed #4

Closed: rmathieu25 closed this issue 3 years ago

rmathieu25 commented 3 years ago

Hello, model.init_inference is very slow with the CPU version (but it does run), and I cannot get it to run at all with the GPU version.

I get the following error:

# initialize inference
model.init_inference(batch_size=128, 
                     L=150,            # L is the number of MC samples
                     dimred='umap',    # dimension reduction methods
                     #**kwargs         # extra key-value arguments for dimension reduction algorithms.    
                     random_state=seed
                    ) 
# after initialization, we can access some variables by model.pc_x, model.w, model.w_tilde, etc..

Computing posterior estimations over mini-batches.

---------------------------------------------------------------------------

ResourceExhaustedError                    Traceback (most recent call last)

<ipython-input-27-91c48b13b6e4> in <module>()
      4                      dimred='umap',    # dimension reduction methods
      5                      #**kwargs         # extra key-value arguments for dimension reduction algorithms.
----> 6                      random_state=seed
      7                     ) 
      8 # after initialization, we can access some variables by model.pc_x, model.w, model.w_tilde, etc..

10 frames

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

ResourceExhaustedError:  OOM when allocating tensor with shape[128,150,1653,57] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node Tile_1 (defined at /usr/local/lib/python3.7/dist-packages/VITAE/model.py:367) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference__get_inference_4681280]

Function call stack:
_get_inference

I tried reducing the batch size to 64, 32, 16, and 8, but all of them failed. I am not running out of memory.

This seems to be due to the size of the input data: when I reduce the number of cells in my data, it works.

Thank you in advance.

Best regards

jaydu1 commented 3 years ago

Hi,

The OOM error occurs because your GPU memory cannot hold the entire mini-batch tensor. Here are some ways to work around the issue:

  1. Use a smaller batch size B.
  2. Use a smaller MC sample size L.
  3. Use tf.float32 for training, although this may not guarantee reproducibility.
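
For intuition, the tensor in your error message is already too large for a typical GPU on its own. Here is a rough back-of-the-envelope check; only the first two dimensions (batch size B=128 and MC sample size L=150) clearly match the arguments in your call, and what the last two dimensions mean inside model.py is an assumption here:

# Rough memory footprint of the tensor from the OOM message:
# shape [128, 150, 1653, 57], dtype double (float64, 8 bytes/element).
n_elements = 128 * 150 * 1653 * 57
n_bytes = n_elements * 8              # float64 = 8 bytes per element
print(n_bytes / 1024**3)              # ~13.5 GiB for this single tensor
# float32 would halve this, and B and L both enter linearly, which is
# why shrinking B, L, or the dtype is the effective knob.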

Reducing the number of cells should not help unless you input only a few cells, so what you observed is quite strange. In your case, you could try setting L=50 or even smaller. For example, if the maximum batch size that works for training is 512, you need to make sure that B*L is around 512; see the sketch below.
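
Concretely, a call with a much smaller B*L product could look like the following. The values B=8 and L=64 are only illustrative (chosen so that B*L = 512); the arguments are the same ones from your snippet above:

# Re-initialize inference with a smaller batch size B and MC sample size L,
# keeping B * L around 512 (here 8 * 64 = 512).
model.init_inference(batch_size=8,
                     L=64,             # fewer MC samples -> smaller tensors
                     dimred='umap',
                     random_state=seed)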

Thanks.