themattinthehatt / behavenet

Toolbox for analyzing behavioral videos and neural activity
https://behavenet.readthedocs.io/
MIT License

AE training CUDA out of memory with 8GB GPU #24

Closed: obarnstedt closed this issue 3 years ago

obarnstedt commented 3 years ago

Hi all, first of all, thanks to everyone involved for sharing this package! I have followed the tutorials step by step and would like to train an autoencoder on my data, but unfortunately training aborts (well, it becomes idle) during the first training epoch with the following error message:

Caught exception in worker thread CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 7.79 GiB total capacity; 4.45 GiB already allocated; 1.91 GiB free; 4.47 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "behavenet/fitting/ae_grid_search.py", line 109, in main
    fit(hparams, model, data_generator, exp, method='ae')
  File "/home/oliver/Git/behavenet/behavenet/fitting/training.py", line 345, in fit
    loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 761, in loss
    x_hat, _ = self.forward(x_in, dataset=dataset)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 713, in forward
    x, pool_idx, outsize = self.encoding(x, dataset=dataset)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 211, in forward
    x = layer(x)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 559, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/functional.py", line 1063, in leaky_relu
    result = torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 7.79 GiB total capacity; 4.45 GiB already allocated; 1.91 GiB free; 4.47 GiB reserved in total by PyTorch)

As it says, I have an 8 GB GeForce RTX 2080 that has handled most ML tasks fairly well. I have tried decreasing the batch size to 200 in the experiment json and setting "tt_n_gpu_trials" to 400 (from 1000) and "mem_limit_gb" to 6.0 (from 8.0) in the ae_compute.json, but the same memory error still crops up. Does anyone have suggestions on how to work around this, rather than buying a new GPU? I should probably also add that another error message occurs beforehand (before training, during model construction):

THCudaCheck FAIL file=/tmp/pip-req-build-ufslq_a9/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected
constructing model...Caught exception in worker thread cuda runtime error (100) : no CUDA-capable device is detected at /tmp/pip-req-build-ufslq_a9/aten/src/THC/THCGeneral.cpp:50
Traceback (most recent call last):
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "behavenet/fitting/ae_grid_search.py", line 84, in main
    model.to(hparams['device'])
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 425, in to
    return self._apply(convert)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 223, in _apply
    param_applied = fn(param)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 423, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/cuda/__init__.py", line 197, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /tmp/pip-req-build-ufslq_a9/aten/src/THC/THCGeneral.cpp:50

Seeing that PyTorch 1.4.0 is used, which is apparently not compatible with CUDA 10.0, I upgraded to CUDA 10.1, but the error still appears twice during model construction without actually exiting. I would be grateful for any help or pointers! Oliver

themattinthehatt commented 3 years ago

Hi @obarnstedt sorry for the late reply - let's tackle these one at a time. The out-of-memory error shows that your GPU already has 4.5 GB allocated to other processes - do you have another instance of PyTorch running, for example? The code we use to check the size of a model is pretty rudimentary: it only computes the total memory footprint of the model and does not check whether that much space is actually available.

You could try killing the other processes running on the GPU, and/or reducing the batch size (I usually use a batch size between 100 and 200). How large are your images in pixels?
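
A quick way to see what is actually occupying the GPU (a generic check, not part of behavenet; it assumes the NVIDIA driver utilities are installed) is to query nvidia-smi, e.g. from Python:

import subprocess

# per-GPU memory usage
print(subprocess.run(
    ['nvidia-smi', '--query-gpu=index,memory.used,memory.total', '--format=csv'],
    capture_output=True, text=True).stdout)

# processes currently holding GPU memory; stale ones can be killed before training
print(subprocess.run(
    ['nvidia-smi', '--query-compute-apps=pid,process_name,used_memory', '--format=csv'],
    capture_output=True, text=True).stdout)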

obarnstedt commented 3 years ago

Hi @themattinthehatt, thanks for your reply! I'm not running any other processes on the GPU; in fact, I have tried many times now to restart the machine and run this code right after restarting and activating the environment, to no avail. I have also tried lowering "approx_batch_size" in the params.json down to 10, without any effect on the error messages (it always tries to allocate 2.71 GiB on top of the 4.46 GiB already occupied). Are there any other parameters that would affect memory allocation? Or would the only option be to get hold of a larger GPU? The images are 782x582 pixels, but I could of course downsample... I am also still wondering why the GPU is apparently not recognized a few steps earlier, because if I check PyTorch CUDA access separately, it has no trouble detecting the GPU. Thanks, Oliver
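
(The standalone check I mean is just a minimal script along these lines, run inside the behavenet conda environment, and it finds the GPU without problems:)

import torch

# basic sanity check that this PyTorch build can see the GPU
print(torch.__version__)            # 1.4.0 in this environment
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.is_available())    # True
print(torch.cuda.device_count())    # 1
print(torch.cuda.get_device_name(0))
x = torch.zeros(1).cuda()           # allocating a small tensor on the GPU also works
print(x.device)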

obarnstedt commented 3 years ago

Hi, downsampling images to 391x291 unfortunately leads to another error message, related to data dimensions:

epoch 0000/1000
  0%|                                                                                                                                   | 0/24 [00:00<?, ?it/s]using data from following sessions:
/mnt/ag-remy-2/Imaging/OB/Results/behavenet/remy/3/207/103
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
remy_3_207_103
    signals: ['images', 'labels']
    transforms: OrderedDict([('images', None), ('labels', None)])
    paths: OrderedDict([('images', '/mnt/ag-remy-2/Imaging/OB/Data/Lavision_2018/behavenet/remy/3/207/103/data.hdf5'), ('labels', '/mnt/ag-remy-2/Imaging/OB/Data/Lavision_2018/behavenet/remy/3/207/103/data.hdf5')])

Caught exception in worker thread size mismatch, m1: [200 x 10240], m2: [40960 x 4] at /tmp/pip-req-build-ufslq_a9/aten/src/THC/generic/THCTensorMathBlas.cu:290
Traceback (most recent call last):
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "behavenet/fitting/ae_grid_search.py", line 109, in main
    fit(hparams, model, data_generator, exp, method='ae')
  File "/home/oliver/Git/behavenet/behavenet/fitting/training.py", line 345, in fit
    loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 886, in loss
    x_hat, _ = self.forward(x_in, labels=y_in, labels_2d=y_2d_in, dataset=dataset)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 834, in forward
    x, pool_idx, outsize = self.encoding(x, dataset=dataset)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 218, in forward
    return self.FF(x), pool_idx, target_output_size
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [200 x 10240], m2: [40960 x 4] at /tmp/pip-req-build-ufslq_a9/aten/src/THC/generic/THCTensorMathBlas.cu:290

themattinthehatt commented 3 years ago

The approx_batch_size parameter is unfortunately a bit misleading - it is only used to calculate the model's (approximate) memory footprint on the GPU, not as the actual batch size. The actual batch size is determined when the hdf5 is constructed, by the number of frames you allocate per "trial".
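
As a rough illustration of why this matters (the numbers below are only a back-of-the-envelope sketch; the intermediate conv-layer activations and gradients add a large multiple on top of the raw input):

# memory taken by one batch of raw float32 frames, before any conv layers
frames_per_batch = 200            # e.g. frames per "trial" in the hdf5 (the 200 in your traceback)
n_channels = 1                    # grayscale behavioral video
y_pix, x_pix = 582, 782           # the original image size in this thread
gib = frames_per_batch * n_channels * y_pix * x_pix * 4 / 1024**3
print('raw input per batch: %.2f GiB' % gib)   # ~0.34 GiB
# halving each spatial dimension (or shortening trials) scales this down directly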

Your original images are quite large, at least compared to what I typically use (~200x200). Downsampling should definitely help, but it appears that may have led to this new error. Did you update the x_pixels and y_pixels params in the data json to reflect the new image size?
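
If it is easier, those fields can also be updated programmatically; the path and the x/y assignment below are placeholders that should be double-checked against how the frames are stored in your hdf5:

import json

data_json = '/path/to/your/data.json'   # placeholder path to the data json

with open(data_json, 'r') as f:
    cfg = json.load(f)

# set these to the downsampled frame dimensions; assuming 782x582 was (x, y),
# the half-size frames would be:
cfg['x_pixels'] = 391
cfg['y_pixels'] = 291

with open(data_json, 'w') as f:
    json.dump(cfg, f, indent=4)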

obarnstedt commented 3 years ago

Hi, I downsampled the images to 192x258 with 300 frames per trial, corrected "y_pixels" and "x_pixels" in the json, and training is now running fine, using about 6 GB on the GPU. Thanks for your help! Oliver