Closed by obarnstedt 3 years ago
Hi @obarnstedt, sorry for the late reply - let's tackle these one at a time. The out of memory error shows that your GPU already has 4.5 GB allocated to other processes - do you have another instance of PyTorch running, for example? The code we use to check the size of a model is pretty rudimentary: it only computes the total memory footprint of the model and does not check whether that much space is actually available.
You could try killing the other processes running on the GPU, and/or reducing the batch size (I usually use a batch size between 100 and 200). How large are your images in pixels?
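One way to check for other processes holding GPU memory is to query nvidia-smi. The snippet below is a minimal sketch, not part of behavenet: the helper name list_gpu_processes is hypothetical, and it assumes the NVIDIA driver tools are on the PATH (it returns an empty list otherwise).

```python
import shutil
import subprocess


def list_gpu_processes():
    """Return (pid, process_name, used_memory) tuples reported by nvidia-smi.

    Hypothetical helper for illustration; returns [] when nvidia-smi is
    missing or the query fails.
    """
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools found on this machine
    try:
        out = subprocess.run(
            [
                "nvidia-smi",
                "--query-compute-apps=pid,process_name,used_memory",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    rows = []
    for line in out.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


if __name__ == "__main__":
    # One line per compute process currently using the GPU.
    for pid, name, mem in list_gpu_processes():
        print(pid, name, mem)
```

If the list is empty right before training starts, the memory reported in the error is more likely being taken by the training process itself.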
Hi @themattinthehatt, thanks for your reply! I'm not running any other processes on the GPU; in fact I have tried many times now to reboot and run this code right after activating the environment, to no avail. I have also tried lowering the "approx_batch_size" in the params.json down to 10 now, without any effect on the error messages (it always tries to allocate 2.71 GB on top of 4.46 GB already occupied). Are there any other parameters that would affect memory allocation? Or would the only chance be to get hold of a larger GPU? Images are 782x582, but I could of course downsample... I am also still wondering about the fact that it doesn't seem to recognise the GPU some steps before, because if I check PyTorch CUDA GPU access separately, it has no trouble detecting the GPU. Thanks, Oliver
Hi, downsampling images to 391x291 unfortunately leads to another error message, related to data dimensions:
epoch 0000/1000
0%| | 0/24 [00:00<?, ?it/s]using data from following sessions:
/mnt/ag-remy-2/Imaging/OB/Results/behavenet/remy/3/207/103
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
remy_3_207_103
signals: ['images', 'labels']
transforms: OrderedDict([('images', None), ('labels', None)])
paths: OrderedDict([('images', '/mnt/ag-remy-2/Imaging/OB/Data/Lavision_2018/behavenet/remy/3/207/103/data.hdf5'), ('labels', '/mnt/ag-remy-2/Imaging/OB/Data/Lavision_2018/behavenet/remy/3/207/103/data.hdf5')])
Caught exception in worker thread size mismatch, m1: [200 x 10240], m2: [40960 x 4] at /tmp/pip-req-build-ufslq_a9/aten/src/THC/generic/THCTensorMathBlas.cu:290
Traceback (most recent call last):
File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "behavenet/fitting/ae_grid_search.py", line 109, in main
fit(hparams, model, data_generator, exp, method='ae')
File "/home/oliver/Git/behavenet/behavenet/fitting/training.py", line 345, in fit
loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 886, in loss
x_hat, _ = self.forward(x_in, labels=y_in, labels_2d=y_2d_in, dataset=dataset)
File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 834, in forward
x, pool_idx, outsize = self.encoding(x, dataset=dataset)
File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/oliver/Git/behavenet/behavenet/models/aes.py", line 218, in forward
return self.FF(x), pool_idx, target_output_size
File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/oliver/anaconda3/envs/behavenet/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [200 x 10240], m2: [40960 x 4] at /tmp/pip-req-build-ufslq_a9/aten/src/THC/generic/THCTensorMathBlas.cu:290
The approx_batch_size parameter is unfortunately a bit misleading - this is only used for the calculation of the model's (approximate) memory footprint on the GPU, not the actual batch size. The actual batch size is determined through the construction of the hdf5, and the number of frames you allocate per "trial".
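To get a feel for how frames per trial and image size drive memory use, the sketch below estimates the size of one batch of input frames. It is a rough illustration under stated assumptions, not behavenet code: the function name is hypothetical, frames are assumed to be single-channel float32, and the 300-frame trial length is an illustrative value. It counts only the input tensor; activations and gradients inside the model will take considerably more.

```python
def batch_input_gib(n_frames, y_pix, x_pix, n_channels=1, bytes_per_value=4):
    """Approximate size in GiB of one batch of input frames.

    Assumes float32 values (4 bytes) and one channel by default; both are
    assumptions made for illustration.
    """
    n_values = n_frames * n_channels * y_pix * x_pix
    return n_values * bytes_per_value / 2**30


# Original image size from this thread vs. a downsampled size,
# both with an assumed 300 frames per trial.
full = batch_input_gib(300, 582, 782)
small = batch_input_gib(300, 192, 258)

if __name__ == "__main__":
    print(f"782x582 input batch: {full:.3f} GiB")
    print(f"258x192 input batch: {small:.3f} GiB")
    print(f"ratio: {full / small:.1f}x")
```

Because the batch is one trial, shortening trials when building the hdf5 or downsampling the frames are the two levers that shrink this number.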
Your original images are quite large, at least compared to what I typically use (~200x200). Downsampling should definitely help, but it appears that might have led to this new error. Did you update the x_pix and y_pix params in the data json to reflect the new image sizes?
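The numbers in the size mismatch are consistent with that explanation. The traceback shows a linear layer expecting 40960 input features (m2) but receiving 10240 (m1), a factor of 4, which matches the drop in pixel count when both image dimensions are halved. The sketch below only checks that arithmetic; it assumes the flattened feature count scales in proportion to the number of pixels, which is not confirmed for this architecture.

```python
# Image sizes reported in the thread, as (height, width).
original = (582, 782)
downsampled = (291, 391)

# Feature counts taken from the error message:
# m1: [200 x 10240], m2: [40960 x 4]
received_features = 10240   # what the encoder produced for the new images
expected_features = 40960   # what the linear layer was built to accept

pixel_ratio = (original[0] * original[1]) / (downsampled[0] * downsampled[1])
feature_ratio = expected_features / received_features

if __name__ == "__main__":
    print("pixel ratio:  ", pixel_ratio)
    print("feature ratio:", feature_ratio)
    # Equal ratios suggest the model was still constructed for the
    # original image size while the data had already been downsampled.
    print("consistent:", pixel_ratio == feature_ratio)
```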
Hi, I downsampled images to 192x258, with 300 frames per trial, and corrected the "y_pixels" and "x_pixels" in the json, and the training is now running fine, using about 6GB on the GPU. Thanks for your help! Oliver
Hi all, first of all, thanks to everyone involved for sharing this package! I have followed the tutorials step by step, and would like to train an autoencoder on my data, but unfortunately my training aborts (well, it becomes idle) in the first training epoch with the following error message:
As it says, I have an 8GB GeForce RTX 2080 that has been handling most ML tasks fairly well. I have tried decreasing batch size to 200 in the experiment json and set "tt_n_gpu_trials" to 400 (from 1000) and "mem_limit_gb" to 6.0 (from 8.0) in the ae_compute.json, but still the same memory error crops up. Does anyone have any suggestions for how to work around this rather than buying a new GPU? I should probably also add that beforehand, another error message occurs (before training, during model construction):
Seeing that pytorch 1.4.0 is used, which is apparently not compatible with CUDA 10.0, I have upgraded to CUDA 10.1, but the error still appears twice during model construction, without actually exiting. I would be grateful for any help or pointers! Oliver
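A quick way to see which CUDA version the installed PyTorch build targets, and whether it can reach the GPU from inside the behavenet environment, is sketched below. The function name cuda_report is hypothetical; the torch attributes it reads are standard, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
def cuda_report():
    """Collect basic PyTorch/CUDA facts for the active environment."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the PyTorch build was compiled against;
        # None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running this in the same environment and shell used to launch training helps rule out a mismatch between the environment that passes the separate check and the one the training script actually uses.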