Running Code with Multiple GPUs

faezeamin commented 7 months ago

Thank you for providing the code!

I'd like to run it using multiple GPUs with my own dataset, but I encountered the following error:

—---

DATA CONFIG: lab: han expt: jenelia-exp animal: HH09 session: S07_20210611 n_input_channels: 2 y_pixels: 304 x_pixels: 288 use_output_mask: False frame_rate: 30.0 neural_type: ca neural_bin_size: 0.03333333333333333 approx_batch_size: 200

COMPUTE CONFIG: device: cuda n_parallel_gpus: 4 gpus_viz: 0;1;2;3 tt_n_gpu_trials: 128 tt_n_cpu_trials: 1000 tt_n_cpu_workers: 5 mem_limit_gb: 7

TRAINING CONFIG: export_train_plots: True export_latents: True pretrained_weights_path: None val_check_interval: 1 learning_rate: 0.0001 max_n_epochs: 1000 min_n_epochs: 10 enable_early_stop: False early_stop_history: 10 rng_seed_train: None as_numpy: False batch_load: True rng_seed_data: 0 train_frac: 1.0 trial_splits: 8;1;1;0

MODEL CONFIG: experiment_name: dim_search model_type: conv n_ae_latents: 16 l2_reg: 0.0 rng_seed_model: 0 fit_sess_io_layers: False ae_arch_json: None model_class: ae conditional_encoder: False msp.alpha: None vae.beta: 1 vae.beta_anneal_epochs: 100 beta_tcvae.beta: 1 beta_tcvae.beta_anneal_epochs: 100 ps_vae.alpha: 1 ps_vae.beta: 1 ps_vae.gamma: 1 ps_vae.delta: 1 ps_vae.anneal_epochs: 100 n_background: 3 n_sessions_per_batch: 1

using data from following sessions: /root/capsule/scratch/results/han/jenelia-exp/HH09/S07_20210611 constructing data generator...done Generator contains 1 SingleSessionDatasetBatchedLoad objects: han_jenelia-exp_HH09_S07_20210611 signals: ['images'] transforms: OrderedDict([('images', None)]) paths: OrderedDict([('images', '/root/capsule/data/base-data-dir/han/jenelia-exp/HH09/S07_20210611/data.hdf5')])

constructing model...Initializing with random weights done CustomDataParallel( (module): AE( (encoding): ConvAEEncoder( (encoder): ModuleList( (zero_pad0): ZeroPad2d((1, 2, 1, 2)) (conv0): Conv2d(2, 32, kernel_size=(5, 5), stride=(2, 2)) (relu0): LeakyReLU(negative_slope=0.05) (zero_pad1): ZeroPad2d((1, 2, 1, 2)) (conv1): Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2)) (relu1): LeakyReLU(negative_slope=0.05) (zero_pad2): ZeroPad2d((1, 2, 1, 2)) (conv2): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2)) (relu2): LeakyReLU(negative_slope=0.05) (zero_pad3): ZeroPad2d((1, 2, 1, 2)) (conv3): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2)) (relu3): LeakyReLU(negative_slope=0.05) (zero_pad4): ZeroPad2d((1, 1, 0, 1)) (conv4): Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5)) (relu4): LeakyReLU(negative_slope=0.05) ) (FF): Linear(in_features=8192, out_features=16, bias=True) ) (decoding): ConvAEDecoder( (FF): Linear(in_features=16, out_features=8192, bias=True) (decoder): ModuleList( (convtranspose0): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5)) (relu0): LeakyReLU(negative_slope=0.05) (convtranspose1): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2)) (relu1): LeakyReLU(negative_slope=0.05) (convtranspose2): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2)) (relu2): LeakyReLU(negative_slope=0.05) (convtranspose3): ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2)) (relu3): LeakyReLU(negative_slope=0.05) (convtranspose4): ConvTranspose2d(32, 2, kernel_size=(5, 5), stride=(2, 2)) (sigmoid4): Sigmoid() ) ) ) ) epoch 0000/1000 0%| | 0/256 [00:09<?, ?it/s] Caught exception in worker thread CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/test_tube/argparse_hopt.py", line 39, in optimize_parallel_gpu_private results = train_function(trial_params, gpu_id_set) File "/behavenet/behavenet/fitting/ae_grid_search.py", line 112, in main fit(hparams, model, data_generator, exp, method='ae') File "/root/capsule/behavenet/behavenet/fitting/training.py", line 347, in fit loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True) File "/root/capsule/behavenet/behavenet/models/aes.py", line 766, in loss loss.backward() File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs) File "/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

—---

It seems that the code is not recognizing all four GPUs and is unable to utilize their capacity. In my troubleshooting efforts, I've explored the following steps:

configurations are set according to [user guide documentation]: “Training an AE can be slow: you can speed up the training by parallelizing over multiple gpus. To do this, just specify n_parallel_gpus to be the number of gpus you wish to use per model. The code will split up the gpus specified in gpus_viz into groups of size n_parallel_gpus (or less if there are leftover gpus) and run the models accordingly.”
The model is fitted on cloud computing - Code Ocean - using a four GPU machine which has the following properties:

GPU 0: Tesla M60, 7.982743552GB GPU 1: Tesla M60, 7.982743552GB GPU 2: Tesla M60, 7.982743552GB GPU 3: Tesla M60, 7.982743552GB

PyTorch and Cuda Versions are as follows:

PyTorch Version: 1.12.1+cu116 CUDA Version: 11.6

Nvidia-smi

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+

Previously, I fitted this model architecture, which is the same one as the paper, on the same dataset as now, except being downsampled into 128 x 128, to use on a single GPU machine (14 GB). For that try, I used the same platform (Code Ocean), and the code worked successfully.

In the current run, I add the second camera view, and keep the frame size as in the original data (304 * 288). it seems that the code cannot identify other GPUs or doesn't implement their memories.

I tried running the integration test, and here is the final result:

================== Integration Test Results ==================

ae: passed arhmm: passed neural-ae: passed neural-ae-me: passed neural-labels: passed neural-arhmm: passed ae-multisession: passed vae: passed beta-tcvae: passed cond-ae-msp: passed cond-vae: passed ps-vae: passed msps-vae-multisession: passed labels-images: passed

total time to perform integration test: 195.396645 sec

The code works properly in CPU mode on this data.
I tried "mem_limit_gb": 5,6,7, 8, 24.0. Also, reduced "tt_n_gpu_trials" to 128. None of them helped.
Dataset consists of trials of different length, with mean: 1772, std: 604 (frames per trial).

Despite these efforts, the issue persists. I would greatly appreciate any insights or suggestions you may have. Thank you!

themattinthehatt commented 7 months ago

Hi @faezeamin , I have not tried the multi-gpu training in several years - I can test this out on my end after the thanksgiving break and get back to you. In the meantime, is requesting a GPU from code ocean with more memory possible for you?

faezeamin commented 7 months ago

Thank you for your prompt response! Yes - the model is functional on a single GPU with a size of 15.65 GB. But I'm interested in exploring the possibility of faster run-times using multiple GPUs, if feasible.

themattinthehatt commented 6 months ago

@faezeamin sorry for not getting to this yet, haven't forgotten about it though

faezeamin commented 4 months ago

Hi @themattinthehatt - Just following up on this issue. Have you got a chance to take look on multiple GPU analysis? Thanks, -Faeze

themattinthehatt / behavenet

Running Code with Multiple GPUs #35