themattinthehatt / behavenet

Toolbox for analyzing behavioral videos and neural activity
https://behavenet.readthedocs.io/
MIT License

Running Code with Multiple GPUs #35

Open faezeamin opened 7 months ago

faezeamin commented 7 months ago

Thank you for providing the code!

I'd like to run it using multiple GPUs with my own dataset, but I encountered the following error:

---

DATA CONFIG: lab: han expt: jenelia-exp animal: HH09 session: S07_20210611 n_input_channels: 2 y_pixels: 304 x_pixels: 288 use_output_mask: False frame_rate: 30.0 neural_type: ca neural_bin_size: 0.03333333333333333 approx_batch_size: 200

COMPUTE CONFIG: device: cuda n_parallel_gpus: 4 gpus_viz: 0;1;2;3 tt_n_gpu_trials: 128 tt_n_cpu_trials: 1000 tt_n_cpu_workers: 5 mem_limit_gb: 7

TRAINING CONFIG: export_train_plots: True export_latents: True pretrained_weights_path: None val_check_interval: 1 learning_rate: 0.0001 max_n_epochs: 1000 min_n_epochs: 10 enable_early_stop: False early_stop_history: 10 rng_seed_train: None as_numpy: False batch_load: True rng_seed_data: 0 train_frac: 1.0 trial_splits: 8;1;1;0

MODEL CONFIG: experiment_name: dim_search model_type: conv n_ae_latents: 16 l2_reg: 0.0 rng_seed_model: 0 fit_sess_io_layers: False ae_arch_json: None model_class: ae conditional_encoder: False msp.alpha: None vae.beta: 1 vae.beta_anneal_epochs: 100 beta_tcvae.beta: 1 beta_tcvae.beta_anneal_epochs: 100 ps_vae.alpha: 1 ps_vae.beta: 1 ps_vae.gamma: 1 ps_vae.delta: 1 ps_vae.anneal_epochs: 100 n_background: 3 n_sessions_per_batch: 1

using data from following sessions:
/root/capsule/scratch/results/han/jenelia-exp/HH09/S07_20210611
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
    han_jenelia-exp_HH09_S07_20210611
    signals: ['images']
    transforms: OrderedDict([('images', None)])
    paths: OrderedDict([('images', '/root/capsule/data/base-data-dir/han/jenelia-exp/HH09/S07_20210611/data.hdf5')])

constructing model...Initializing with random weights done
CustomDataParallel(
  (module): AE(
    (encoding): ConvAEEncoder(
      (encoder): ModuleList(
        (zero_pad0): ZeroPad2d((1, 2, 1, 2))
        (conv0): Conv2d(2, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu0): LeakyReLU(negative_slope=0.05)
        (zero_pad1): ZeroPad2d((1, 2, 1, 2))
        (conv1): Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (zero_pad2): ZeroPad2d((1, 2, 1, 2))
        (conv2): Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (zero_pad3): ZeroPad2d((1, 2, 1, 2))
        (conv3): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (zero_pad4): ZeroPad2d((1, 1, 0, 1))
        (conv4): Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
        (relu4): LeakyReLU(negative_slope=0.05)
      )
      (FF): Linear(in_features=8192, out_features=16, bias=True)
    )
    (decoding): ConvAEDecoder(
      (FF): Linear(in_features=16, out_features=8192, bias=True)
      (decoder): ModuleList(
        (convtranspose0): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
        (relu0): LeakyReLU(negative_slope=0.05)
        (convtranspose1): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
        (relu1): LeakyReLU(negative_slope=0.05)
        (convtranspose2): ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2))
        (relu2): LeakyReLU(negative_slope=0.05)
        (convtranspose3): ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2))
        (relu3): LeakyReLU(negative_slope=0.05)
        (convtranspose4): ConvTranspose2d(32, 2, kernel_size=(5, 5), stride=(2, 2))
        (sigmoid4): Sigmoid()
      )
    )
  )
)
epoch 0000/1000:   0%| | 0/256 [00:09<?, ?it/s]

Caught exception in worker thread
CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/test_tube/argparse_hopt.py", line 39, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "/behavenet/behavenet/fitting/ae_grid_search.py", line 112, in main
    fit(hparams, model, data_generator, exp, method='ae')
  File "/root/capsule/behavenet/behavenet/fitting/training.py", line 347, in fit
    loss_dict = model.loss(data, dataset=dataset, accumulate_grad=True)
  File "/root/capsule/behavenet/behavenet/models/aes.py", line 766, in loss
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 536.00 MiB (GPU 0; 7.43 GiB total capacity; 5.41 GiB already allocated; 505.19 MiB free; 6.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

---
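For context: assuming behavenet's CustomDataParallel behaves like torch.nn.DataParallel, this asymmetric OOM is expected behavior rather than the other GPUs being invisible. The batch is scattered across the replicas, but the outputs are gathered back onto the first device and the loss/backward bookkeeping also lives there, so GPU 0 needs noticeably more memory than GPUs 1-3 and can run out even while the others sit nearly empty. A minimal sketch that makes the asymmetry visible, assuming four visible GPUs and using a toy stand-in for the conv AE (not behavenet's actual model):

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the behavenet conv autoencoder,
# just to illustrate how nn.DataParallel distributes memory across devices.
model = nn.Sequential(
    nn.Conv2d(2, 32, kernel_size=5, stride=2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 16),
)
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).to("cuda:0")

# The full batch is allocated on GPU 0 before being scattered to the replicas.
x = torch.randn(128, 2, 304, 288, device="cuda:0")
loss = model(x).pow(2).mean()   # outputs are gathered back onto GPU 0
loss.backward()                 # gradients are reduced onto GPU 0 as well

for i in range(torch.cuda.device_count()):
    mib = torch.cuda.max_memory_allocated(i) / 1024 ** 2
    print(f"GPU {i}: peak allocated {mib:.0f} MiB")
```

If GPU 0's peak is much higher than the rest here as well, then adding more GPUs mainly increases throughput; the largest batch that fits is still bounded by what device 0 can hold.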

It seems that the code is not recognizing all four GPUs and is unable to utilize their capacity. In my troubleshooting efforts, I've explored the following steps:

GPU 0: Tesla M60, 7.982743552 GB
GPU 1: Tesla M60, 7.982743552 GB
GPU 2: Tesla M60, 7.982743552 GB
GPU 3: Tesla M60, 7.982743552 GB

PyTorch Version: 1.12.1+cu116
CUDA Version: 11.6

Wed Nov 22 11:47:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   29C    P8    16W / 150W |      0MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   27C    P0    38W / 150W |      0MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P0    38W / 150W |      0MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P0    39W / 150W |      0MiB /  7680MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
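A quick sanity check, independent of behavenet, is to confirm that the PyTorch process itself actually sees all four devices, since CUDA_VISIBLE_DEVICES or a container setting can hide GPUs even when nvidia-smi lists them. A small sketch:

```python
import torch

# Report the GPUs visible to this Python process and their memory capacity.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  {i}: {props.name}, {props.total_memory / 1024 ** 3:.2f} GiB")
```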

In the current run, I added the second camera view and kept the frame size the same as in the original data (304 x 288). It seems that the code cannot see the other GPUs, or at least does not make use of their memory.
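Two generic workarounds for the OOM itself, neither specific to behavenet and untested here: reduce approx_batch_size in the data config so each per-GPU slice of the batch is smaller, or try the allocator option that the error message suggests. The environment variable has to be set before CUDA is initialized in the process; the 128 MiB value below is just an illustrative guess:

```python
import os

# Must be set before the first CUDA call in the process; the value is an
# assumption, tune it (or drop the setting) based on observed fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var on purpose)
print(torch.cuda.is_available())
```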

================== Integration Test Results ==================

ae: passed
arhmm: passed
neural-ae: passed
neural-ae-me: passed
neural-labels: passed
neural-arhmm: passed
ae-multisession: passed
vae: passed
beta-tcvae: passed
cond-ae-msp: passed
cond-vae: passed
ps-vae: passed
msps-vae-multisession: passed
labels-images: passed

total time to perform integration test: 195.396645 sec


Despite these efforts, the issue persists. I would greatly appreciate any insights or suggestions you may have. Thank you!

themattinthehatt commented 7 months ago

Hi @faezeamin, I have not tried the multi-GPU training in several years - I can test this out on my end after the Thanksgiving break and get back to you. In the meantime, is it possible for you to request a GPU with more memory from Code Ocean?

faezeamin commented 7 months ago

Thank you for your prompt response! Yes, the model runs fine on a single GPU with 15.65 GB of memory, but I'm interested in exploring the possibility of faster run times using multiple GPUs, if feasible.

themattinthehatt commented 6 months ago

@faezeamin sorry for not getting to this yet, haven't forgotten about it though

faezeamin commented 4 months ago

Hi @themattinthehatt - just following up on this issue. Have you had a chance to look into the multi-GPU analysis? Thanks, -Faeze