lavis-nlp / jerex

PyTorch code for JEREX: Joint Entity-Level Relation Extractor
MIT License

Multi-GPU training #17

Closed AvivSham closed 1 year ago

AvivSham commented 1 year ago

Hi all, I'm trying to run multi-GPU training with the following command:

  CUDA_VISIBLE_DEVICES=1,3 python ./jerex_train.py --config-path configs/docred_joint

After the run is launched I see memory being allocated on device 0 (i.e. physical GPU 1), but not on device 1. I have tried with batch_size > 1 as well.

I guess some modifications are needed in the cfg file, specifically in the following section:

distribution:
  gpus: [0, 1]
  accelerator: ''
  prepare_data_per_node: false

How can this be solved? My environment matches your requirements.txt.

Thanks.
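(As an aside: CUDA_VISIBLE_DEVICES re-indexes the visible GPUs, so with CUDA_VISIBLE_DEVICES=1,3 physical GPU 1 appears to PyTorch as cuda:0 and physical GPU 3 as cuda:1, which is why gpus: [0, 1] is the right setting here. A minimal check of what the process can actually see:

  import torch

  # With CUDA_VISIBLE_DEVICES=1,3 set before launch, PyTorch re-indexes
  # the visible GPUs: physical GPU 1 becomes cuda:0, physical GPU 3 becomes cuda:1.
  print(torch.cuda.device_count())  # expected: 2
  for i in range(torch.cuda.device_count()):
      print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
)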

AvivSham commented 1 year ago

Problem solved by setting accelerator: 'dp' in the config.
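For reference, the resolved distribution section would then presumably read as follows (the snippet from above with only the accelerator value changed; 'dp' selects PyTorch Lightning's DataParallel mode, which splits each batch across the listed GPUs within a single process):

distribution:
  gpus: [0, 1]
  accelerator: 'dp'
  prepare_data_per_node: false

Note that Lightning generally recommends 'ddp' (DistributedDataParallel) for multi-GPU training; 'dp' is simply what was confirmed to work here.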