lavis-nlp / jerex

PyTorch code for JEREX: Joint Entity-Level Relation Extractor
MIT License

Multi-GPU training #17

Closed AvivSham closed 1 year ago

AvivSham commented 1 year ago

Hi all, I'm trying to run multi-GPU training with the following command:

  CUDA_VISIBLE_DEVICES=1,3 python ./jerex_train.py --config-path configs/docred_joint

After the run is launched I see memory being allocated on device 0 (i.e. physical GPU 1), but not on device 1. I have tried with batch_size > 1 as well.

I guess some modifications are needed in the cfg file, specifically in the following section:

distribution:
  gpus: [0, 1]
  accelerator: ''
  prepare_data_per_node: false

How can this be solved? My environment matches your requirements.txt.

Thanks.
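(As an aside: CUDA_VISIBLE_DEVICES re-indexes the visible GPUs, so with CUDA_VISIBLE_DEVICES=1,3 physical GPU 1 appears to PyTorch as cuda:0 and physical GPU 3 as cuda:1, which is why gpus: [0, 1] is the right setting here. A minimal check of what the process can actually see:

  import torch

  # With CUDA_VISIBLE_DEVICES=1,3 set before launch, PyTorch re-indexes
  # the visible GPUs: physical GPU 1 becomes cuda:0, physical GPU 3 becomes cuda:1.
  print(torch.cuda.device_count())  # expected: 2
  for i in range(torch.cuda.device_count()):
      print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
)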

AvivSham commented 1 year ago

Problem solved by setting accelerator: 'dp' in the config.
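For reference, the resolved distribution section would then presumably read as follows (the snippet from above with only the accelerator value changed; 'dp' selects PyTorch Lightning's DataParallel mode, which splits each batch across the listed GPUs within a single process):

distribution:
  gpus: [0, 1]
  accelerator: 'dp'
  prepare_data_per_node: false

Note that Lightning generally recommends 'ddp' (DistributedDataParallel) for multi-GPU training; 'dp' is simply what was confirmed to work here.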