seoulsky-field / CXRAIL-dev


Partially resolved: Feature/#54 #59

Closed · kdg1993 closed this 1 year ago

kdg1993 commented 1 year ago

Motivation 🤔

Soon, we will need multiple GPUs for experiments. Also, it is not certain that Ray Tune automatically supports parallel GPU usage. Thus, we need to implement multi-GPU support and test whether it functions correctly. (For a detailed description of the motivation and implementation direction, please see #54.)

Key Changes 🔑

To Reviewers 🙏

resolves: #54
references:

seoulsky-field commented 1 year ago

Looks good to me! I think it will be helpful to merge this as soon as possible! 👍 By the way, if you use the device code as changed here, could you change `with torch.autocast(device_type=str(device).split(":")[0]):` to `with torch.autocast(device_type=str(device)):`? When I implemented it, the device printed as "cuda:0", so I wrote it with the split!
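For reference, a minimal, self-contained sketch of the behavior under discussion (not code from this PR; it assumes a standard PyTorch setup): `str(device)` yields `"cuda:0"` for an indexed device, while `torch.autocast` accepts only the bare device type such as `"cuda"` or `"cpu"`, which is why the `split(":")` was needed.

```python
import torch

# On a GPU machine, str(device) is "cuda:0", but torch.autocast only
# accepts a bare device type such as "cuda" or "cpu".
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

with torch.autocast(device_type=str(device).split(":")[0]):
    x = torch.randn(8, 8, device=device)
    y = x @ x  # the matmul runs in reduced precision inside the autocast region
```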

kdg1993 commented 1 year ago

Thanks for your quick response and sharp questions 👍

  1. I am sorry for forgetting to add the link to the tested results 😭 (https://wandb.ai/snuh_interns/multi_gpu_test?workspace=user-snuh_interns)
    It works well for all settings (at least, all four settings show work being distributed across the GPUs).

  2. It automatically adapts to the number of GPUs. For example, if there is only one GPU, the behavior is the same as not using nn.DataParallel. If there are multiple GPUs but we want to use only one or some of them, nn.DataParallel supports selecting specific devices (see the sketch below). However, that needs more analysis because of Ray Tune's resource assignment, so implementing the selection option will take more time.
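As a reference for the selection option mentioned in point 2, a minimal sketch (the model here is a hypothetical stand-in, not the code in this PR) of how `nn.DataParallel` can restrict itself to a subset of GPUs:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # hypothetical stand-in for the project's model

if torch.cuda.device_count() > 1:
    # device_ids limits DataParallel to the listed GPUs;
    # leaving it out replicates across all visible GPUs.
    model = nn.DataParallel(model, device_ids=[0, 1])

# The wrapped model (and its inputs) live on the first listed device.
model = model.to("cuda:0" if torch.cuda.is_available() else "cpu")
```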

The commands used for the tested results in answer no. 1 are below.


Single run w/o ray

```bash
python main.py \
  Dataset.train_size=3000 \
  use_amp=False \
  project_name='multi_gpu_test' \
  logging.setup.name='single_wo_ray' \
  hparams_search=none
```

Single run w/ ray

```bash
python main.py \
  Dataset.train_size=3000 \
  use_amp=False \
  project_name='multi_gpu_test' \
  logging.setup.name='single_w_ray' \
  hparams_search=raytune
```

Multi run w/o ray

```bash
python main.py --multirun \
  Dataset=CheXpert,MIMIC \
  Dataset.train_size=3000 \
  use_amp=False \
  project_name='multi_gpu_test' \
  logging.setup.name='multi_wo_ray' \
  hparams_search=none
```

Multi run w/ ray

```bash
python main.py --multirun \
  Dataset=CheXpert,MIMIC \
  Dataset.train_size=3000 \
  use_amp=False \
  project_name='multi_gpu_test' \
  logging.setup.name='multi_w_ray' \
  hparams_search=raytune
```
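For context on the Ray Tune resource assignment mentioned in answer no. 2, a minimal sketch (the trainable and search space are hypothetical, not this repo's code) of how each trial requests GPUs; Ray sets `CUDA_VISIBLE_DEVICES` per trial based on this:

```python
from ray import tune

def trainable(config):
    # Training loop placeholder; each trial only sees the GPUs
    # Ray assigned to it via CUDA_VISIBLE_DEVICES.
    ...

tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-5, 1e-2)},
    resources_per_trial={"cpu": 4, "gpu": 1},
)
```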

kdg1993 commented 1 year ago

Thanks, @seoulsky-field, for pointing out what I could not think of changing!

I think I understand it thanks to your precise description, and I will change it without a separate PR.