simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

[BUG] multi gpu training without --single_gpu #19

Open menatallh opened 3 years ago

menatallh commented 3 years ago

Describe the bug: Multi-GPU training fails when I remove the --single_gpu flag.

Expected behavior: it detects all available GPUs.
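Since the original screenshot is gone, here is a minimal sketch of that check in plain PyTorch (standard torch.cuda calls, nothing specific to this repository):

```python
import os
import torch

# Expected behavior: PyTorch should see every GPU on the machine
# when --single_gpu is not passed.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible to PyTorch:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```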

Screenshots: (screenshot, not recoverable)

System Info: (not provided)


simon-ging commented 3 years ago

If you have solved it, please consider posting your fix for others.

menggehe commented 3 years ago

Did you solve this problem?

simon-ging commented 3 years ago

Does it still happen? If yes, please post a complete bug report: the exact command you ran, the complete error message, the output of the system command "nvidia-smi", and your OS / Python / PyTorch versions. Then I will look into it.
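For reference, the Python/PyTorch part of that information can be printed with standard library calls (this snippet is illustrative and not part of the repository):

```python
import platform
import sys

import torch

# Print the environment details requested above.
print("OS:", platform.platform())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("GPU count:", torch.cuda.device_count())
```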

menggehe commented 3 years ago

Command: (screenshot)

Error message: (screenshots)

Output of "nvidia-smi": (screenshot)

System Info:
- OS: Ubuntu 18.04
- Python: 3.8.5
- PyTorch: 1.8.1

menggehe commented 3 years ago

I changed some code in utils_torch.py (the before/after screenshots are no longer available).

But the model still uses only one GPU, device 0.
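The exact change is unknown since the screenshots are lost, but for context, the standard way to spread a model over all visible GPUs in plain PyTorch is torch.nn.DataParallel. The sketch below is a generic illustration, not the repository's actual utils_torch.py code:

```python
import torch
from torch import nn

# Generic multi-GPU sketch (not the actual coot-videotext code):
# nn.DataParallel replicates the module on every visible GPU and splits the
# input batch along dim 0, so each GPU gets batch_size / num_gpus samples.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # replicas on cuda:0 .. cuda:N-1
model = model.to("cuda")                 # parameters are kept on cuda:0

x = torch.randn(64, 512, device="cuda")  # the full batch starts on cuda:0
out = model(x)                           # forward pass is split across GPUs
print(out.shape, "-", torch.cuda.device_count(), "GPU(s) visible")
```

If only device 0 shows activity after a change like this, common causes are CUDA_VISIBLE_DEVICES limiting visibility, the wrapper never being applied before the .to("cuda") call, or a batch size too small for the split to show up in utilization.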

simon-ging commented 3 years ago

I will check this problem; it should be possible to train on multiple GPUs. That said, unless you increase the model size or batch size, a single 12 GB GPU is more than enough to train the retrieval models.