dreichCSL opened this issue 2 years ago
I have a strange problem. No matter how I revise the code (e.g., `gpu_num` in the config file), only single-GPU training is performed. As you said, training this code demands a lot of resources, so I need multi-GPU training. Could you help me find the reason? Thanks a lot!
I don't think I had an issue using multi-GPU, e.g. by simply setting `gpu_num` to 2, sorry. This might be a long shot, but maybe try modifying other settings to check that your yaml file is actually being loaded.
Thank you for your reply! The `get_loader` function in data_pipeline.py has a `distributed` parameter, which is False by default. Should it be changed to True?
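For reference, the usual PyTorch pattern that such a flag gates looks roughly like this. This is a sketch with a hypothetical signature, not the repo's actual code:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def get_loader(dataset, batch_size, distributed=False):
    # Hypothetical sketch: with distributed=True, each process
    # (one per GPU) sees a disjoint shard of the dataset.
    # Note: DistributedSampler requires torch.distributed to be
    # initialized first (init_process_group), so flipping the
    # flag alone may not be enough.
    sampler = DistributedSampler(dataset) if distributed else None
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=(sampler is None),  # shuffle and sampler are mutually exclusive
        sampler=sampler,
    )
```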
I didn't make changes to the code to use multi-GPU, it just worked right away. Things I'd look into:

- Does other multi-GPU code work in my environment?
- Are all my GPUs free to use / unoccupied?
- Did I follow all setup/install instructions of this repo exactly?

Also, if I made some code changes to the repo, maybe start from scratch again to try multi-GPU.
I ran another test and found that the model is built across multiple GPUs, but data loading only uses a single GPU. Are you seeing the same behavior?
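For anyone checking the same thing, this is the kind of snippet I mean (illustrative only, not from the repo); watching `nvidia-smi` during training works too:

```python
import torch

def log_gpu_usage(step):
    # If only cuda:0 ever shows allocated memory, the job is
    # effectively running on a single GPU.
    for i in range(torch.cuda.device_count()):
        mb = torch.cuda.memory_allocated(i) / 1024**2
        print(f"step {step}: cuda:{i} allocated {mb:.0f} MiB")
```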
Don't remember exactly. But both my gpus are processing the mini-batches during training (which takes a while to start).
How long did it take you to finish training? It takes me a month to train for one epoch on a single GPU (A100).
I think for my setup (see above) it took one day for one epoch (based on all GQA training samples). Later curriculum steps should take longer (there's progressively more data in each step). I also had an issue with batch size: GPU memory consumption was growing during training, so it might run out of memory later in training if the batch size is too large, but you won't see this in the beginning.
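If you want to catch that memory growth early, a standard PyTorch check like the following works (a sketch, not code from the repo):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one epoch of training here ...
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"peak GPU memory this epoch: {peak_mb:.0f} MiB")
# If this grows epoch over epoch, something is accumulating
# (e.g. tensors kept alive by Python references) and a large
# batch size will eventually OOM.
```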
Note: I used 1024-dim visual features, which I believe is half the size of the standard GQA features (2048-dim, if I recall correctly).
I hope the authors can address this issue, but if not:
This can maybe act as a warning to people who are thinking about trying out this code. Training a model with the given code takes an inordinate amount of resources and time. Training seems to be set up to run on at least 4 RTX A6000-level GPUs (I have access to 2 of them, and their 48 GB of GPU RAM each is not enough). When reducing the batch size to workable sizes, the model takes multiples of the time that even large-scale models like LXMERT take to train. Be warned that you'll likely need to invest considerable effort into the code to make it run efficiently.
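If you do have to shrink the per-step batch size to fit your GPUs, gradient accumulation can recover the effective batch size at the cost of wall-clock time. A generic PyTorch sketch, with a toy model and data standing in for the repo's:

```python
import torch
from torch import nn

# Toy stand-ins; in practice these are the repo's model and loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(12)]

accum_steps = 4  # effective batch = per-step batch * accum_steps
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate across iterations
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```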