open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

How much system ram is required per gpu for interhand3d dataset? #672

Open pablovela5620 opened 3 years ago

pablovela5620 commented 3 years ago

Looking at the log provided, it looks like 8 Titan X GPUs were used to train on the InterHand dataset with a batch size of 16 and 2 workers per GPU.

The full InterHand dataset is pretty massive (over 1 million images), and my understanding is that each worker on each GPU loads the entire dataset into system RAM (not GPU VRAM), so even with, let's say, 128 GB, 8 GPUs * 2 workers adds up to a HUGE amount of system RAM. Am I understanding this correctly? I haven't had a chance to test yet.

How much system RAM did the machine used for training have? It seems super difficult to retrain on a multi-GPU system without a really significant amount of system RAM (>256 GB?).
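To put the concern in concrete terms, the worst case I'm imagining is every training process plus each of its dataloader workers holding its own copy of whatever gets loaded; the per-copy figure below is a made-up placeholder, not a measurement.

# Back-of-envelope for the worst case: each GPU process and each of its
# dataloader workers holds a full copy of whatever is loaded into system RAM.
num_gpus = 8
workers_per_gpu = 2
ram_per_copy_gb = 30  # placeholder footprint of one loaded copy, in GB

total_gb = num_gpus * (1 + workers_per_gpu) * ram_per_copy_gb
print(f"worst-case system RAM: ~{total_gb} GB")  # ~720 GB with these numbers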

innerlee commented 3 years ago

one loads up the entire dataset into system ram

this is not the case

pablovela5620 commented 3 years ago

Understood. So I had a chance to try to train the model using the provided config. I'm using a machine with 128 GB of RAM and 2 A6000 GPUs.

When I run on a single GPU using python tools/train.py configs/hand3d/InterNet/interhand3d/res50_interhand3d_all_256x256.py, it uses about 30 GB of RAM to load and train the network. The reason I assumed the entire dataset was loaded into system RAM is the much larger amount of RAM used with distributed training.

After running tools/dist_train.sh I have the following problem.

This is using the provided config with dist_train, only changing the number of GPUs and workers.
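Concretely, the changes relative to the stock config amount to something like this (a sketch with the values from my 2-GPU run; everything else in res50_interhand3d_all_256x256.py is left as provided):

# Launched with:
#   bash tools/dist_train.sh configs/hand3d/InterNet/interhand3d/res50_interhand3d_all_256x256.py 2
# Only the data-loading knobs in the config were touched:
data = dict(
    samples_per_gpu=16,   # per-GPU batch size, left as provided
    workers_per_gpu=2,    # dataloader workers per GPU, the value I varied
)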

So with this testing, I had the following questions:

  1. How do I manage the amount of RAM used without sacrificing the number of workers?
  2. Is this a typical amount of RAM for this dataset?
  3. What if I want to use the 30 FPS version of the dataset (13 million images vs. 1.3 million, so around 10 times larger)? My guess is this would increase the amount of RAM needed by a TON.

I really appreciate the help!

ly015 commented 3 years ago

@zengwang430521 Could you please check this issue?

zengwang430521 commented 3 years ago

Hi @pablovela5620. We load all annotations into memory before training, and this costs a lot of memory. So if you find memory insufficient, you can use fewer workers. We're afraid our implementation may not be suitable for the 30-fps version for now, because it's too massive.
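For reference, the knob being referred to is workers_per_gpu in the data section of the config; a minimal sketch of the change (all other fields left as they are):

# Fewer dataloader workers per GPU means fewer processes each holding the
# loaded annotations; workers_per_gpu=0 loads data in the main process.
data = dict(
    samples_per_gpu=16,
    workers_per_gpu=1,
)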

innerlee commented 3 years ago

@zengwang430521 The implementation could be improved.

pablovela5620 commented 3 years ago

@zengwang430521 So with the current implementation, it seems like there are basically two solutions when using distributed single-node training:

  1. Reduce the number of workers (in my case I can only use 1)
  2. Buy more RAM

I did notice that distributed training with 1 GPU uses more RAM than normal training with 1 GPU (68 GB vs. ~30 GB). I'm not totally sure why; some clarity here would be appreciated.
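For anyone trying to reproduce the comparison, an ad-hoc snippet along these lines (psutil-based, nothing mmpose-specific) is enough to total up the resident memory of the training processes:

import psutil

# Sum the resident set size of every process whose command line mentions
# train.py (both the normal and the distributed launch show up this way).
total = 0
for p in psutil.process_iter(['cmdline', 'memory_info']):
    cmdline = ' '.join(p.info.get('cmdline') or [])
    if 'train.py' in cmdline and p.info.get('memory_info'):
        total += p.info['memory_info'].rss
print(f'total RSS of training processes: {total / 1024**3:.1f} GB')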

Also, how much RAM did the 8-GPU, 2-worker machine use when training on the InterHand3D dataset?

If I were to modify the dataset implementation (so that I could get it working with the 30 FPS version), it seems like it's more of a design decision spanning all of the mmpose hand datasets. I may be completely wrong here, and please correct me if I am, but the use of xtcocotools in HandBaseDataset

from xtcocotools.coco import COCO

self.coco = COCO(ann_file)
self.img_ids = self.coco.getImgIds()

basically loads the entire annotation file into memory for any dataset that depends on it. Also, looking at Interhand2D/Interhand3D and others, when calling def _get_db()

with open(self.camera_file, 'r') as f:
    cameras = json.load(f)
with open(self.joint_file, 'r') as f:
    joints = json.load(f)

is what eats up all the system memory inside the gt_db object. This seems consistent with all the other datasets as well: first the entire dataset is loaded, then the augmentation/preprocessing pipelines are run.
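Roughly, the pattern as I read it (paraphrased as a sketch, not a verbatim copy of the mmpose code) is that everything is materialized into self.db up front, and __getitem__ only indexes into it and runs the pipeline:

import copy
from torch.utils.data import Dataset

class EagerHandDatasetSketch(Dataset):
    # Paraphrase of the current design: _get_db() builds every sample dict
    # (camera params, joints, bboxes, ...) once, so all of it sits in RAM
    # for the whole run, in every dataloader worker.
    def __init__(self, ann_file, pipeline):
        self.pipeline = pipeline
        self.db = self._get_db(ann_file)  # the big up-front load

    def _get_db(self, ann_file):
        # Stand-in for the real _get_db(), which parses the COCO-style
        # annotations plus the camera/joint JSON files shown above.
        return []

    def __len__(self):
        return len(self.db)

    def __getitem__(self, idx):
        results = copy.deepcopy(self.db[idx])
        return self.pipeline(results)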

So rather than loading the entire dataset up front, would I have to override def __getitem__(self, idx): to load each sample on demand instead of all at once? Does this make sense, or are there other considerations I should be looking at, and downsides of not loading everything at once?
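Something like the sketch below is what I have in mind; the per-sample JSON layout and the class name are my own assumptions, not anything that exists in mmpose today. The obvious downside is per-item parsing overhead, so the annotations would probably have to be pre-converted into some random-access format first.

import copy
import json
import os
from torch.utils.data import Dataset

class LazyHandDatasetSketch(Dataset):
    # Hypothetical lazy variant: assumes the annotations have been pre-split
    # into one small JSON file per sample (not the stock InterHand2.6M
    # layout), so __init__ only keeps a list of file paths in RAM.
    def __init__(self, ann_dir, pipeline):
        self.pipeline = pipeline
        self.sample_files = sorted(
            os.path.join(ann_dir, name)
            for name in os.listdir(ann_dir) if name.endswith('.json'))

    def __len__(self):
        return len(self.sample_files)

    def __getitem__(self, idx):
        # Parse the per-sample annotation only when it is requested,
        # instead of building the whole gt_db in __init__.
        with open(self.sample_files[idx], 'r') as f:
            results = json.load(f)
        return self.pipeline(copy.deepcopy(results))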