nyu-systems / Grendel-GS

Ongoing research: training Gaussian splatting at scale with a distributed system
Apache License 2.0

Extra Memory Taken on GPU0 #22

Closed · MaureenZOU closed this 2 months ago

MaureenZOU commented 3 months ago

Has anyone run into the case where GPU 0 holds several extra fixed memory allocations that the other GPUs don't? e.g. xueyanz (730M) xueyanz (730M) xueyanz (730M) xueyanz (730M) ...

Thanks!

TarzanZhao commented 2 months ago

Hi, do you mean GPU-0 takes up 730MB more memory than other GPUs?

TarzanZhao commented 2 months ago

Could you please provide more details about the issue? Then we can figure it out together :)

MaureenZOU commented 2 months ago

So it is the following case:

GPU0 memory: 2000M + 730M + 730M + 730M
GPU1 memory: 2000M
GPU2 memory: 2000M
GPU3 memory: 2000M

This looks like some communication issue, where GPU 0 holds constant redundant memory.

Thanks for the reply!
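
One way to narrow this down (not something from the thread, just a generic diagnostic sketch) is to compare what each rank's allocator reports against the per-process numbers in `nvidia-smi`; the helper name `report_memory` below is hypothetical:

```python
# Hypothetical diagnostic helper (not from this thread or the repo):
# print how much CUDA memory this rank's allocator is actually using,
# so it can be compared against the per-process numbers in nvidia-smi.
import torch
import torch.distributed as dist

def report_memory(tag: str = "") -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[rank {rank}] {tag} cuda:{device} "
          f"allocated={allocated:.0f} MiB reserved={reserved:.0f} MiB")
```

Memory that `nvidia-smi` attributes to a process on GPU 0 but that never shows up in `memory_allocated`/`memory_reserved` is usually a CUDA context or a stray process rather than tensors, which would match fixed ~730M slots like those described above.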

TarzanZhao commented 2 months ago

Hi, by default, args.distributed_dataset_storage is true. This flag makes GPU-0 save/load all images and then broadcast them to the other GPUs, which causes extra GPU memory consumption on GPU-0. However, I did not expect that to be as large as 3 × 730MB.
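
For reference, here is a minimal sketch of the load-on-rank-0-and-broadcast pattern that flag describes; this is not the actual Grendel-GS implementation, and `load_fn`/`image_shape` are hypothetical stand-ins for the real data loader:

```python
# Illustrative sketch only (not the actual Grendel-GS code) of the pattern
# args.distributed_dataset_storage describes: rank 0 loads the images and
# broadcasts them, so rank 0 transiently holds extra copies on its GPU.
import torch
import torch.distributed as dist

def load_on_rank0_and_broadcast(image_shape, load_fn, device):
    if dist.get_rank() == 0:
        images = load_fn().to(device)                     # full data on GPU 0 first
    else:
        images = torch.empty(image_shape, device=device)  # receive buffer
    dist.broadcast(images, src=0)                         # every rank gets a copy
    return images
```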

TarzanZhao commented 2 months ago

@MaureenZOU We can schedule a meeting to work through this issue together if it is important to you. My email is hz3496@nyu.edu.

MaureenZOU commented 2 months ago

Thanks so much for your helpful reply! I will check whether it is caused by args.distributed_dataset_storage. I have only tried the standard dataset so far, so this isn't really a problem for now :) Great work, BTW!

TarzanZhao commented 2 months ago

OK. Thank you for your kind words. Let me know if you need any further discussion or help!

MaureenZOU commented 2 months ago

This turned out to be caused by a mysterious conda environment bug that I haven't fixed yet: after torch.distributed init, those zombie processes naturally appear. It is not related to this great codebase.
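
For anyone who hits a similar pattern: one common cause of fixed extra per-process allocations on GPU 0 (which may or may not be what this particular conda environment bug triggers) is ranks touching device 0 before binding to their own device. A hedged sketch of the usual guard:

```python
# Hedged sketch of a common guard against stray CUDA contexts on GPU 0
# (a frequent cause of fixed extra per-process memory there; it may or may
# not be related to this conda environment bug): bind each rank to its own
# device before any CUDA work or process-group initialization.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun
torch.cuda.set_device(local_rank)                  # do this first
dist.init_process_group(backend="nccl")            # NCCL then uses the bound device
```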