nyu-systems / Grendel-GS

Ongoing research: training Gaussian splatting at scale with a distributed system

multigpu training output #6

Closed. 936384885xy closed this issue 1 week ago.

936384885xy commented 1 week ago

Hi, I managed to train my dataset on two RTX 4090s with the following command:

CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc-per-node=2 train.py --bsz 2 -s data/test9 -m output/t3

I found two .ply files in the point_cloud folder (point_cloud_rk0_ws2.ply and point_cloud_rk1_ws2.ply) after 7000 iterations. I then used SIBR_gaussianViewer_app.exe from the original 3DGS project to visualize the two models by renaming each to point_cloud.ply. I noticed that both of them cover the whole area, yet differ slightly in detail. I also trained the same dataset on a single GPU, and that output differs as well. What is the difference between the two .ply files produced by multi-GPU training? Do I need some postprocessing to fuse them? Here are the results:

[Image 1] [Image 2] [Image 3]

The first picture is from single-GPU training; the other two are from multi-GPU training.

TarzanZhao commented 1 week ago

Hi, when we do multi-GPU training, each GPU keeps 1/2 of all Gaussians. When we save them to disk, each GPU saves its Gaussians separately. For example, point_cloud_rk0_ws2.ply means rank=0, world size=2 (we have 2 GPUs, and this point cloud is from GPU 0). If you want to visualize all Gaussians from all GPUs, you can concatenate these files by following this: https://github.com/EGalahad/GaussFusion?tab=readme-ov-file#gaussian-splatting-training
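
If you just need a single file for the viewer, a minimal sketch along these lines should also work. Note this is not part of Grendel-GS: it assumes the plyfile package (pip install plyfile) and that both shards store their Gaussians as rows of a 'vertex' element with identical fields, which is the standard 3DGS export layout. The GaussFusion repo linked above handles this more completely.

```python
# Minimal sketch: concatenate per-rank Gaussian shards into one .ply.
# Assumes every shard uses the same 'vertex' element schema.
import numpy as np
from plyfile import PlyData, PlyElement

def merge_shards(shard_paths, out_path):
    # Each row of the 'vertex' element is one Gaussian (position, SH
    # coefficients, opacity, scale, rotation), so merging is a row-wise
    # concatenation of the structured arrays.
    rows = [PlyData.read(p)["vertex"].data for p in shard_paths]
    merged = PlyElement.describe(np.concatenate(rows), "vertex")
    PlyData([merged]).write(out_path)

merge_shards(
    ["point_cloud_rk0_ws2.ply", "point_cloud_rk1_ws2.ply"],
    "point_cloud.ply",  # the name the original 3DGS viewer expects
)
```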

936384885xy commented 1 week ago

Thanks for your reply, I'll give it a try.