lfranke / TRIPS

https://lfranke.github.io/trips/
MIT License

Training Killed #28

Open tsugg opened 7 months ago

tsugg commented 7 months ago

Hi,

I'm trying to get this running on a remote Linux machine with an A10 GPU. I built in headless mode. The build looked fine and colmap2adop looked fine, but training gets killed. Below is the end of the log output, ending with the 'Killed' message. I tried lowering the batch size, render size, and crop size in case it was a GPU memory issue, but even with low values the process is still killed. Nothing is written to errors.txt in the experiment directory either. Any ideas on what could be happening? I even tried using Docker and got the same message. Thanks.

CAM model: CameraModel::PINHOLE_DISTORTION
Image Size 8831x6732 Aspect 1.31179
K 2456.38 2470.56 4415.5 3366 0
ocam 8831x6732 affine(1, 0, 0, 0, 0) cam2world() world2cam()
ocam cut 1
normalized center 0 0
dist 0 0 0 0 0 0 0 0
CAM model: CameraModel::PINHOLE_DISTORTION
Points 1931815
Colors 1
Normals 1
Avg. EV 0
Num Images 82
Num Cameras 82
Compute scene importance bounding box as 95% of points interval around center of mass
Starting Compute center of mass...center of mass:1.08572 -0.00931753 0.720447
Done in 20.1654ms.
Starting Build range vec... Done in 5.9594ms.
Starting Sort range vec... Done in 152.128ms.
Starting Extend box... Done in 23.709ms.
Box: AABB: [-4.65199 -3.8338 -5.63757 ] [7.34715 3.68846 6.95357 ]

Modulo stepsize: 8
Train(71): 1 2 3 4 5 6 7 9 10 11 12 13 14 15 17 18 19 20 21 22 23 25 26 27 28 29 30 31 33 34 35 36 37 38 39 41 42 43 44 45 46 47 49 50 51 52 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 71 73 74 75 76 77 78 79 81
Test(11): 0 8 16 24 32 40 48 56 64 72 80
Killed

tsugg commented 7 months ago

I did some more digging and found that, at the point where the train and test image indices are printed to the log, RAM usage keeps ballooning past 60 GB until the Killed message appears.
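A bare "Killed" with nothing in the program's own error log, together with host RAM (not GPU memory) growing until the process dies, is the typical signature of the Linux out-of-memory killer, which would also explain why lowering batch, render, and crop sizes made no difference. As a rough sanity check, the sketch below estimates how much host memory the 82 images at 8831x6732 would need if they were all held fully decoded. The storage formats (3-channel, 8-bit or float32) are assumptions for illustration, not something taken from the TRIPS source.

```python
# Back-of-the-envelope estimate. The in-memory formats are ASSUMPTIONS,
# not read out of the TRIPS code; only width, height and image count
# come from the log above.
WIDTH, HEIGHT = 8831, 6732   # "Image Size" from the log
NUM_IMAGES = 82              # "Num Images" from the log
CHANNELS = 3                 # assumed RGB
GB = 10 ** 9                 # decimal gigabytes


def dataset_bytes(width, height, num_images, bytes_per_channel, channels=CHANNELS):
    """Bytes needed to keep every image fully decoded in memory."""
    return width * height * channels * bytes_per_channel * num_images


uint8_gb = dataset_bytes(WIDTH, HEIGHT, NUM_IMAGES, bytes_per_channel=1) / GB
float32_gb = dataset_bytes(WIDTH, HEIGHT, NUM_IMAGES, bytes_per_channel=4) / GB

print(f"8-bit RGB:   {uint8_gb:.1f} GB")
print(f"float32 RGB: {float32_gb:.1f} GB")
```

Under these assumptions the images alone come to about 14.6 GB as 8-bit RGB and about 58.5 GB as float32 RGB, which is the same order of magnitude as the observed 60 GB. This does not show how TRIPS actually stores its data, only that images of this size can plausibly exhaust host RAM on their own.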

lfranke commented 6 months ago

Hi, it looks to me like the resolution and number of cameras are too high for this implementation. Maybe try using shared intrinsics and lowering the resolution of the cameras? I don't think I ever tried resolutions above 2.5K.
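If the images are downscaled, the pinhole intrinsics have to be scaled by the same factor. The sketch below shows that arithmetic for a target width of 2500 pixels. It assumes the `K` line in the log is ordered fx, fy, cx, cy, skew and that the principal point scales linearly with the image size; `downscale_pinhole` is a hypothetical helper, not part of TRIPS or COLMAP, and the images themselves still need to be resized to the same dimensions with a tool of your choice (or the reconstruction re-run on the resized images).

```python
def downscale_pinhole(width, height, fx, fy, cx, cy, max_width=2500):
    """Return (new_width, new_height, fx, fy, cx, cy) for a downscaled image.

    Hypothetical helper for illustration. Never upscales: images already
    narrower than max_width are returned unchanged.
    """
    scale = min(1.0, max_width / width)
    new_width = round(width * scale)
    new_height = round(height * scale)
    # Use the effective per-axis factors so rounding of the image size
    # is reflected in the intrinsics.
    sx = new_width / width
    sy = new_height / height
    return new_width, new_height, fx * sx, fy * sy, cx * sx, cy * sy


# Values from the log above (assumed order: fx fy cx cy skew).
new_w, new_h, new_fx, new_fy, new_cx, new_cy = downscale_pinhole(
    8831, 6732, 2456.38, 2470.56, 4415.5, 3366.0
)
print(new_w, new_h, new_fx, new_fy, new_cx, new_cy)
```

For the camera in this issue that gives a 2500x1906 image, roughly 12 times fewer pixels than the original 8831x6732, with the principal point staying at the image center.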