pierotofy / OpenSplat

Production-grade 3D gaussian splatting with CPU/GPU support for Windows, Mac and Linux 🚀
https://antimatter15.com/splat/?url=https://splat.uav4geo.com/banana.splat
GNU Affero General Public License v3.0

loading null functions on MacOS #130

Closed: twixupmysleeve closed this issue 4 days ago

twixupmysleeve commented 2 months ago

I followed the build instructions and downloaded the banana folder to test it out. Training steps from 0 to 2000, but the final .ply model is empty. I also see that the initial function loading returns null.

First, it successfully loads the images:

❯ ./opensplat /Users/pats/Downloads/banana -n 2000
Using MPS
Reading 14241 points
Loading /Users/pats/Downloads/banana/images/frame_00001.JPGLoading
/Users/pats/Downloads/banana/images/frame_00003.JPG
Loading /Users/pats/Downloads/banana/images/frame_00005.JPGLoading /Users/pats/Downloads/banana/images/frame_00008.JPG

Loading /Users/pats/Downloads/banana/images/frame_00015.JPG
Loading /Users/pats/Downloads/banana/images/frame_00010.JPG
Loading /Users/pats/Downloads/banana/images/frame_00013.JPG
Loading /Users/pats/Downloads/banana/images/frame_00014.JPG
Loading /Users/pats/Downloads/banana/images/frame_00004.JPG
Loading /Users/pats/Downloads/banana/images/frame_00002.JPG
Loading /Users/pats/Downloads/banana/images/frame_00006.JPG
Loading /Users/pats/Downloads/banana/images/frame_00016.JPG
Loading /Users/pats/Downloads/banana/images/frame_00009.JPG
Loading /Users/pats/Downloads/banana/images/frame_00011.JPG

It also seems to successfully load the libraries right after this, but then the load functions keep returning null:

init_gsplat_metal_context: loading '/Users/pats/Library/CloudStorage/OneDrive-Personal/Georgie/Polo/3DGS/OpenSplat/build/default.metallib'
init_gsplat_metal_context: loaded '/Users/pats/Library/CloudStorage/OneDrive-Personal/Georgie/Polo/3DGS/OpenSplat/build/default.metallib', functions: compute_cov2d_bounds_kernel, project_gaussians_backward_kernel, get_tile_bin_edges_kernel, rasterize_backward_kernel, map_gaussian_to_intersects_kernel, nd_rasterize_backward_kernel, project_gaussians_forward_kernel, compute_sh_backward_kernel, compute_sh_forward_kernel, nd_rasterize_forward_kernel
init_gsplat_metal_context: load function nd_rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: load function nd_rasterize_forward_kernel with label: (null)
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: load function project_gaussians_forward_kernel with label: (null)
init_gsplat_metal_context: load function project_gaussians_backward_kernel with label: (null)
init_gsplat_metal_context: load function compute_sh_forward_kernel with label: (null)
init_gsplat_metal_context: load function compute_sh_backward_kernel with label: (null)
init_gsplat_metal_context: load function compute_cov2d_bounds_kernel with label: (null)
init_gsplat_metal_context: load function map_gaussian_to_intersects_kernel with label: (null)
init_gsplat_metal_context: load function get_tile_bin_edges_kernel with label: (null)
Step 10: 0.208454
.
.
.

cameras.json is not empty, but splat.ply is completely empty and doesn't render online or even in the Mac viewer. I have a MacBook Pro (M1 Pro, 16 GB memory) running macOS Sonoma 14.0.

pierotofy commented 2 months ago

Does it work with --cpu? If so, this might be some issue with the MPS backend/rasterizer.

luizgbraga commented 2 months ago

Hey,

I'm having a very similar issue on my MacBook Air M2 with 8 GB RAM. Everything builds and runs, but the .ply file is empty in the viewer. I just tried the banana dataset with --cpu and 500 iterations, and it worked fine, although it took longer. Seems like an MPS issue.

xipherx commented 1 month ago

Same issue here, with --cpu it works!

ouceduxzk commented 1 month ago

Me too. It seems like GPU support on Mac is still buggy.

zctu commented 4 weeks ago

Same issue on a Mac M2 with the Metal build. In the banana example, everything is fine until the 22nd iteration, when the 1457th point becomes: [ nan, nan, nan, 0. , 0. ,0. , 0.28498277, 0.28498277, 0.28498277, ...... , -1.3643292 , nan, nan, nan, nan, nan, nan, nan]. I think there is some chance that a value gets corrupted and becomes NaN, and we should skip the iteration when that happens.
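
The skip-on-NaN idea above could look roughly like the following. This is a minimal C++ sketch using only the standard library, not OpenSplat's code: the helper names allFinite and applyIfFinite, and the flat float buffer standing in for the Gaussian parameters, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of OpenSplat): returns true when every
// value in a flat parameter buffer is finite (neither NaN nor +/-inf).
bool allFinite(const std::vector<float>& values) {
    for (float v : values) {
        if (!std::isfinite(v)) return false;
    }
    return true;
}

// Hypothetical update step: accept the candidate parameters only when
// all of them are finite. Otherwise keep the previous parameters and
// return false so the caller can treat the iteration as skipped.
bool applyIfFinite(std::vector<float>& params,
                   const std::vector<float>& candidate) {
    assert(params.size() == candidate.size());
    if (!allFinite(candidate)) return false;  // skip this iteration
    params = candidate;
    return true;
}
```

Skipping a bad step only hides the symptom; it keeps the previous parameters intact but does not explain where the NaN comes from.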

das-ag commented 2 weeks ago

I've been trying to train on an M1 Max using the MPS GPU build options. Using the banana dataset with n=2000, the program outputs a NaN at some point. After that, training goes downhill and produces results with heavy artifacts.

Sometimes it encounters a nan and crashes immediately with the following message:

Step 390: 0.109648
Step 400: 0.124691
Step 410: nan
element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/autograd.cpp:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 52 (0x100a8ecbc in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 92 (0x100a8b8dc in libc10.dylib)
frame #2: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool) + 1228 (0x10f945290 in libtorch_cpu.dylib)
frame #3: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&) + 96 (0x10f944628 in libtorch_cpu.dylib)
frame #4: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, std::__1::optional<at::Tensor> const&, std::__1::optional<bool>, bool) const + 384 (0x10f995674 in libtorch_cpu.dylib)
frame #5: at::Tensor::backward(at::Tensor const&, std::__1::optional<bool>, bool, std::__1::optional<c10::ArrayRef<at::Tensor>>) const + 248 (0x1007b6194 in opensplat)
frame #6: main + 16752 (0x1007b25b0 in opensplat)
frame #7: start + 2840 (0x1996f0274 in dyld)

zctu commented 5 days ago

The problem is narrowed down to gsplat_metal.metal, where the calculation sometimes produces NaN. I am not familiar with Metal programming, but "1.f / (1.f - alpha)" is highly suspect. I added "alpha < 0.99f" at line 962, and the banana example now produces a correct .ply file (n=1000).
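
To illustrate why the quoted expression is a plausible culprit, here is a small C++ sketch of the arithmetic, written for this thread rather than taken from gsplat_metal.metal. The function names are hypothetical. As alpha approaches 1, the reciprocal of (1 - alpha) grows without bound, and under IEEE 754 arithmetic an infinite value multiplied by zero yields NaN. Whether that is the exact path taken inside the kernel is an assumption; the guard simply mirrors the "alpha < 0.99f" check described above.

```cpp
#include <cmath>

// Hypothetical illustration, not the actual Metal kernel code.
// The quoted expression: blows up as alpha approaches 1.
float reciprocalOneMinusAlpha(float alpha) {
    return 1.f / (1.f - alpha);
}

// Guarded variant mirroring the `alpha < 0.99f` check from the thread.
// Returns false and leaves `out` untouched when alpha is at or above the
// threshold. A NaN alpha also fails the comparison and is rejected.
bool guardedReciprocal(float alpha, float& out) {
    if (!(alpha < 0.99f)) return false;
    out = 1.f / (1.f - alpha);
    return true;
}
```

With the guard in place, the denominator is always greater than 0.01, so the reciprocal stays below 100 for any alpha in [0, 0.99).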

das-ag commented 5 days ago

I changed line 962 in OpenSplat/rasterizer/gsplat-metal/gsplat_metal.metal to: if(valid && alpha<0.99f){. Now, running ./opensplat my/path/to/banana -n 2000 looks much better! Thank you for your help!

pierotofy commented 4 days ago

That's awesome @zctu ! Thanks for sharing your findings.

Would you be interested in opening a PR to fix this? 🙏

zctu commented 4 days ago

Glad to do it :) The PR is https://github.com/pierotofy/OpenSplat/pull/139. This is my first time submitting a PR, so I am not sure whether I did it properly.

pierotofy commented 4 days ago

You did, thanks!