twixupmysleeve closed 4 days ago
Does it work with --cpu? If so, this might be some issue with the MPS backend/rasterizer.
Hey,
I'm having a very similar issue on my MacBook Air M2 with 8 GB of RAM. Everything builds and runs, but the .ply file is empty in the viewer. However, I just tried the banana with --cpu and 500 iterations, and it worked fine, although it took longer. Seems like an MPS issue.
Same issue here, with --cpu it works!
Me too. It seems GPU support on Mac is still buggy.
Same issue on a Mac M2 with the Metal version. With the banana example, everything is OK until the 22nd iteration, when the 1457th point becomes: [ nan, nan, nan, 0. , 0. ,0. , 0.28498277, 0.28498277, 0.28498277, ...... , -1.3643292 , nan, nan, nan, nan, nan, nan, nan]. I think there is some chance that the data loses its value and becomes nan, and we should skip the iteration when that happens.
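The suggestion to skip an iteration when a nan shows up can be sketched roughly as follows. This is a minimal Python illustration, not OpenSplat's actual training loop (which is C++); `train`, `loss_values` and `apply_update` are hypothetical names introduced only for this example.

```python
import math


def train(loss_values, apply_update):
    """Run a toy training loop that skips steps with a non-finite loss.

    loss_values  -- iterable of per-step loss floats (hypothetical stand-in
                    for the value computed by a real forward pass)
    apply_update -- callable(step, loss) standing in for backward + optimizer
    Returns the list of step indices that were skipped.
    """
    skipped = []
    for step, loss in enumerate(loss_values):
        # math.isfinite is False for nan, +inf and -inf
        if not math.isfinite(loss):
            skipped.append(step)
            continue  # do not propagate a bad loss into the parameters
        apply_update(step, loss)
    return skipped
```

Skipping only guards against a bad loss value. If a point's parameters have already become nan, as in the report above, they stay nan unless they are also reset or filtered out.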
I've been trying to train on an M1 Max using the MPS GPU build options. Using the banana dataset with n=2000, the program outputs a NaN at some point. After this the training goes downhill and produces results with heavy artifacts.
Sometimes it encounters a NaN and crashes immediately with the following message:
Step 390: 0.109648
Step 400: 0.124691
Step 410: nan
element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/autograd.cpp:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 52 (0x100a8ecbc in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 92 (0x100a8b8dc in libc10.dylib)
frame #2: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, bool, bool) + 1228 (0x10f945290 in libtorch_cpu.dylib)
frame #3: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&, std::__1::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor>> const&) + 96 (0x10f944628 in libtorch_cpu.dylib)
frame #4: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, std::__1::optional<at::Tensor> const&, std::__1::optional<bool>, bool) const + 384 (0x10f995674 in libtorch_cpu.dylib)
frame #5: at::Tensor::backward(at::Tensor const&, std::__1::optional<bool>, bool, std::__1::optional<c10::ArrayRef<at::Tensor>>) const + 248 (0x1007b6194 in opensplat)
frame #6: main + 16752 (0x1007b25b0 in opensplat)
frame #7: start + 2840 (0x1996f0274 in dyld)
The problem is narrowed down to gsplat_metal.metal, where the calculation sometimes produces nan. I am not familiar with Metal programming, but "1.f / (1.f - alpha)" is the prime suspect, since it blows up as alpha approaches 1. I added "alpha < 0.99f" on line 962, and the banana example now produces a correct ply file (n=1000).
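To see why that expression is a plausible culprit: 1/(1 - alpha) grows without bound as alpha approaches 1, and in IEEE 754 arithmetic infinity times zero is nan. The Python sketch below models this under stated assumptions; it is not the Metal kernel. The function names, the idea that the factor multiplies a transmittance-like value T, and leaving T unchanged when the guard skips are all assumptions made for illustration. Python raises ZeroDivisionError on float division by zero, so the infinite result is modelled explicitly.

```python
import math

ALPHA_MAX = 0.99  # threshold used in the reported fix


def inv_one_minus_alpha(alpha):
    """Return 1 / (1 - alpha), modelling IEEE 754 division by zero as +inf."""
    denom = 1.0 - alpha
    # Hardware floats give inf here; Python would raise ZeroDivisionError.
    return math.inf if denom == 0.0 else 1.0 / denom


def unguarded_step(T, alpha):
    """Scale T by 1 / (1 - alpha) with no check on alpha (assumed usage)."""
    return T * inv_one_minus_alpha(alpha)


def guarded_step(T, alpha, alpha_max=ALPHA_MAX):
    """Apply the same scaling only when alpha is safely below alpha_max."""
    if alpha < alpha_max:
        return T * inv_one_minus_alpha(alpha)
    return T  # assumption: a skipped contribution leaves T unchanged


if __name__ == "__main__":
    for a in (0.5, 0.99, 0.999999, 1.0):
        print(a, unguarded_step(0.0, a), guarded_step(0.0, a))
```

With T = 0 and alpha = 1, the unguarded version yields nan (0 times inf), while the guarded version returns 0. For alpha well below the threshold, both versions agree.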
I changed line 962 in OpenSplat/rasterizer/gsplat-metal/gsplat_metal.metal to:
if(valid && alpha<0.99f){
Now, running ./opensplat my/path/to/banana -n 2000 looks much better!
Thank you for your help!
That's awesome @zctu ! Thanks for sharing your findings.
Would you be interested in opening a PR to fix this? 🙏
Glad to do it :) The PR is https://github.com/pierotofy/OpenSplat/pull/139. This is the first time I have submitted a PR, and I am not sure whether I did it properly.
You did, thanks!
I followed the instructions to build and downloaded the banana folder to test it out. It steps through 0 to 2000, but the final ply model is empty. I see that when the functions are initially loaded, it returns null.
First, it successfully loads the images:
It also seems to successfully load the libraries right after this, but then the load functions keep returning null:
cameras.json is not empty but splat.ply is completely empty and doesn't render online or even on the Mac Viewer. I have a MacBook Pro with M1 Pro 16 GB memory, running Sonoma 14.0