Open pablovela5620 opened 2 months ago
Seconding this. We have multiple Compute Capability == 6.1 GPUs that could benefit from the multi-gpu training features in gsplat v1.
Yes, gsplat>=1.0 requires a GPU with compute capability >= 7.0, due to the use of cg::labeled_partition:
https://github.com/nerfstudio-project/gsplat/blob/da0a201b8eafacb127fd8f09c56f2989b453a9ab/gsplat/cuda/csrc/world_to_cam_bwd.cu#L80-L83
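For context (my understanding, not something stated in the gsplat source): `cg::labeled_partition` splits a warp into groups of lanes that share the same key, and it lowers to the `__match_any_sync` intrinsic, which only exists on compute capability 7.0 (Volta) and newer. A minimal sketch of the pattern the kernel relies on, with hypothetical names (`warp_aggregated_add`, `dst`, `val`):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>  // cg::reduce (CUDA 11+)
namespace cg = cooperative_groups;

// Sketch only: group the lanes of a warp that share the same gaussian id
// `gid`, reduce within each group, and let one lane per group do a single
// atomicAdd on behalf of its peers.
__device__ void warp_aggregated_add(float* dst, float val, int gid) {
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    // Requires __match_any_sync under the hood -> sm_70+ only
    auto group = cg::labeled_partition(warp, gid);
    val = cg::reduce(group, val, cg::plus<float>());
    if (group.thread_rank() == 0) {
        atomicAdd(dst + gid, val);
    }
}
```

The point of the pattern is to turn up-to-32 atomics per warp into one atomic per distinct `gid`, which is why a replacement has to preserve the grouping, not just the reduction.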
A workaround solution is welcome!
Unfortunately, I know zero CUDA. I can try to hack my way to an implementation using ChatGPT, but I would need some guidance if someone with more CUDA experience is willing. Is cg::labeled_partition the only thing holding back the use of sub-7.0 GPUs?
So far this is what I've come up with:

```cuda
#if __CUDA_ARCH__ >= 700
    // Write out results with warp-level reduction
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    auto warp_group_g = cg::labeled_partition(warp, gid);
    if (v_means != nullptr) {
        warpSum(v_mean, warp_group_g);
        if (warp_group_g.thread_rank() == 0) {
            v_means += gid * 3;
            GSPLAT_PRAGMA_UNROLL
            for (uint32_t i = 0; i < 3; i++) {
                gpuAtomicAdd(v_means + i, v_mean[i]);
            }
        }
    }
    // Similar code for v_covars and v_viewmats
#else
    // Alternative implementation using shared memory and atomic operations.
    // Caveat: unlike labeled_partition, this sums over ALL threads in the
    // block, so it is only correct if every thread in the block shares the
    // same gid -- otherwise the contributions need to be grouped first.
    if (v_means != nullptr) {
        // Shared memory for the block-level reduction
        // (BLOCK_SIZE is the number of threads per block)
        __shared__ vec3<OpT> shared_v_mean[BLOCK_SIZE];
        shared_v_mean[threadIdx.x] = v_mean;
        __syncthreads();
        // Tree reduction over the block
        for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride) {
                shared_v_mean[threadIdx.x] += shared_v_mean[threadIdx.x + stride];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            v_means += gid * 3;
            GSPLAT_PRAGMA_UNROLL
            for (uint32_t i = 0; i < 3; i++) {
                gpuAtomicAdd(v_means + i, shared_v_mean[0][i]);
            }
        }
    }
    // Similar code for v_covars and v_viewmats
#endif
```
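An alternative to the shared-memory version would be to keep the warp-level semantics and emulate the lane matching itself. Before `__match_any_sync` existed, the peer mask could be computed with a ballot/shuffle loop (the classic warp-aggregated-atomics pattern). A sketch, untested, with `get_peers` as a hypothetical helper name:

```cuda
// Hypothetical pre-Volta replacement for __match_any_sync: returns a
// bitmask of the lanes in this warp whose `key` equals ours.
__device__ unsigned get_peers(int key) {
    unsigned peers = 0;
    unsigned unclaimed = __ballot(1);  // mask of currently active lanes
    bool is_peer;
    do {
        // the lowest still-unclaimed lane broadcasts its key
        int other = __shfl(key, __ffs(unclaimed) - 1);
        is_peer = (key == other);
        // all lanes matching the broadcast key form one group and exit
        peers = __ballot(is_peer);
        unclaimed ^= peers;
    } while (!is_peer);
    return peers;
}
```

Each group's leader (`__ffs(peers) - 1`) would then perform the atomicAdd for the whole group, avoiding both `cg::labeled_partition` and the shared-memory round trip. Two caveats: this relies on pre-Volta lockstep warp execution, so it must stay behind the `__CUDA_ARCH__ < 700` branch, and the legacy `__ballot`/`__shfl` intrinsics are deprecated, so depending on the toolkit version the `_sync` variants with an explicit mask may be needed instead.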
I think labeled_partition is the only thing holding it back, but fixing it will require many more changes than this, at all the places where it is used.
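Since the same pattern appears at multiple call sites, one way to limit the churn might be to hide the architecture split behind a single helper, so each call site stays unchanged. A rough sketch, with `warp_label_sum` as a hypothetical name:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Hypothetical wrapper: one place holds the sm_70 / pre-sm_70 split.
template <typename T>
__device__ void warp_label_sum(T* dst, T val, int gid) {
#if __CUDA_ARCH__ >= 700
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    auto group = cg::labeled_partition(warp, gid);
    val = cg::reduce(group, val, cg::plus<T>());
    if (group.thread_rank() == 0) atomicAdd(dst, val);
#else
    // Simplest fallback that is correct on any architecture: one atomic
    // per thread. Slower, but a working baseline before optimizing with a
    // ballot/shuffle peer loop or a block-level reduction.
    atomicAdd(dst, val);
#endif
}
```

Starting from the correct-but-slow fallback would at least let the tests pass on older GPUs while the faster pre-Volta path is worked out.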
Feel free to test your ideas locally and see if they pass the tests (`pytest tests/`). We are happy to accept a PR about it if anyone comes up with a fix.
Got it, will see if I can get this working!
Hi, I have the same problem. Have you solved it yet?
Unfortunately not; I did some initial digging but haven't had time to come up with a fix.
I have an older 1080 Ti and found that after upgrading from gsplat v0.1.12 to >=v1.0.0 I can no longer use the gsplat library. I don't know if there are plans to support older GPUs, or if there's any way to still allow training, similar to how tiny-cuda-nn provides the following
This is the error I'm getting with the later version of gsplat:
I also want to point out that the original Inria code and rasterizer both still work on my 1080 Ti.