mikeqzy / 3dgs-avatar-release

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Custom dataset requirements #25

Closed: agelosk closed this issue 1 week ago

agelosk commented 2 months ago

Hello and thank you for your work.

I am trying to run your method on a custom dataset and want to understand the minimum data requirements. Given a dataset structurally similar to ZJU-MoCap, your method can run on it directly after this preprocessing script. Thus, whatever this script needs in order to run is also enough to run your method.

After studying the preprocessing script, I concluded it needs the following to run: per-frame RGB images, a human mask per frame, camera intrinsics and extrinsics (K, R, T, plus distortion D), and per-frame SMPL parameters (poses, shapes, Rh, Th).

Thus, if we are just given an RGB video and want to run your method, we should 1) run a masking method to produce a human mask per frame, 2) run a method that gives us the intrinsics and extrinsics per camera (most probably COLMAP), 3) run a 3D human reconstruction method to get poses, shapes, Rh and Th per frame, and 4) format the data properly, run the preprocessing script, and then run your train.py file.
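For concreteness, here is a rough sketch of how I currently understand the ZJU-MoCap-style camera annotation; the exact keys, shapes and paths are my assumption from reading the preprocessing script, so please correct me if I got them wrong:

```python
import numpy as np

n_cams, n_frames = 1, 100  # a single-camera custom video (example values)

# Per-camera calibration; D can be all zeros if the video is undistorted.
K = np.array([[1000., 0., 512.], [0., 1000., 512.], [0., 0., 1.]])  # intrinsics
R = np.eye(3)            # world-to-camera rotation
T = np.zeros((3, 1))     # world-to-camera translation
D = np.zeros((5, 1))     # distortion coefficients

# annots.npy-style dict (structure assumed from ZJU-MoCap):
annots = {
    'cams': {'K': [K] * n_cams, 'R': [R] * n_cams, 'T': [T] * n_cams, 'D': [D] * n_cams},
    'ims': [{'ims': [f'Camera_B1/{i:06d}.jpg']} for i in range(n_frames)],
}
np.save('annots.npy', annots)
```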

Hopefully that sums up the procedure correctly. I have two questions, though: 1) What is D, and how do we obtain it? 2) How do we obtain 'poses', 'Rh', 'Th' and 'shapes' per frame? Does SPIN predict this information for each frame?

Thank you for your work and time, Agelos

mikeqzy commented 2 months ago

Hello, thank you for your interest in our work. Your conclusions are absolutely correct. Regarding your questions:

  1. D contains the camera distortion parameters provided by ZJU-MoCap. It is usually safe to set them all to zero.
  2. These are the body pose, global orientation, global translation, and shape parameters, respectively. You can estimate SMPL pose and shape with any off-the-shelf method, including SPIN for sure; see the sketch after this list.
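As a rough illustration (the exact keys and shapes should be double-checked against the actual ZJU-MoCap parameter files), packing a per-frame SMPL estimate could look something like this:

```python
import numpy as np

# Hypothetical per-frame outputs of an off-the-shelf SMPL estimator such as SPIN.
global_orient = np.zeros(3)   # Rh: global orientation in axis-angle
body_pose = np.zeros(69)      # 23 body joints x 3 axis-angle parameters
transl = np.zeros(3)          # Th: global translation
betas = np.zeros(10)          # shapes: SMPL shape coefficients

# ZJU-MoCap-style per-frame parameter file (keys and shapes assumed):
params = {
    # root rotation is carried by Rh, so the first 3 pose entries are left at zero
    'poses': np.concatenate([np.zeros(3), body_pose])[None],  # (1, 72)
    'Rh': global_orient[None],                                 # (1, 3)
    'Th': transl[None],                                        # (1, 3)
    'shapes': betas[None],                                     # (1, 10)
}
np.save('params/000000.npy', params)
```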

Recently, my colleague provided a script for data processing on custom videos; please refer to https://github.com/mikeqzy/3dgs-avatar-release/issues/11#issuecomment-2308474585. Hope it helps!

agelosk commented 1 month ago

Thanks for your reply. The code provided by your colleague is indeed very useful. A few follow-up questions:

  1. Any idea why, in 3, R is set to the identity matrix? I have COLMAP information per camera. Is it safe to set K directly from cameras.txt, R = quaternion_to_matrix(qw, qx, qy, qz), and T = -R.T * [tx, ty, tz], where qw, qx, qy, qz, tx, ty, tz come directly from images.txt?
  2. I did try the above, but 3dgs-avatar fails to train: I just get black rendered images, and at iteration 3000 I get an error at the line `q1[0] = 1.  # [1,0,0,0] represents identity rotation`: `IndexError: index 0 is out of bounds for dimension 0 with size 0`.

This is on the Neuman dataset. Obviously there is something wrong with my K, R, T inputs. When I use 3 to initialize K, R, T, the model does train, but I get very poor results, indicating that something is still wrong. Can you enlighten me on how I should initialize K, R, T from COLMAP? A sketch of my current conversion is below.
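For reference, this is roughly what I am doing, assuming COLMAP's images.txt stores a world-to-camera rotation (as a quaternion) and translation, and cameras.txt stores a PINHOLE model (fx, fy, cx, cy); the numeric values below are placeholders:

```python
import numpy as np

def qvec2rotmat(q):
    # COLMAP quaternion (qw, qx, qy, qz) -> 3x3 rotation matrix.
    w, x, y, z = q
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*w*z,     2*x*z + 2*w*y],
        [2*x*y + 2*w*z,     1 - 2*x*x - 2*z*z, 2*y*z - 2*w*x],
        [2*x*z - 2*w*y,     2*y*z + 2*w*x,     1 - 2*x*x - 2*y*y],
    ])

# Placeholder values; in practice these are read from cameras.txt / images.txt.
fx, fy, cx, cy = 1000.0, 1000.0, 512.0, 512.0
qvec = np.array([1.0, 0.0, 0.0, 0.0])
tvec = np.array([0.0, 0.0, 0.0])

K = np.array([[fx, 0., cx], [0., fy, cy], [0., 0., 1.]])
R_w2c = qvec2rotmat(qvec)          # x_cam = R_w2c @ x_world + tvec
T_w2c = tvec
cam_center = -R_w2c.T @ tvec       # camera position in world coordinates
```

My uncertainty is whether the ZJU-MoCap-style R and T should be the world-to-camera pair (R_w2c, T_w2c) or involve the camera center, which is why my original attempt used T = -R.T * t.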

(image attached)

Appreciate your time, Agelos

mikeqzy commented 1 month ago
  1. I'm not familiar with mmhuman3d, but here's my guess: the SMPL estimation should be aligned with the camera calibration. For ZJU-MoCap, the SMPL parameters are in world coordinates and the cameras are calibrated. For a custom video, we have no ground-truth camera calibration. The SMPL estimate from 3 is in the camera space of each given camera, with preset intrinsics, see here. So the SMPL parameters from 3 are view-dependent and not consistent in world coordinates, but they are aligned with the initialized camera (preset K, identity R, zero T). With a COLMAP camera estimate, you might need additional optimization to get consistent world-coordinate SMPL parameters; see the sketch after this list.
  2. The model enables the non-rigid deformation module at step 3000, though I'm not sure why this error happens.
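As a minimal sketch of what that alignment could look like (assuming the per-frame SMPL estimate gives Rh/Th in the camera frame and COLMAP gives world-to-camera extrinsics; this ignores the SMPL pelvis offset and COLMAP's unknown global scale, so treat it only as a starting point):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical camera-space SMPL root transform for one frame (axis-angle Rh, translation Th).
Rh_cam = np.array([0.1, 0.0, 0.0])
Th_cam = np.array([0.0, 0.3, 2.5])

# World-to-camera extrinsics of that frame, e.g. from COLMAP: x_cam = R @ x_world + T.
R = np.eye(3)
T = np.zeros(3)

# x_world = R.T @ (x_cam - T), so the root transform moves to world coordinates as:
Rh_world = Rotation.from_matrix(R.T @ Rotation.from_rotvec(Rh_cam).as_matrix()).as_rotvec()
Th_world = R.T @ (Th_cam - T)
# Caveats: the SMPL global rotation pivots around the pelvis joint rather than the origin,
# and a COLMAP reconstruction is only defined up to scale, so further optimization is likely needed.
```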

By using the default camera parameters from 3, I believe the model will behave normally on the Neuman dataset. Hope it helps!