ingra14m / Deformable-3D-Gaussians

[CVPR 2024] Official implementation of "Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction"
https://ingra14m.github.io/Deformable-Gaussians/
MIT License

About DyNeRF dataset #15

Closed ch1998 closed 1 year ago

ch1998 commented 1 year ago

Hello, you tested multiple datasets in the paper, but I didn't find the processing code for DyNeRF in the project. Will it be added in the future?

pablodawson commented 1 year ago

Not an author, but here's the preprocessing that worked for me. To train, just pass `-s (dynerf path)`.

import os
import cv2

data_path = "data/dynerf/sear_steak/"

# DyNeRF scenes ship multi-view videos named cam00.mp4 ... cam20.mp4
for i in range(21):
    name = os.path.join(data_path, f"cam{i:02d}")
    outpath = os.path.join(name, "images")

    # Skip cameras that have already been extracted
    if os.path.isdir(outpath):
        continue

    cam = cv2.VideoCapture(name + ".mp4")
    if not cam.isOpened():
        print(f"Error opening video stream or file: {name}.mp4")
        continue

    os.makedirs(outpath)

    # Read until the video is completed, writing one PNG per frame
    j = 0
    while True:
        ret, frame = cam.read()
        if not ret:
            break
        # Downscale to half resolution (the DyNeRF videos are 2704x2028)
        frame = cv2.resize(frame, (1352, 1014), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(outpath, f"{j:04d}.png"), frame)
        j += 1

    cam.release()
    print(f"Done with {name}")
ch1998 commented 1 year ago

> Not an author, but here's the preprocessing that worked for me. To train, just pass `-s (dynerf path)`. […]

Thank you very much for your reply, I will try it!

ingra14m commented 1 year ago

Hi, we have already supported DyNeRF.

However, I emphasized monocular in the title because the majority of multi-view datasets are essentially sparse: around 20 viewpoints is sparse for 3D-GS, which is why Lego has the worst convergence among the D-NeRF scenes. Therefore, I believe that GS-based methods have no advantage on multi-view data compared to IBR-based methods like ENeRF, Im4D, and 4K4D. Of course, an extreme point-cloud initialization, as in 4d-gaussian-splatting, can help alleviate this issue.

From my perspective, I would recommend this method more for monocular datasets.

ch1998 commented 1 year ago

I tested multi-view and monocular data at the same time, and the results were very different. The monocular data performed very well, while the multi-view data (DyNeRF cook_spinach) gave poor results, with no clear images generated. The initialization point cloud I used was obtained by running COLMAP on the first frame. I can understand that multi-view data may not reach the quality of monocular data, but why is the quality so poor that the content is completely unrecognizable?

ingra14m commented 1 year ago

I believe there are two reasons.

  1. The number of viewpoints. Cook-Spinach has only about 20 training viewpoints, whereas D-NeRF and other real-world monocular datasets have rich training viewpoints. This hurts the geometric convergence of GS, leading to depth estimation errors and poor rendering results.
  2. The coverage range of viewpoints. A noteworthy detail is that the official datasets for 3D-GS also cover 360°: D-NeRF covers 360° and NeRF-DS covers 90-180°, and this coverage range likewise affects the convergence of depth. When training cannot ensure accurate depth, satisfactory results can only be obtained through better initialization, but that is too tricky.

Another phenomenon observed in the experiments is that if 3D-GS can converge, deformable-GS will converge as well. Keep an eye out for the next version of the paper; I have already submitted the updated version to arXiv.

My experiment results on the DyNeRF dataset (300 frames) are shown as follows:

| Scene | Test PSNR | Best iteration |
| --- | --- | --- |
| sear-steak | 33.239/36.7357 | 20k |
| coffee-martini | 27.072/34.1954 | 10k |
| flame-steak | 33.169/32.8128 | 33k |
| cut-beef | 32.803/34.8096 | 20k |
| cook-spinach | 32.803/34.3375 | 28k |
| flame-salmon | 26.938/28.75 (OOM) | 6k |

Sparse viewpoints can result in overfitting on the training set.

ch1998 commented 1 year ago

Thanks for the reply. Looking forward to your new paper.

ingra14m commented 1 year ago

Here's a preview version, if you don't mind.

ch1998 commented 1 year ago

> I believe there are two reasons. […]

Are there any special parameters when you train cook-spinach? I extracted 300 frames from each video for training. The camera parameters come from the official .npy file, and points3D.ply was generated by COLMAP. The PSNR I get from training is not as high as what you reported. Should I reduce the number of frames used?

ingra14m commented 1 year ago

Of course, reducing the number of frames can increase PSNR. However, I believe the inherent reason GS-based methods struggle with multi-view datasets is that 3D-GS relies more on rich viewpoints than NeRF does (a common issue in point-based rendering), rather than on point-cloud initialization or sequence length (indeed, it is challenging to handle long captures with a deformation field).
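
For illustration, a hypothetical frame-capped variant of the extraction loop posted earlier in this thread; the function name, default cap of 150, and target resolution are illustrative assumptions, not values from the authors:

import os
import cv2

def extract_frames(video_path, outpath, max_frames=150, size=(1352, 1014)):
    # Write at most max_frames frames of video_path into outpath as PNGs.
    cam = cv2.VideoCapture(video_path)
    if not cam.isOpened():
        return 0
    os.makedirs(outpath, exist_ok=True)
    j = 0
    while j < max_frames:
        ret, frame = cam.read()
        if not ret:
            break
        frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(outpath, f"{j:04d}.png"), frame)
        j += 1
    cam.release()
    return j  # number of frames actually written
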

If someone can effectively solve 3D-GS's ability to model sparse-view scenes, I believe our method can model multi-view scenes without relying on tricks.