kxhit / EscherNet

[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis
https://kxhit.github.io/EscherNet

3D Reconstruction for text-to-3D #6

Open kszpxxzmc opened 7 months ago

kszpxxzmc commented 7 months ago

When will this come out?

kxhit commented 7 months ago

I think it's already out there. Pick an off-the-shelf text-to-image model to get the input images, feed them to EscherNet to generate novel views, then run NeuS on those views to reconstruct the 3D geometry.
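
For concreteness, here is a minimal sketch of that pipeline. The three helpers are placeholders for the respective tools (any text-to-image model, eval_eschernet.py, and the NeuS code in 3drecon), not real APIs:

    # Pipeline sketch only: each helper stands in for an external tool and
    # is NOT a real EscherNet/NeuS API; wire in the actual scripts yourself.
    def text_to_image(prompt):
        raise NotImplementedError("stage 1: any off-the-shelf T2I model")

    def eschernet_novel_views(image, num_views=100):
        raise NotImplementedError("stage 2: eval_eschernet.py, 1 view in -> N posed views out")

    def neus_reconstruct(posed_views):
        raise NotImplementedError("stage 3: NeuS reconstruction in 3drecon")

    def text_to_3d(prompt):
        image = text_to_image(prompt)
        posed_views = eschernet_novel_views(image)
        return neus_reconstruct(posed_views)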

kszpxxzmc commented 7 months ago

I generate pose.npy from eval_eschernet.py lines 476-489 as below:

            elif DATA_TYPE == "MVDream" or DATA_TYPE == "Text2Img":
                img_path = None
                azimuth, polar = angles_out[T_out_index]
                if CaPE_TYPE == "4DoF":
                    pose_out.append(torch.tensor([np.deg2rad(polar), np.deg2rad(azimuth), 0., 0.]))
                elif CaPE_TYPE == "6DoF":
                    # camera-to-world pose looking at the origin, inverted and
                    # z-flipped to match the expected convention
                    pose = look_at(origin, xyzs[T_out_index], up)
                    pose = np.linalg.inv(pose)
                    pose[2, :] *= -1
                    pose_out.append(torch.from_numpy(get_pose(pose)))

                    # my addition: dump all poses once the 100 output views are collected
                    print(len(pose_out))
                    if len(pose_out) == 100:
                        np.save('pose.npy', pose_out)
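
As a side note, stacking the list into one array before saving makes the on-disk layout unambiguous. A sketch, assuming every entry of pose_out is a 4x4 tensor:

    import numpy as np
    import torch

    # standalone example: stack the accumulated poses explicitly rather than
    # relying on np.save to coerce a Python list of tensors
    pose_out = [torch.eye(4) for _ in range(100)]   # stands in for the real poses
    np.save('pose.npy', torch.stack(pose_out).cpu().numpy())
    assert np.load('pose.npy').shape == (100, 4, 4)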

and use pose.npy in 3drecon/renderer.py lines 501-512 for mesh extraction as below:

    for index in range(self.num_images):

        # note: this reloads the same file on every iteration; hoisting the
        # np.load above the loop would be cleaner
        pose_path = '/mnt/petrelfs/xxx/xxx/EscherNet/pose.npy'
        pose_all = np.load(pose_path)
        # print(pose_all)
        pose = pose_all[index][:3, :]   # 3x4 pose, in blender convention
        self.poses.append(pose)
        # recover spherical coordinates from the pose matrix
        theta, azimuth, radius = get_pose(pose)
        print(theta, azimuth, radius)
        self.azs.append(azimuth)
        self.els.append(theta)
        self.dists.append(radius)
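
A quick sanity check on these poses is whether the recovered camera centers actually sit on a sphere of the expected radius; this also catches convention mismatches (world-to-camera vs. camera-to-world). A sketch, assuming pose.npy holds (N, 4, 4) world-to-camera matrices:

    import numpy as np

    poses = np.load('pose.npy')               # assumed (N, 4, 4), world-to-camera
    R, t = poses[:, :3, :3], poses[:, :3, 3]
    centers = -np.einsum('nij,ni->nj', R, t)  # camera center c = -R^T t
    radii = np.linalg.norm(centers, axis=-1)
    print(radii.min(), radii.max())           # should both be ~1.5 on this trajectory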

and I get theta, azimuth, radius as below:

    [0.0007854] [-0.01] [1.5]
    [0.03220132] [-0.41] [1.5]
    [0.06361725] [-0.81] [1.5]
    [0.09503318] [-1.21] [1.5]
    ...
    [3.0795462] [-1.51088816] [1.5]
    [3.11096213] [-1.91088816] [1.5]

(100 triples in total: theta climbs from ~0.0008 to ~3.11 rad in steps of pi/100, azimuth decreases by 0.4 rad per view and wraps at +-pi, and radius is fixed at 1.5.)

However, I cannot reconstruct the 3D object. Could you explain how to get pose.npy for N1M100 3D generation?

Dipan-Zhang commented 2 months ago

hey dear author,

thanks for this great work! I have a similar question about 3D reconstruction from single-image input.

I have tried two methods, but so far neither has succeeded.

Method 1:

Use data_type='Text2Img' as you described above; EscherNet then generates multi-view images according to the poses created by get_archimedean_spiral.

But according to this line https://github.com/kxhit/EscherNet/blob/10b650492ba97b5104a3136de07d1a67f4ada458/3drecon/renderer/renderer.py#L494 the renderer uses the fixed camera poses of the GSO dataset. So there is a mismatch between the generated multi-view images and the NeuS renderer, which causes bad reconstruction quality.
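
For reference, the spiral trajectory itself is easy to reproduce. Here is a sketch consistent with the printed poses above (elevation sweeping 0 to pi over 100 views, azimuth advancing 0.4 rad per view, radius 1.5), though not necessarily the exact get_archimedean_spiral from the repo:

    import numpy as np

    # illustrative Archimedean spiral on a sphere; an assumption-labeled
    # sketch, not the repo's exact implementation
    def archimedean_spiral(radius=1.5, num_views=100, total_azimuth=40.0):
        theta = np.linspace(0, np.pi, num_views, endpoint=False)  # elevation 0 -> pi
        phi = total_azimuth * theta / np.pi                       # 0.4 rad azimuth step per view
        x = radius * np.sin(theta) * np.cos(phi)
        y = radius * np.sin(theta) * np.sin(phi)
        z = radius * np.cos(theta)
        return np.stack([x, y, z], axis=-1)                       # (num_views, 3) camera positions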

Method 2:

Use data_type='GSO3D', create a subfolder with a similar structure in Data/GSO30/ (for example 'bottle'), and copy camera poses (*.npy) over from other cases. The generated images are already a mess, since the copied camera poses are not right for the new input. I tried this because, this way, the generation pipeline uses the same camera poses as the 3D reconstruction.

Essentially, I want to locally reproduce the results shown in the demo on the project page from a single image input. Can you maybe give some hints? Should I estimate the camera pose of the input image using DUSt3R?

thanks!

kxhit commented 2 months ago

Thanks @Dipan-Zhang! Yeah, you need to modify the camera coordinates accordingly. In the 3drecon code, the poses are assumed to follow the GSO setting by default; we didn't focus on 3D reconstruction, as there are many methods for it. It should be easy to modify.
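
For example, the usual OpenGL/Blender-to-OpenCV fix is a sign flip on the camera's Y and Z axes. This is a common recipe, assuming camera-to-world matrices; verify it against your own data rather than taking it as this repo's converter:

    import numpy as np

    # OpenGL/Blender cameras look down -Z with +Y up; OpenCV cameras look
    # down +Z with +Y down, so negate the camera-frame Y and Z axes
    def opengl_to_opencv(c2w):
        flip = np.diag([1.0, -1.0, -1.0, 1.0])
        return c2w @ flip   # the flip is its own inverse, so it converts both ways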

Dipan-Zhang commented 2 months ago

got it, thanks a lot for the tip ;)