princeton-vl / DPVO

Deep Patch Visual Odometry/SLAM
MIT License

Real-time pose estimation #27

Open · Lanzo98 opened this issue 1 year ago

Lanzo98 commented 1 year ago

Hi, I'm trying to use DPVO to get the pose for each frame in real time, e.g. printing xyz and a quaternion. Do you have any hints for this? Looking at the function https://github.com/princeton-vl/DPVO/blob/4f2f0cc7efbfe2547e788844412a3a2a72a923bd/dpvo/dpvo.py#L153 I'm trying to print lietorch.SE3(slam.poses_[-1]).data.cpu().numpy(), but I always get the value [0. 0. 0. 0. 0. 0. 1.].

Here is the modified loop inside the run function of demo.py:

while 1:
        (t, image, intrinsics) = queue.get()
        if t < 0: break

        image = torch.from_numpy(image).permute(2,0,1).cuda()
        intrinsics = torch.from_numpy(intrinsics).cuda()

        if slam is None:
            slam = DPVO(cfg, network, ht=image.shape[1], wd=image.shape[2], viz=viz)

        image = image.cuda()
        intrinsics = intrinsics.cuda()

        with Timer("SLAM", enabled=timeit):
            slam(t, image, intrinsics)

        print(lietorch.SE3(slam.poses_[-1]).data.cpu().numpy())

What am I doing wrong? Thanks in advance.

lahavlipson commented 1 year ago

You should be doing print(lietorch.SE3(slam.poses_[n-1]).data.cpu().numpy())

EDIT: (See next comment)

Lanzo98 commented 1 year ago

Thanks, it works! I'm also struggling to understand the format of the pose stored in the poses_ variable. SE3 should be [x, y, z, qw, qx, qy, qz], right? I saved the values returned from lietorch.SE3(slam.poses_[n-1]).data.cpu().numpy() and the poses returned from the terminate function. They are quite different (the test was done on the iPhone IMG_0492.MOV video), and I discovered this is due to the inverse applied in https://github.com/princeton-vl/DPVO/blob/4f2f0cc7efbfe2547e788844412a3a2a72a923bd/dpvo/dpvo.py#L168, which I cannot use since I'm working on single pose values. I also found that some conversions are done for the viewer with CUDA. Is there a different way of getting the correct values [x, y, z, qw, qx, qy, qz]? Do you have any hints?

lahavlipson commented 1 year ago

Internally, DPVO stores poses as a mapping from world coordinates to camera coordinates. The actual camera poses are the inverse of this, i.e. a mapping from camera coordinates to world coordinates. So to return the correct camera poses per-frame, you should actually do

lietorch.SE3(slam.poses_[n-1]).inv().data.cpu().numpy()

FYI, different datasets and libraries represent rotation quaternions differently, usually either [qx, qy, qz, qw] (e.g. lietorch) or [qw, qx, qy, qz] (e.g. pytorch3d).
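For anyone else reading this, a minimal sketch of how the above could be wired into the demo loop. It assumes slam is the DPVO instance from demo.py and that slam.n counts the frames processed so far; latest_camera_pose and to_wxyz are just illustrative helpers, not part of the repo.

import numpy as np
import lietorch

def latest_camera_pose(slam):
    # poses_ holds world-to-camera transforms; invert to get camera-to-world
    T_cam_to_world = lietorch.SE3(slam.poses_[slam.n - 1]).inv()
    # lietorch order: [x, y, z, qx, qy, qz, qw]
    return T_cam_to_world.data.cpu().numpy()

def to_wxyz(pose):
    # reorder the quaternion if a downstream consumer expects [x, y, z, qw, qx, qy, qz]
    x, y, z, qx, qy, qz, qw = pose
    return np.array([x, y, z, qw, qx, qy, qz])

Inside the loop from the first comment, calling print(latest_camera_pose(slam)) right after slam(t, image, intrinsics) would then print the most recent pose on every iteration.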

Lanzo98 commented 1 year ago

Thank you very much, all clear now.

senecobis commented 1 year ago

Just a side comment on this: @lahavlipson, you are saying that the last pose is given by lietorch.SE3(slam.poses_[n-1]).inv().data.cpu().numpy(), but n is the keyframe index, so this is not actually the pose of the last frame, but rather of the last keyframe, which might be far back in time depending on your configs.

The real pose corresponding to time t (I guess that is what @Lanzo98 is asking for) is only obtainable after running slam.terminate(), which interpolates the missing poses in between keyframes; only then do you get the "real" pose at time t. Correct me if I'm wrong.

Is it really feasible in DPVO to get the estimated pose for each new frame?
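(For comparison, the offline path described here would look roughly like the sketch below. It assumes terminate() returns the interpolated trajectory and timestamps as (poses, tstamps) the way demo.py uses it, and frame_stream is just a placeholder for your input loop; verify the exact return signature against your version of the repo.)

# feed every frame, then let terminate() run the final optimization and
# fill in poses for the non-keyframe timestamps
for (t, image, intrinsics) in frame_stream:
    slam(t, image, intrinsics)

poses, tstamps = slam.terminate()
for t, pose in zip(tstamps, poses):
    print(t, pose)   # camera-to-world pose for every input frame, not just keyframes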

lahavlipson commented 1 year ago

@senecobis DPVO treats every new frame as a keyframe, and only removes keyframes when they are 3 timesteps old or fall out of the optimization window. So keyframes t-1, t-2 and t-3 are indeed the most recent 3 frames.

senecobis commented 1 year ago

@lahavlipson But then what is the whole purpose of this code part inside the dpvo evaluation script?

I understood that if the estimated flow between i and j is less than self.cfg.KEYFRAME_THRESH (which is 15.0), then we remove the last frame from the keyframes, where i is the 5th-to-last keyframe (since self.cfg.KEYFRAME_INDEX = 4) and j is the 3rd-to-last keyframe.

Or were you referring to training? In training it is certainly true that every frame is a keyframe.

lahavlipson commented 1 year ago

@senecobis Your understanding of that code is almost correct; we don't remove the last frame if the flow between (n-3) and (n-5) is small, we remove the in-between frame (n-4).
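In other words, the keyframing rule being discussed could be sketched like this (paraphrased pseudocode, not the actual repo code; motion_magnitude and remove_keyframe are hypothetical stand-ins for the corresponding operations):

# with cfg.KEYFRAME_INDEX = 4 and cfg.KEYFRAME_THRESH = 15.0
i = n - cfg.KEYFRAME_INDEX - 1   # 5th-to-last keyframe (n - 5)
j = n - cfg.KEYFRAME_INDEX + 1   # 3rd-to-last keyframe (n - 3)
if motion_magnitude(i, j) < cfg.KEYFRAME_THRESH:
    # little motion between (n-5) and (n-3): drop the in-between frame, not the last one
    remove_keyframe(n - cfg.KEYFRAME_INDEX)   # removes frame (n - 4)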

spokV commented 7 months ago

Hi @lahavlipson, I'm publishing the result of lietorch.SE3(slam.poses_[n-1]).inv().data.cpu().numpy() in a PoseStamped ROS message. I can view the position correctly, but the rotation quaternion seems wrong (please see image). Do you have any idea why? I'm taking into account that the quaternion order in the pose vector is [qx, qy, qz, qw].

[Screenshot from 2024-02-07 11-51-49]
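For reference, a minimal sketch of how the lietorch [x, y, z, qx, qy, qz, qw] vector would map onto a PoseStamped (assuming rospy/geometry_msgs; pose is the inverted camera-to-world vector from above, and to_pose_stamped is just an illustrative helper):

import rospy
from geometry_msgs.msg import PoseStamped

def to_pose_stamped(pose, frame_id="map"):
    # pose: [x, y, z, qx, qy, qz, qw] from lietorch.SE3(...).inv().data.cpu().numpy()
    msg = PoseStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = frame_id
    msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = map(float, pose[:3])
    # lietorch's [qx, qy, qz, qw] matches the ROS orientation field order
    msg.pose.orientation.x = float(pose[3])
    msg.pose.orientation.y = float(pose[4])
    msg.pose.orientation.z = float(pose[5])
    msg.pose.orientation.w = float(pose[6])
    return msg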