princeton-vl / DROID-SLAM

BSD 3-Clause "New" or "Revised" License
1.66k stars · 273 forks

Question about the pose parsing in TartanAir dataset #14

Closed xwjabc closed 2 years ago

xwjabc commented 2 years ago

Hi, thank you for your great work! Recently when I read the data loading part of Droid-SLAM, I found the pose is parsed as:

poses = np.loadtxt(osp.join(scene, 'pose_left.txt'), delimiter=' ')
poses = poses[:, [1, 2, 0, 4, 5, 3, 6]]   

From the TartanAir tools, each line of the pose file has the format tx ty tz qx qy qz qw, expressed in an NED frame. In your implementation it seems you convert XYZ to YZX coordinates. I wonder if there is any reason behind this? Thanks!
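For reference, here is a minimal sketch of what that index permutation does, using made-up pose values (the remapping itself matches the snippet above: translation axes are reordered, the qx/qy/qz quaternion components are reordered the same way, and the scalar qw stays last):

```python
import numpy as np

# Hypothetical sample: two poses in TartanAir's NED order (tx ty tz qx qy qz qw).
# The numeric values are made up for illustration.
poses = np.array([
    [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 1.0],
    [4.0, 5.0, 6.0, 0.0, 0.7071, 0.0, 0.7071],
])

# The permutation [1, 2, 0, 4, 5, 3, 6] remaps NED axes
# (x = forward, y = right, z = down) to a camera convention
# (x = right, y = down, z = forward): new tx = old ty, new ty = old tz,
# new tz = old tx, with the same shuffle applied to qx, qy, qz.
remapped = poses[:, [1, 2, 0, 4, 5, 3, 6]]

print(remapped[0])  # translation (1, 2, 3) becomes (2, 3, 1)
```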

zachteed commented 2 years ago

Hi, I'm converting the poses so that the directions are X = right/left, Y = up/down, Z = forward/backward.

This is the typical format used by most datasets I've encountered like ETH3D and TUM-RGBD.

The main reason for this conversion is that this is the format I'm most comfortable working with. It also leads to simpler equations for projecting points between images.
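A sketch of why the equations get simpler: with x = right, y = down, z = forward, pinhole projection needs no axis shuffling at all. The intrinsics below are made-up values for illustration, not DROID-SLAM's actual calibration:

```python
import numpy as np

# Hypothetical pinhole intrinsics (focal lengths and principal point).
fx, fy, cx, cy = 320.0, 320.0, 320.0, 240.0

def project(p):
    """Project a 3D point in camera coordinates (x right, y down, z forward)
    to pixel coordinates. Depth is just the z component."""
    x, y, z = p
    return np.array([fx * x / z + cx, fy * y / z + cy])

p_cam = np.array([0.5, -0.25, 2.0])  # point 2 m in front of the camera
print(project(p_cam))  # -> [400. 200.]
```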

xwjabc commented 2 years ago

Got it. Thank you for your quick reply!

xwjabc commented 2 years ago

For the pose parsing, I have a follow-up question: in this line, the pose matrix is inverted. I wonder if the reason is that the original pose is defined as a camera-to-world transformation, and here we convert it into a world-to-camera transformation?

Based on this interpretation, I drew a figure to show my understanding of Eq. (3) (I have already converted the axes from NED to your setting, which seems to be called "CAM" in tartanair_tools): IMG_0117

I wonder if my understanding is correct? Thanks!

zachteed commented 2 years ago

Yes, that all looks correct to me. The datasets represent poses as camera-to-world transformations. DROID-SLAM estimates world-to-camera poses, which get converted back to camera-to-world for evaluation.
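To make the two conventions concrete, here is a small sketch of the inversion step, with a made-up rotation and translation. A camera-to-world pose maps p_world = R @ p_cam + t; its inverse (world-to-camera) has rotation R.T and translation -R.T @ t:

```python
import numpy as np

# Made-up camera-to-world pose: 90-degree rotation about z, plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])

T_wc = np.eye(4)          # camera-to-world as a 4x4 homogeneous matrix
T_wc[:3, :3] = R
T_wc[:3, 3] = t

T_cw = np.eye(4)          # world-to-camera via the closed-form inverse
T_cw[:3, :3] = R.T
T_cw[:3, 3] = -R.T @ t

# Composing the pose with its inverse recovers the identity.
assert np.allclose(T_wc @ T_cw, np.eye(4))
```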

xwjabc commented 2 years ago

Gotcha. Thank you for your clarification!

Rtut654 commented 11 months ago

@xwjabc Hey! I know this is a bit off-topic, but have you trained the model without depth, using only raw images and poses? This issue mentions some trouble doing so. I wonder how to implement it, if it's possible. Thanks in advance.