jimmyyhwu / pose-interpreter-networks

Real-time robotic object pose estimation with deep learning
MIT License

Pose estimation has huge position and orientation error for one object and does not appear to decrease #32

Closed by menonmeg 3 years ago

menonmeg commented 3 years ago

Hello,

I generated my own training set and am attempting to train the pose estimation portion with just one object. I was able to go through all the steps: setting up the CAD model properly, generating the PCD files, and doing the redis-server setup. I then attempted to train the model, and after 3000 epochs the error is still quite large, roughly the same as it was at the start (37889.62 m for position, 127.81 deg for orientation). Any advice on what I might be doing incorrectly?

menonmeg commented 3 years ago

I trained with the mask yml file for this. Is it necessary to train with the object file instead?

menonmeg commented 3 years ago

Also, sorry, one more point to note: I am using Microsoft Kinect cameras and set the camera scale to 0.5. The object of interest is a golf cart that moves around on the ground (i.e. it does not translate in Z or rotate in roll or pitch, only yaw).

jimmyyhwu commented 3 years ago

Do you have any examples of what the images and target poses look like?

If you haven't done so already, it would be good to verify that it can train properly with the provided objects first.

menonmeg commented 3 years ago

Here are some mask images and poses for the object I am trying to track. It currently uses the default position and orientation generation (i.e. I did not limit the position and orientation sampling). Also, our camera is mounted at a specific angle, in case that can be incorporated into the training.

After further investigation, the error appears to increase linearly as the number of epochs increases. Is this expected behaviour?

menonmeg commented 3 years ago

[Attachments: mask_00000001 through mask_00000005, object_00000001 through object_00000005, subset_00000001.txt]

menonmeg commented 3 years ago

I also have a better CAD model generated from a 3D reconstruction of the object, but I think this issue is more related to the training of the pose estimation network than to the CAD model itself, since the two are trained separately as discussed in your paper.

menonmeg commented 3 years ago

And just confirmed: yes, I am able to train on the original data without issue, and the error is quite small and definitely decreasing.

jimmyyhwu commented 3 years ago

Thanks, that's helpful. Your CAD model should be fine. Definitely try the object rather than the mask first, since it will be easier to predict the pose with more texture in the image.

Another sanity check you can do is to visualize the ground truth poses and make sure they look correct. You can modify this notebook to do so.

Replace

rendered_pose = pose_renderers[object_index[i]].render(position[i], orientation[i])

with

rendered_pose = pose_renderers[object_index[i]].render(target[:, :3], target[:, 3:])

and it will render the ground truth poses rather than the predicted poses.

menonmeg commented 3 years ago

Hello Jimmy,

Thanks for the advice! I seem to be having a problem visualizing the pose data as you suggested, and I was wondering if you had any ideas about it. [Screenshots attached: image(1), image(2)]

Also, I decreased the learning rate by two orders of magnitude, and at least during training the error now appears to decrease at a steady rate and no longer seems to be exploding. Does this make sense to you? I am also less than 100 epochs into training, though, so I will watch it carefully. Please let me know what you think! Thanks for the help!

menonmeg commented 3 years ago

Hello Jimmy,

I was able to determine that the issue with visualization was that the target output was still on the GPU and had not been converted to numpy; I am able to visualize the results now. If you have any thoughts on my tuning of the learning rate, let me know.
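For reference, the fix amounted to something like this minimal sketch (the loop structure and variable names follow the snippets earlier in this thread, not necessarily the exact notebook code):

```python
# `target` is assumed to be an (N, 7) GPU tensor of ground-truth poses:
# a 3D position followed by a quaternion orientation.
target_np = target.detach().cpu().numpy()  # move off the GPU before rendering

for i in range(len(object_index)):
    # render the ground-truth pose instead of the predicted pose
    rendered_pose = pose_renderers[object_index[i]].render(
        target_np[i, :3],  # ground-truth position
        target_np[i, 3:],  # ground-truth orientation (quaternion)
    )
```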

Also, if you have a good idea of how to constrain the poses of the object so that only the top is ever visible (i.e. the object is never flipped over to show its underside), please let me know as well.

Thanks so much, Meghna

jimmyyhwu commented 3 years ago

Yes, tuning the learning rate for your dataset is a good strategy. Your golf cart is a lot larger in size than the objects we used, so that might be why the default learning rate is not working.

Adding constraints to poses should be pretty straightforward; just edit the code here to add whatever constraints you want.

If you want to have only upright golf carts, one way is to check the orientation and make sure it is very close to vertical. There is a function here to compute the difference between two rotations, which you can adapt.
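For illustration, a rough sketch of that kind of check, assuming orientations are sampled as unit quaternions (the helper names and threshold below are hypothetical, not the repo's actual functions):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

MAX_TILT_DEG = 10.0  # how far from vertical the cart may tilt before we reject it

def is_upright(quat_xyzw):
    """Return True if the object's z-axis is within MAX_TILT_DEG of world up."""
    object_up = R.from_quat(quat_xyzw).apply([0.0, 0.0, 1.0])
    tilt_deg = np.degrees(np.arccos(np.clip(object_up[2], -1.0, 1.0)))
    return tilt_deg < MAX_TILT_DEG

def sample_upright_orientation():
    """Rejection-sample random orientations, keeping only near-upright ones."""
    while True:
        quat = R.random().as_quat()  # scipy convention: [x, y, z, w]
        if is_upright(quat):
            return quat
```

Since the golf cart only yaws on the ground, an even simpler option is to sample just a yaw angle and build the quaternion from it directly.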

menonmeg commented 3 years ago

Hello Jimmy,

Thanks for the advice. I am working on tuning the functions in the two files you mention. We also have ground-truth data for the center of the golf cart, and I am working on incorporating it as ground truth for positioning the CAD model, so it will better reflect what is actually seen from the cameras.

It appears my training is hitting a wall: it does not seem able to get below an average position error of about 30 cm and orientation error of about 17 deg. I was wondering whether you also tried using the depth data from the Kinect for any part of this experiment, e.g. for training the segmentation model or the pose estimation model?

jimmyyhwu commented 3 years ago

Are those results for the end-to-end version with the mask network? If so, I guess that seems on par with what we found in the paper. The position error is larger but perhaps that is proportional to the size of the golf cart, which is much larger than the objects we used.

I think improving the pose distribution of the training data, as you are doing right now (no upside-down carts, no carts really close to the camera), should be helpful, since the network will train on poses that are more realistic.

Regarding the depth data, we tried it but did not explore it very far. You can look into related work in RGB-D segmentation, and perhaps sim-to-real transfer for depth data, since the real depth data is very noisy.

jimmyyhwu commented 3 years ago

Forgot to mention, one other thing you can try is to reduce the size of the golf cart by a factor of 10. Then, it should probably train fine with the original learning rate.

For prediction, you can scale the predicted position back up by the same factor of 10.
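As a concrete sketch of that idea (the constant and helper names below are only illustrative):

```python
SCALE = 10.0  # shrink the golf cart (mesh, PCD, and positions) by 10x for training

def to_training_units(position_m):
    """Scale a ground-truth position down before generating training targets."""
    return position_m / SCALE

def to_world_units(predicted_position):
    """Scale the network's predicted position back up to real-world units."""
    return predicted_position * SCALE

# Orientation is unaffected by uniform scaling, so quaternions stay as-is.
```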

menonmeg commented 3 years ago

Hi Jimmy, hope you are well. It has been some time since we last spoke, and I wanted to ask a quick clarifying question: for your CAD models, what does the (0, 0, 0) position correspond to in the data? Is it the center of the object, with +x forward, +y right, and +z up? In other words, is there a way to confirm what the pose is for each generated training image, so I can check that it matches what our RGB images look like?

After applying a transform from our world origin to the camera origin in order to produce a list of poses for the tracked object, the object of interest no longer appears in most of the generated image frames. I have attached our pose list file and some of the generated images; they do not seem to show the object in frame, even though the object is clearly in those frames according to our ground truth data.
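For reference, here is a rough sketch of the kind of world-to-camera composition I am attempting (variable names and the quaternion convention are just for illustration):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_matrix(position, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position and a quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = position
    return T

def object_pose_in_camera(T_world_object, T_world_camera):
    """Express the object's pose in the camera frame instead of the world frame."""
    T_camera_object = np.linalg.inv(T_world_camera) @ T_world_object
    position = T_camera_object[:3, 3]
    quat_xyzw = R.from_matrix(T_camera_object[:3, :3]).as_quat()
    return position, quat_xyzw
```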

[Attachments: GEM.txt, mask_00000026, mask_00000035]

Also, here is the model in Blender, which I flipped onto its side in an attempt to make the PCD orient correctly. [Screenshot attached]

jimmyyhwu commented 3 years ago

There is a pose renderer in utils.py that you can use to render the object in arbitrary poses.

You can find example usage in this notebook.
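For example, assuming a renderer has already been constructed as in that notebook, the usage pattern looks roughly like this (the pose values and quaternion convention below are arbitrary, for illustration only):

```python
import numpy as np

position = np.array([0.0, 0.0, 2.0])          # e.g. 2 m in front of the camera
orientation = np.array([1.0, 0.0, 0.0, 0.0])  # identity quaternion (convention assumed)
rendered_image = renderer.render(position, orientation)
```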