lvwj19 / PPR-Net-plus

PPR-Net++: Accurate 6D Pose Estimation in Stacked Scenarios

training problem #19

Open s2137127 opened 1 year ago

s2137127 commented 1 year ago

Hello, dear author. I have run into the following issues during training and have not been able to resolve them:

  1. When I train using the ringscrew data from the IPA Bin Picking Dataset, I achieve an AP (Average Precision) of 0.8. However, when I generate ringscrew data using Blender, the AP drops to only 0.2. On the other hand, if I train using the TLESS22 dataset generated from Blender, the AP can reach as high as 1. Additionally, when I train on generated data for other asymmetric objects, such as a doorknob, the AP is 0.9. I would like to understand why I am encountering this issue.

  2. If I use symmetric objects like ringscrews or candlesticks as models for training data, how should I set the value of G (the set of rigid transformations that have no effect on the static state of the object)? Alternatively, could you please provide me with the G values you used for training Pepper and TLESS-20 for reference? Thank you.

I am seeking your guidance or insights to help me address these problems. Thank you.

ShuttLeT commented 1 year ago

Thanks for your attention! For your first question, you can try the following to track down the problem:

  1. Check whether there are any mistakes in the labels of the ringscrew dataset you created yourself.
  2. Check whether you modified the ObjectType when you switched to the ringscrew dataset.

For your second question, G is a set of rotation matrices. When you apply any of the rotation matrices in G to the object, the space occupied by the object does not change. Intuitively, imagine what happens when you rotate a rectangular cuboid 180 degrees around one of its principal axes. Additionally, the G values for Pepper and tless20 can be found in the Sileane dataset; the download link is in the README, and G is listed in poseutil.json.
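
For illustration only, and not code taken from this repository: a minimal sketch of how the finite symmetry group G of an object that is n-fold symmetric about its z axis could be generated. With order=2 it reproduces the identity plus 180-degree rotation quoted for tless20 later in this thread; the axis and the order are assumptions you would adapt to your own model.

```python
import numpy as np

def symmetry_group_z(order):
    """Rotation matrices of an object with `order`-fold symmetry about its z axis.

    Applying any of these matrices to the object in its canonical frame leaves
    the space it occupies unchanged, which is the defining property of G.
    """
    G = []
    for k in range(order):
        theta = 2.0 * np.pi * k / order
        c, s = np.cos(theta), np.sin(theta)
        G.append(np.array([[c, -s, 0.0],
                           [s,  c, 0.0],
                           [0.0, 0.0, 1.0]]))
    return G

# order=2: identity and a 180-degree rotation about z
for R in symmetry_group_z(2):
    print(np.round(R, 3))
```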

s2137127 commented 1 year ago

Thank you for your thoughtful response. I have a few more questions to ask:

  1. How is the distance threshold set for pose recovery evaluation?
  2. I am using the Sileane Dataset tless20 for training, and the object type is defined as follows:

     type_tless20 = ObjectType(
         type_name='tless', class_idx=0, symmetry_type='finite',
         lambda_p=[[0.0155485, 0.0, 0.0],
                   [0.0, 0.0248085, 0.0],
                   [-0.0, 0.0, 0.0171969]],
         G=[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
            [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]])

    The internal parameters are based on the specifications provided by the Sileane Dataset, the model configuration is taken from the source you provide on GitHub, and the poseutil.json used for pose-recovery evaluation is also the one provided with the Sileane Dataset. During training, the data consists of cycles 0 to 250 with object counts ranging from 0 to 10, and the same range is used for testing. However, the average precision (AP) of the training results is only about 0.3. Is there something I might have missed or not modified correctly? I have noticed that the training results for symmetric objects are not good, but I am unsure of the underlying reason. I would appreciate further clarification when you have the time. Thank you.

ShuttLeT commented 1 year ago

For your first question: generally, the distance threshold is one tenth of the diameter of the object's enclosing sphere. For your second question: try visualizing the prediction results to determine whether the problem lies in the prediction stage or in the evaluation stage, then check the code carefully to locate the bug. The visualization code is in evaluate.py.
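
For reference, a minimal sketch (mine, not from the repository) of computing such a threshold from the object's model points, approximating the enclosing sphere by the farthest point from the bounding-box center:

```python
import numpy as np

def distance_threshold(model_points, fraction=0.1):
    """Return `fraction` of the approximate enclosing-sphere diameter.

    model_points: (N, 3) vertices/points of the object model, in the same
    unit as the predicted translations.
    """
    center = 0.5 * (model_points.min(axis=0) + model_points.max(axis=0))
    radius = np.linalg.norm(model_points - center, axis=1).max()
    return fraction * (2.0 * radius)
```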

s2137127 commented 1 year ago

Thank you for your previous response. I have successfully run pose estimation on a self-generated dataset and achieved high accuracy. However, I have encountered an issue: training and testing with 1-10 objects yields excellent results, but when the scenes contain more than 10 objects the prediction of the center positions deteriorates significantly, which leads to poor clustering and weak pose estimation. Could you please explain the possible reasons behind this? I have also trained and tested with 1-30 objects using the open datasets from your paper, and those results were quite good.

ShuttLeT commented 1 year ago

You can check the following aspects:

  1. Visualize the training-set labels and check whether there are any mistakes in them (a small visualization sketch follows this list).
  2. Check the dataloader used in the training phase; it may only be loading scenes with 1-10 objects.
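
If it helps, here is a rough sketch of the label check in point 1 using open3d; scene_points (the scene cloud) and poses (the ground-truth 4x4 object poses) are placeholder names for whatever your dataset actually stores:

```python
import open3d as o3d

def show_labels(scene_points, poses, axis_size=0.02):
    """Overlay a coordinate frame for every labelled object pose on the scene cloud.

    scene_points: (N, 3) scene point cloud in the camera frame.
    poses: (K, 4, 4) ground-truth object poses in the same frame.
    Frames that do not sit on the objects indicate wrong labels or a
    unit/coordinate-system mismatch.
    """
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(scene_points)
    frames = []
    for T in poses:
        frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=axis_size)
        frame.transform(T)  # move the frame to the labelled pose
        frames.append(frame)
    o3d.visualization.draw_geometries([pcd] + frames)
```
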
s2137127 commented 1 year ago

Hello, I have the following two questions that I would like to ask:

  1. The formula provided by the IPA dataset for converting depth maps to point clouds is: X_cs = (u_s - c_x) * Z_cs / f_x, Y_cs = (v_s - c_y) * Z_cs / f_y. However, in the generate_train_dataset.py script you provide, the formula used is: X_cs = -(u_s - c_x) * Z_cs / f_x, Y_cs = -(v_s - c_y) * Z_cs / f_y. Could you please explain the reason for the additional negative sign?
  2. I've noticed that the coordinate systems of Blender's camera and the IPA dataset's camera seem to have different orientations, which results in a discrepancy between the poses visualized from Blender's output and the point cloud. I have therefore changed the depth-to-point-cloud conversion back to X_cs = (u_s - c_x) * Z_cs / f_x, Y_cs = (v_s - c_y) * Z_cs / f_y, which aligns the visualized poses with the point cloud. However, the training results are still unsatisfactory (the eval distance only goes down to 50, the rotation loss bottoms out at 22, and the translation loss at 13). Besides the X_cs and Y_cs formulas, which other parts of the code might need adjustment if my coordinate system differs from that of the IPA dataset? (A back-projection sketch with an explicit sign toggle follows this list for reference.)
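
For what it is worth, a minimal back-projection sketch of the pinhole formulas quoted above; sign_x and sign_y are an explicit toggle added here for the axis-convention difference (set them to -1.0 to reproduce the negated variant used in generate_train_dataset.py, +1.0 for the form the IPA documentation states, which is also the one that aligned the Blender data above):

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, sign_x=1.0, sign_y=1.0):
    """Back-project a depth map (H, W) into an (H*W, 3) point cloud.

    X = sign_x * (u - cx) * Z / fx
    Y = sign_y * (v - cy) * Z / fy
    Z = depth
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    z = depth
    x = sign_x * (u - cx) * z / fx
    y = sign_y * (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```
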
ShuttLeT commented 1 year ago

  1. If the training loss is small but the eval loss is large, it may indicate that the coordinate systems of the training set and the test set are different. Try visualizing them for inspection.
  2. If the training loss and the eval loss are both large, it indicates that the labels still have problems.
  3. Check whether the translation labels and the point clouds are both converted to mm during training; that may help.

s2137127 commented 1 year ago

Hello, I am currently using pybullet to generate a dataset and train a model. The model's accuracy on stacked synthetic images is quite high. However, on real images, even slight stacking leads to poor results. When I visualize the movement of each point to its predicted center position, I notice that the sets of points cannot be completely separated. If the objects are not stacked, the recognition is somewhat acceptable; at least the points for each object cluster in their respective regions. What could be the reasons for the model's inability to handle two stacked objects accurately?

ShuttLeT commented 1 year ago

You can try adjusting the bandwidth and min_bin_freq parameters of the mean shift clustering.
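
In case it is useful, here is what tuning those two knobs looks like with scikit-learn's MeanShift, which exposes parameters of the same names; the input is assumed to be the (N, 3) array of per-point predicted object centers, and the values shown are placeholders to tune, not recommendations:

```python
from sklearn.cluster import MeanShift

def cluster_predicted_centers(centers, bandwidth=0.02, min_bin_freq=10):
    """Group per-point predicted object centers into object instances.

    A smaller bandwidth separates nearby (stacked) instances more aggressively;
    a larger min_bin_freq (with bin_seeding=True) suppresses spurious clusters
    seeded by a few noisy points.
    """
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, min_bin_freq=min_bin_freq)
    labels = ms.fit_predict(centers)
    return labels, ms.cluster_centers_
```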

s2137127 commented 1 year ago

What I mean is that the predicted centroids on the virtual data are well separated, but for stacked objects in the real data the predicted centroids cannot be completely separated. Do you perform any preprocessing on the real images before testing?

ShuttLeT commented 1 year ago

  1. Check whether there is a significant domain gap between the real and synthetic datasets; a domain gap will diminish the performance of the network on the real dataset.
  2. You can use domain randomization on the simulated dataset, e.g., adding noise, to improve the network's performance on the real dataset (a small noise-augmentation sketch follows this list).
  3. You can apply preprocessing to the real data, e.g., background subtraction and filtering.
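
As a small illustration of the kind of domain randomization meant in point 2, a sketch (with placeholder noise scales to match your own sensor, not values from the paper) that perturbs a synthetic scene cloud with Gaussian jitter and random point dropout:

```python
import numpy as np

def randomize_pointcloud(points, jitter_std=0.5, dropout_ratio=0.1, rng=None):
    """Apply simple domain randomization to a synthetic scene point cloud.

    points: (N, 3) array in the unit used for training (e.g. mm).
    Gaussian jitter imitates depth-sensor noise; random dropout imitates
    missing returns on shiny or occluded surfaces.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = points + rng.normal(scale=jitter_std, size=points.shape)
    keep = rng.random(len(noisy)) > dropout_ratio
    return noisy[keep]
```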