What is wrong with my model? + summary & solutions to F.A.Q

microsoft / singleshotpose

This research project implements a real-time object detection and pose estimation method as described in the paper, Tekin et al. "Real-Time Seamless Single Shot 6D Object Pose Prediction", CVPR 2018. (https://arxiv.org/abs/1711.08848).

MIT License

721 stars 214 forks source link

What is wrong with my model? + summary & solutions to F.A.Q #75

Closed jgcbrouns closed 5 years ago

jgcbrouns commented 5 years ago

Hi everyone!

I was already having a discussions about my issues in issue-68, but decided to open a separate ticket anyway for completeness towards other people. As of now. I am clueless what is wrong with my model. My workflow and solved issues are as follows:

Using Unity3D, I created a data-set with around 240 training-images and 60 test-images for a custom model: a cube with 4 different colors (for testing).

I came across multiple issues regarding the following:

I set my camera calibration intrinsic parameters as follows:
```
K = np.zeros((3, 3), dtype='float64')
K[0, 0], K[0, 2] = 320, 320
K[1, 1], K[1, 2] = 320, 240
K[2, 2] = 1.
```
Annotated labels are created automatically in Unity3D of which I expect there to be no camera distortion (for the intrinsic camera calibration). The original author's of the LINEMOD dataset use a Kinect camera that does have such a camera distortion. This internal camera calibration is necessary for among others the PnP-algorithm.
Initially, I had many problems with creating my .PLY files:
- Scaling them into meters. I think I managed this my .PLY file here:
- Apparently models have to be centered at their volumetric point of mass at the origin (0,0,0). I centered my objects at the bottom-surface, giving them an unwanted offset.
Bounding-box coordinates need to be in a specific order in their respective annotated label file. If this is not the case, the bounding-box that gets generated form the .PLY file and predictions from singleshotpose will be distorted completely. There are multiple issues on Github about this (among which issue-49
- Correct order of boundingbox coordinates in label file (green border):
- Incorrect order of boundingbox coordinates in label file:

Find here, an example of an image and a label file from my trainings-set.

HOWEVER.

I am still not obtaining correct results and I am unsure about how long to train my models for. In the implementation 700epochs are stated, yet if I train for 4000 epochs, my results are still not good:

after4000+epochs

How many epochs should a new object be trained for? NOTE: I am using benchvise/init.weights as my initial weights for my new model of the custom data-set. This while my loss function goes down properly, but my accuracy measurements stay at 0%:

4000+epochs_terminal

Could there still be a problem with how I created the annotation files, camera intrinsic parameters or .PLY model. Or could there be another problem that I am not considering?

@btekin Would it be an idea to add a F.A.Q section to the README, using my findings? I think the section about training on a custom data-set could use a lot more elaboration.

Moreover, I am curious as to what people are doing with singleshotpose. Anyone experimenting with some interesting use-cases?

Many thanks for anyone that can help!

jgcbrouns commented 5 years ago

I managed to get my "Acc using x vx 3D transformation up to 20 - 30%" using a more appropriate diameter in the .data file. I calculated this diameter using a piece of code that pair-wise compares all vertices and takes the pair with the greatest distance.

Unfortunately I am not able to get any accuracy at the other 2 metrics (Acc using 5px 2D projection and Acc using 5 cm 5 degree metric). It seems that the model is unable to learn from my data? Might the data be too monotonous? If so I would expect to at least see some accuracy metrics go up (because the model would just over-fit).

jgcbrouns commented 5 years ago

Too bad that nobody seems to be able (or willing) to help. I have a little update:

Instead of 240 training-images, I now used 4000 training-images.
I also added random-background colors and added more colors to the cube (since I thought that the model was unable to learn about my cube object due to insuccifient features in texture).

My accuracy after 1000 epochs is: 56492513_2003465433296047_4332711249553915904_n

56956445_416356655794769_8403792676486381568_n

What I find weird is that LINEMOD has only ±200 images per class for training (and ±1000 for testing). Hence, their training-set is very small, yet @btekin was able to reach high accuracy >90%. @btekin, I would love to hear your input about this. Is this due to pre-training? Can you tell us a bit more about how long your pretraining took and what parameters you used in your training/test set?

Sidenote: I do believe that the trainings-images of LINEMOD are selected in such a way that together they represent a good 3D presentation of an object. In other words, every image in the trainigsset features the object from a unique POV (unique camera angle, position etc). Could this be of influence on why my model cannot seem to learn with only 240 images? On the other hand, I do expect the model to at least learn something with a small datset, when increasing the amount of epochs to train for; the model would overfit but still should learn and return high accuracies on the training-set.

Any thoughts?

btekin commented 5 years ago

Hello, thanks for your interest in our code and sorry for the late reply, I didn't have the time to reply as I had to deal with my other work related projects.

We follow the same training/test splits with earlier work, e.g. that of the BB8 paper by Rad & Lepetit, ICCV'17. Indeed, the training examples are sampled such that they cover a wide variety of viewpoints around the object. Having a more representative training set should, in principle, increase the accuracy on test examples.

Instead of using initialization weights for another object, you could also pretrain the network on the same object by setting the regularization parameter for the confidence loss to zero, as explained in the readme file. See also the discussion in the paper and in #79 why such pretraining could be useful.

About your findings: We already mention what order should be used for the keypoints in this link along with a step-by-step guide and this was also discussed in the duplicate issue #68. Custom datasets have different camera intrinsics matrix and might have different object models/scales. The scale of the object model should certainly be consistent and be set appropriately for a new dataset.

jgcbrouns commented 5 years ago

Hello @btekin. Thank you for your answer, yet I would like to ask you to be more specific.

The LINEMOD dataset uses ±200 images for training (while having 1200 images per class: 1000 remain for testing). How can ANY model learn from only 200 images? Aren't NNs like YOLO supposed to have thousands of images per class?

My model cannot learn anything when using 200 images in my custom data-set; loss-functions go down (<0 for x,y loss and confidence) yet accuracy metrics stay at 0.
When I use 5000 images per class, accuracy maxes out at 20% (note that I am using synthetic data)

_Sidenotes:

I am using lower batch-size of 8
I am not using segmentation masks

You speak about pre-training being necessary. Can you be more elaborate about this? how long did you train for? How many images per class? 200 again? How many epochs.

At this point I am distraught and about to give up...

MohamadJaber1 commented 5 years ago

Same problem here. I raised an issue #85 mentioning that my model also is not learning. And the thing is that, while validating, I can only use the first trained model which is after epoch 11. Even if my model keeps on training, it never updates the weights means that its never getting any better. It just save the summary in costs.npz file

I used a very good source for generating my synthetic data, I have masked images too that are corresponding to my RGB images, correct intrinsics, precise labelling files, exact diam value and rightly scaled .ply file. I am using 1170 images with 65 different orientation of the object and different background and just using one class (object). I also previously tried the 15% training/85% testing ratio split but also didn't learn.

SSP_revised_1

SSP_fail

Note: The maximum number of epochs I have reached was 177, but non of those weights was saved and the one at epoch 11 was only saved and never updated.

After comparing my inputs with that of ape.data and finding everything is matching yet not learning for custom data, I am also about to give up...

jgcbrouns commented 5 years ago

Hi @MohamadJaber1

Your green ground truth box does seem to be an incorrect bounding-box though. Maybe there still is something wrong with the labeling for your case? If you see my green ground-truth bounding boxes, they are exactly matching the object. It is indeed the case that the code is written in such a way that it will not save weights when there is no increase in accuracy. You can add a line of code to make it save weights after every 10 epochs or so. I tried this as well, but it is not helpful; if the accuracy does not increase, you can save weights all you want, but they are useless. [edit]: @MohamadJaber1. What batch-size are you using? According to your post: (.cfg file changed to batch size = 4 and subdivision = 4 as it was showing that CUDA is out of memory)

I have just now tried to train a model on 16.000+ images. I stopped the training at around 310 epochs, because I don't have the time to continue training this model, since I am only using a Nvidia 1080ti. Interesting enough, the loss function is pretty low, yet accuracy does not rise.

At this point I think that in general, 1000 images for training should do the trick. Because I can only use a batch-size of max 8, I think that I have to train for far more epochs than proposed in the code (700 epochs). It would be nice to hear @btekin input 😄

MohamadJaber1 commented 5 years ago

Hi @jgcbrouns, good to hear from you and hope to hear also from @btekin soon 😄

Thats true, even my ground truth labels aren't matching which is quite strange. I forked @juanmed singleshotpose as he used the same source for generating the data I used and he also created a script called _ndds_datasetcreator.py that inputs your 3D bounding box configurations and output a label text file that is compatible with singleshotpose. You can also visualize the points (labels) at each of your images. If you can write the way you created your label files.

For saving the model, I know I can modify the script by saving whatever but as you said it is useless if the model isn't getting any better. But my question was, why from the first place my model isn't getting any better? Why accuracy is 0?

By CUDA out of memory, I meant I reduced the batch size and subdivision to 4 as the yolo-pose-pre.cfg was having a batch size of 32.

I think that the original singleshotpose using LINEMOD dataset was trained over than 700 epochs in way that they kept on updating the initialization weights with the trained model and trained allover again to improve accuracy or changed the 700 epochs to any other value in thousands.

jgcbrouns commented 5 years ago

I also think that # epochs was way in the thousands.

"But my question was, why from the first place my model isn't getting any better? Why accuracy is 0?"

For your case it is pretty straightforward I think: fix the bounding box corners (check for ground truth box correctness) and your model will learn at least something (like mine). Your other settings seem to be correct: .ply file, diameter of ply vertices, camera intrinsics. Another tip: check if your .ply file has multiple vertex points. I see that your object is a lego-block. A cube in general can be modeled as a parametric 3d model. The linemod objects all have many vertices and edges in their model. @btekin uses the individual vertices to calculate accuracies against. In my hypothesis, more vertices adds more change for higher accuracy. What you could try is add more vertices in your model via blender:

open 3d .ply model -> select all vertices and edges -> go into edit mode -> press 'w' --> click 'subdivide'

But again, before that, fix the ground truth boundingbox label coordinates :P

I looked at the Nvidia data generator, but decided to create my own tool to label data in a Unity environment. Its more straightforward than the Nvidia tool. I rule out any mistakes in the way I generate the data and the labels since the ground-truths are correct. If you want, you can try my tool as well.

btekin commented 5 years ago

@jgcbrouns @MohamadJabar1 Thank you for your kind feedbacks 😄

Hello, a maximum number of 700 epochs were set for training but the model always converged much earlier than that.
The model was trained and validated on the LINEMOD dataset. Depending on your custom data, training might proceed in a different way and might require different number of epochs, learning rates, batch sizes learning schedule etc. I would be interested in looking at your own custom synthetic datasets if you could share it to better understand what problems you are having.
For LINEMOD, we use the standard training/test splits and apply extensive data augmentation by changing the background with segmentation masks, random scaling/translation, etc. Without using segmentation masks the accuracy might not be good enough because of the lack of generalization ability. To increase generalization, you could change the background of your images using segmentation masks or increase the number of your training samples.
Please also check #84 to see if your problem comes from this. @jgcbrouns it seems that your corner predictions seem accurate, however solvepnp method of the opencv version you are using might be returning inaccurate results. Could you visualize your predictions before pnp and let me know if they are accurate?
@MohamadJaber1 Symmetric objects with uniform texture generally brings additional challenges because of the pose ambiguity. Could you also try with non-symmetric or well textured objects and let me know how it performs.

jgcbrouns commented 5 years ago

hi @btekin . Thanks for your respons!

I think that the images that I posted above visualize the individual corners before PnP, straight from the predictions (red) and ground truths (green): ax.scatter(corners3D[0], corners3D[1], -corners3D[2], zdir='z', c= 'red') ax.scatter(vertices[0], vertices[1], -vertices[2], zdir='z', c= 'red') where vertices and corners3D is: vertices = np.c_[np.array(mesh.vertices), np.ones((len(mesh.vertices), 1))].transpose() corners3D = get_3D_corners(vertices)

It would be AWESOME if you could take a look at my dataset!! I am breaking my head about this every day 😅 🔫

Dataset: Google Drive link ~8MB, 300 images
cube.data
yolo-pose.cfg

[edit] I just looked at the ransac replacement algo for OpenCV right now in validation procedure. There is unfortunately no difference. I will now attempt train a small model with it.

MohamadJaber1 commented 5 years ago

Thank you both for your replies @jgcbrouns @btekin for your reply. I will follow your comments and try them on Tuesday - Wish you both Happy Easter 😄)

@jgcbrouns Reason why I am using NDDS is that I want later to validate my model with an image taken from a robot software environment and NDDS provided me with all the necessities.

@btekin It would be really great if you can take a look at our dataset (I will provide a sample)

btekin commented 5 years ago

@jgcbrouns Thank you for providing your dataset. After inspecting examples from your dataset, I would suggest you to do the two following things and see if they help:

Reduce color data augmentation. This could be useful because your object does not have any texture and the only cue that is useful to predict the pose is color. When you apply color data augmentation (changing hue, saturation values etc.) during training, the model has difficulty in distinguishing between different colors and hence estimating the pose. The current setting for the color data augmentation could be too high for your data. You can change the values for color data augmentation at the following lines: https://github.com/Microsoft/singleshotpose/blob/master/dataset.py#L67:L69
Randomly change the background. This could be useful as you have a small number of training examples and although you use different backgrounds for different training images, having static backgrounds for the same examples might result in overfitting.

Hope these pointers might help your problem. Please let me know how it goes.

@MohamadJaber1 As @jgcbrouns pointed out, I think you would need to fix the bounding box label coordinates in order for the network to start learning. If you provide a sample, I could also take a look at your data.

jgcbrouns commented 5 years ago

Hi @btekin and @MohamadJaber1

Thank you @btekin for looking at my dataset. Coincidentally I managed to gain some results this morning by converting my images (which are .png) to .jpg. I suspected that this could be of influence since .png can have transparent pixels in its images. Moreover I extended my tool with a mask over objects. Every object now has a mask as well. Either one of these changes fixed the problem for me.

MohamadJaber1 commented 5 years ago

@jgcbrouns good to hear that, I will also convert the types of my images to jpg later I tried your dataset and checked the ground truth labels - they were matching fine! Cube_gt At the cmd window, I am writing python train.py cfg/cube.data cfg/yolo-pose.cfg backup/benchvise/init.weights

But the problem is that after epoch 11, the model.weights is saved once and never updated. I trained the model for over than 140 epochs but even though never updated. cube_training

@btekin Thank you so much for offering this help.

Google driver link 400 RGB images, 400 masked images, 400 labels files, Lego.ply, test.txt, train.txt, training_range.txt (.json files from NDDS). So originally the images are **18 same pose images***65 different orientations = 1170 total images.
Lego.data
yolo-pose.cfg

Please let me know what can you observe and how can I solve it Quick Update: I think that the labellings are right, but the problem might be from there order, from label_file_creation.md, point 2 we can know the order, but it depends up to the convention of the coordinate system singleshotpose is expecting. I suspected that (x:- right, y:- up, z:- front)

jgcbrouns commented 5 years ago

Hi @MohamadJaber1

0% acuracies after 100 epochs implies that there is something wrong for sure. The model is supposed to converge rather quickly (on average at epoch #30, the model starts showing accuracy increases. Before that, its stays at 0%).

Here is my final dataset including masks and .jpg files (converted from .png): Google Drive link You can try training with this dataset. It should work 😄

I'll take a look at your dataset now. P.S. @MohamadJaber1 - If you want, we can stay in touch. You can hit me up at jeroen.brouns@philips.com

jgcbrouns commented 5 years ago

About your quick update: labels are indeed ordered with a specific coordinate system in mind. Unity (where I create my dataset has a different coordinate system than blender for example). The order is important because in @btekin code, this order gets interpreted and adhered to: link

You can validate if your order is correct:

Open your .PLY model in blender via import.
Your first x,y coordinate in your label file is the center, subsequently it should be the red dot as visible in the picture. This first cornercoordinate has the minimum values for x, y z. In blender you can see that this indeed matches.

Could you upload your dataset as 1 zip file that includes everything? .PLY model, generated (normalized) labels, images, masks etc. That makes it easier for me to test it.

MohamadJaber1 commented 5 years ago

Hi @jgcbrouns Thank you for this clarification, I am so glad to know that your dataset is working with the model. This gives me hope and motivation to dig more and fix mine also :smile:

Yea sure, it would be nice to contact you. Expect a mail soon :sunglasses:

I am currently training on your dataset to see if it will converge with me. Also, I will go few steps backwards to check all my own custom data (mostly the labels). For the zipped version of my data, please find it here Google drive link for my 1170 images Thank you very much

(My .PLY apparently needed to be scaled)

MohamadJaber1 commented 5 years ago

Hi @jgcbrouns Your dataset is learning with the model pretty good :smile: cube_trained_well This was trained for almost 60 epochs and its already showing good results :+1:

danieldimit commented 5 years ago

@jgcbrouns thank you for all the info you've shared in this thread. I am currently trying to figure out what's wrong with my dataset (or dataset generating tool). I am trying to learn on the dataset you've provided in this thread. After that I would try to generate the same dataset with the tool I am using to see if it's working as expected.

Could you please provide the texture that you've used for your object and also if possible your data generating tool?