sgawalsh / stvNet

6D pose estimation algorithm based on pvNet

Question: Documentation #2

Closed: Ademord closed this issue 12 months ago

Ademord commented 4 years ago

Hello, I was wondering if you could provide a bit more documentation on how to train on your own dataset. Say I want to use this project for getting the 6D pose of chairs. Could you clarify:

Loved your project and your youtube series. I'm a huge fan and I am excited to learn from your steps!

sgawalsh commented 4 years ago

Good question! I'll answer here and also add a section to the readme on this.

The neural net model is trained to detect either the 2d coordinates of a set of 3d object keypoints on an image, the pixels associated with an object of interest, or both. The functions used to generate the target data for the neural nets are found in the data.py file (coordsTrainingGenerator, classTrainingGenerator, and combinedTrainingGenerator).

These functions read data from the LINEMOD dataset, one of several datasets used in academic work on 6D pose estimation. I show the folder and talk about the data format in this video in the youtube series, but did not include the folder in this repo due to size constraints. The dataset contains a folder for each object of interest; within that folder there is a JPEGImages folder, a labels folder, and a mask folder. JPEGImages contains the RGB images, which are converted to numpy arrays and used as the input data for the neural net.

The mask folder contains a corresponding set of images made up of black pixels (not associated with the object of interest) and white pixels (associated with the object of interest). From each mask, an (H x W x 1) array is generated indicating whether each pixel belongs to the object of interest, which is used as the target data for the class and combined generators.
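As a rough sketch of that conversion (the helper name and threshold here are illustrative, not the exact code in data.py):

```python
import numpy as np
from PIL import Image

def loadMaskTarget(maskPath):
    # Read the mask image and reduce it to a single grayscale channel
    mask = np.array(Image.open(maskPath).convert("L"))
    # White pixels (object) become 1, black pixels (background) become 0
    target = (mask > 127).astype(np.float32)
    # Add a trailing channel axis to get the (H x W x 1) target shape
    return target[..., np.newaxis]
```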

The labels folder contains a corresponding .txt file for each image in JPEGImages, which gives the pixel locations of the 9 bounding box keypoints. The format of these files is {object classification tag} {x1} {y1} {x2} {y2} ... {x9} {y9}, where the object classification tag denotes the object associated with the coordinates. Each x or y value is a number between 0 and 1 giving the relative coordinate on the image (e.g. if x1 and y1 are .1 and .5 on a 640x480 image, the keypoint is located at pixel (64, 240)). In all generators we generate a modelMask array, and for each pixel belonging to the object of interest we calculate a unit vector from the pixel to each 2d keypoint. The end result is an (H x W x 18) array containing a set of 9 unit vectors for each pixel that belongs to the object.
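A minimal sketch of those two steps (helper names are illustrative; the real generators in data.py are structured differently):

```python
import numpy as np

def loadKeypoints(labelPath, width, height):
    # Label format: {class tag} {x1} {y1} ... {x9} {y9}, relative coordinates
    values = open(labelPath).read().split()
    coords = np.array(values[1:19], dtype=np.float32).reshape(9, 2)
    # Scale relative [0, 1] coordinates up to pixel coordinates
    return coords * np.array([width, height], dtype=np.float32)

def unitVectorField(modelMask, keypoints):
    # modelMask: (H, W) boolean array of object pixels
    # keypoints: (9, 2) array of keypoint pixel locations as (x, y)
    h, w = modelMask.shape
    field = np.zeros((h, w, 18), dtype=np.float32)
    ys, xs = np.nonzero(modelMask)
    for i, (kx, ky) in enumerate(keypoints):
        dx, dy = kx - xs, ky - ys
        norm = np.sqrt(dx ** 2 + dy ** 2) + 1e-8  # avoid division by zero
        field[ys, xs, 2 * i] = dx / norm
        field[ys, xs, 2 * i + 1] = dy / norm
    return field
```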

So to answer your question: this project could be used on any object, given a dataset of the same format for that object. That is: a set of photos of the object of interest, a corresponding set of object masks identifying which pixels belong to the object, and a corresponding set of .txt files giving the 2d pixel locations of a set of 3d keypoints.
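Concretely, for a new "chair" class the object folder would look something like this (the folder name is just an example):

```
chair/
├── JPEGImages/   # RGB input images
├── mask/         # black/white object masks, one per image
└── labels/       # one keypoint .txt file per image
```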

To train a model on this data, in models.py you would add a model from modelsDict to the modelSets list variable, specifying the modelSet class argument as your object of interest; this argument determines the object folder used during the training process. You would then call the trainModels function on the modelSets variable to train the model on the specified data. This process outputs and records loss and accuracy metrics, which are saved in stvNet\models\history; the data can be displayed or loaded by calling loadHistories or plotHistories with the same modelSets variable used to train the model.
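As a rough usage sketch (the "stvNetNormal" key and the exact modelSet constructor signature are assumptions here; check models.py for the real names):

```python
import models

# Hypothetical modelsDict key; use any key actually present in models.modelsDict.
# The second argument is the object class, which selects the object folder.
mySets = [models.modelSet("stvNetNormal", "chair")]

models.trainModels(mySets)    # trains on the "chair" data, records metrics
models.plotHistories(mySets)  # charts the recorded loss/accuracy histories
```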

Alternatively, you would not necessarily have to follow the structure described above; any data format that lets you provide the training and target data would be fine, as long as you also add generators to format the data correctly.
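For instance, a replacement generator only has to yield (input, target) batches endlessly; here is a hedged skeleton, assuming the images are plain files on disk and the targets have been precomputed and saved as .npy arrays:

```python
import numpy as np
from PIL import Image

def myTrainingGenerator(imagePaths, targetPaths, batchSize=4):
    # Loop forever, yielding (input, target) batches; any data source works
    # as long as the yielded shapes match what the chosen model expects
    while True:
        for i in range(0, len(imagePaths), batchSize):
            images = np.stack([
                np.array(Image.open(p), dtype=np.float32) / 255.0
                for p in imagePaths[i:i + batchSize]
            ])
            targets = np.stack([np.load(p) for p in targetPaths[i:i + batchSize]])
            yield images, targets
```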

Ademord commented 4 years ago

Hello! Thank you so much for taking the time to get back to me, I appreciate it a lot. So if I understood correctly, to train on a new dataset, I must:

  1. get JPEGs
  2. make a mask for labeling the pixels in the image with the object in question --- Q2.1: do I use some tool for making these masks? --- Q2.2: do I need the ritual ground symbols that the dataset has around the objects (white and black printed symbols surrounding the "Training area")?
  3. make the corresponding labels with the 2d coordinates for the 3d positioning of the bounding box --- Q3.1: do I use some tool to do this? Otherwise, how do I find the x,y for each image?
  4. Here is where I start to get confused... once I have the dataset structured as above: "I add the model from modelsDict to the modelSets variable" (loses the thread of thought)... I think some assumptions are being made here, and I don't know where in the code I must be or how I get to this point. So could we maybe rephrase this paragraph please, so that I understand what to do once I have the dataset properly formatted?:

    To train a model on this data, in models.py you would add a model from modelsDict to the modelSets list variable, specifying the modelSet class argument as your object of interest; this argument determines the object folder used during the training process. You would then call the trainModels function on the modelSets variable to train the model on the specified data. This process outputs and records loss and accuracy metrics, which are saved in stvNet\models\history; the data can be displayed or loaded by calling loadHistories or plotHistories with the same modelSets variable used to train the model.

Q5. Also, I have been trying to look into "linemod ros" online, and the documentation is 7 years old; apparently "openni" doesn't work with melodic, which is what I started using... so I cannot run the tutorials here. --> Must I somehow use kinetic? My setup right now is a dockerized ros environment (melodic), and I connect to my cameras (RealSense cams!) through the host's xserver (drivers on host).

Again, thanks for taking your time, your work is amazing 👍

sgawalsh commented 4 years ago

1: do I use some tool for making these masks?

I used the pre-made masks provided in the dataset, so I don't have any recommendations on how to make your own custom masks. I don't see a way to automate the generation of a new dataset, as it would require a model already capable of identifying the information of interest, which would itself need to have been trained on an existing dataset, leading to a sort of chicken-and-egg situation. It sounds like your goal is to use an original dataset, so I think this would require manual annotation.

2: do I need the ritual ground symbols that the dataset has around the objects (white and black printed symbols surrounding the "Training area")?

No, these shouldn't be necessary.

make the corresponding labels with the 2d coordinates for the 3d positioning of the bounding box --- Q3.1: do I use some tool to do this? Otherwise, how do I find the x,y for each image?

The most straightforward way would be to manually annotate the files; similar issues apply here as with generating the mask data.

So could we maybe rephrase this paragraph please, so that I understand what to do once I have the dataset properly formatted?

This process begins on line 510 of the models.py file. modelsDict is a dictionary containing key-value pairs of the model name and a modelDictVal class object. The object is defined here and simply collects a set of model specifications and hyperparameters. This dictionary is used as a global variable, and is accessed whenever the models need to be loaded during the training and evaluation functions.

There is a separate modelSet class, which just associates a modelName with the 'class' of the model (in this case, the object of interest in the linemod dataset that we want to train our model to locate). The modelName string should match one of the modelsDict keys, as the modelSet.name attribute is used to load a modelDictVal entry during the aforementioned training and evaluation functions. The modelSets variable is just a list of modelSet objects; the functions listed from line 539 onward all iterate over an input argument of this form, which was useful for training and comparing multiple models at once.
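In rough structural terms it looks something like this (attribute names other than modelSet.name are illustrative, not the literal definitions; see models.py around line 510 for the real code):

```python
# Illustrative shapes only, not the actual class definitions
class modelDictVal:
    def __init__(self, model, lossFunction, learningRate):
        # Bundles a model architecture with its training hyperparameters
        self.model = model
        self.lossFunction = lossFunction
        self.learningRate = learningRate

class modelSet:
    def __init__(self, name, modelClass):
        self.name = name              # must match a key in modelsDict
        self.modelClass = modelClass  # LINEMOD object folder, e.g. "cat"

# The training and evaluation functions iterate over a list of modelSet
# objects, looking up each one's specs via modelsDict[mSet.name]
```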

Q5. Also, I have been trying to look into "linemod ros" online...

I am not a ROS expert by any means, and this project does not use ROS, so I don't have an answer on this. In my personal experience I had similar compatibility difficulties getting my ROS environment working with older libraries during a separate project, and was finally able to get it working by downgrading some of the packages, the most significant being a downgrade from ROS 2 to ROS. Generally, though, I would recommend using the environment most similar to the original.

Ademord commented 4 years ago

Thanks a lot for getting back to me :+1:! I will look into this then, and generate my own dataset as required (chairs). The manual mask-making isn't a problem for me, but "make the corresponding labels with the 2d coordinates for the 3d positioning of the bounding box" is a big 'oof'. I need to think about how to manually get the x,y for the points I mark :/ I'll get back to you on further progress/blockers!

sgawalsh commented 4 years ago

I'd recommend looking into the cv2.projectPoints function: given a rotation vector and translation vector between the camera and object, plus the camera intrinsics, you can generate the 2d keypoints of any given 3d model. I used it in this project to generate a set of labels for an alternate set of 3d keypoints. I feel this might be more reliable than trying to label the keypoints manually.
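For example (the keypoints, pose, and intrinsics below are placeholders; substitute your camera calibration and your model's 3d keypoints):

```python
import numpy as np
import cv2

# 3d keypoints of the object model, in object coordinates (placeholder values)
objectPoints = np.array([[0.0, 0.0, 0.0],
                         [0.1, 0.0, 0.0],
                         [0.0, 0.1, 0.0]], dtype=np.float32)

# Rotation (as a Rodrigues vector) and translation from object to camera frame
rvec = np.zeros((3, 1), dtype=np.float32)
tvec = np.array([[0.0], [0.0], [1.0]], dtype=np.float32)

# Pinhole camera intrinsics; no lens distortion assumed (hence distCoeffs=None)
cameraMatrix = np.array([[572.4, 0.0, 325.3],
                         [0.0, 573.6, 242.0],
                         [0.0, 0.0, 1.0]], dtype=np.float32)

imagePoints, _ = cv2.projectPoints(objectPoints, rvec, tvec, cameraMatrix, None)
print(imagePoints.reshape(-1, 2))  # 2d pixel location of each keypoint
```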