jimmyyhwu / pose-interpreter-networks

Real-time robotic object pose estimation with deep learning
MIT License

Questions about the output layer of the pose interpreter network #34

Closed gfdeng closed 1 year ago

gfdeng commented 1 year ago

Hello, after reading your paper, my understanding is that the pose interpreter network outputs the positions and orientations of all 5 objects at once, so the final output of the network needs to be 5×3 positions and 5×4 orientations (taking 5 object types as an example).

But after looking at the code of the pose interpreter network (pose-interpreter-networks/pose_estimation/models.py: lines 68-75), I found that the network actually takes as input the mask of a single object together with the corresponding object id. The network first outputs 5×3 positions and 5×4 orientations, and then selects which set to use as the final prediction according to the object id. In the end-to-end model (pose-interpreter-networks/models.py: lines 45-55), the pose interpreter network runs once for each object the segmentation network outputs.
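
The select-by-object-id logic described above can be sketched as follows. This is a minimal illustration in PyTorch, not the repo's actual code; the class name, feature dimension, and layer names are all hypothetical:

```python
import torch
import torch.nn as nn

class PoseHeadSketch(nn.Module):
    """Sketch of an output layer that predicts one (position, orientation)
    pair per known object type, then keeps only the pair matching the
    input object id. Names and dimensions are illustrative."""

    def __init__(self, feature_dim=64, num_objects=5):
        super().__init__()
        self.num_objects = num_objects
        self.fc_position = nn.Linear(feature_dim, num_objects * 3)
        self.fc_orientation = nn.Linear(feature_dim, num_objects * 4)

    def forward(self, features, object_ids):
        batch = features.size(0)
        # All objects' outputs: (batch, num_objects, 3) and (batch, num_objects, 4)
        positions = self.fc_position(features).view(batch, self.num_objects, 3)
        orientations = self.fc_orientation(features).view(batch, self.num_objects, 4)
        # Select the row corresponding to each sample's object id
        idx = torch.arange(batch)
        position = positions[idx, object_ids]        # (batch, 3)
        orientation = orientations[idx, object_ids]  # (batch, 4)
        # Normalize the quaternion to unit length
        orientation = orientation / orientation.norm(dim=1, keepdim=True)
        return position, orientation
```

So all 5×3 + 5×4 values are computed on every forward pass, but only the 3 + 4 values indexed by the object id are returned.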

When training this network (pose-interpreter-networks/pose_estimation/train.py: lines 111-118), only 3 positions and 4 orientations are compared against the target values, which means the remaining 4×3 positions and 4×4 orientations contribute nothing. (I don't know if my understanding is correct.)
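
A consequence of comparing only the selected set to the target is that gradients flow only into that object's output rows. This toy demonstration (hypothetical tensors, not the repo's training code) shows that the other objects' outputs receive zero gradient for that sample:

```python
import torch

num_objects, object_id = 5, 2
# Stand-in for the network's 5x3 position outputs for one sample
all_positions = torch.randn(num_objects, 3, requires_grad=True)
target = torch.zeros(3)

# Loss compares only the selected object's 3 position values to the target
loss = ((all_positions[object_id] - target) ** 2).mean()
loss.backward()

# all_positions.grad is zero everywhere except row `object_id`:
# the other 4x3 outputs are unused for this sample.
grad_magnitudes = all_positions.grad.abs().sum(dim=1)
```

The unused rows aren't meaningless overall, though: across the training set, each object type's samples train that object's rows.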

If my understanding is correct, why does the final output of the network need to depend on the number of object types? Could it instead be set to output just 3 positions and 4 orientations? If my understanding is wrong, I would appreciate it if you could explain the pose interpreter network.

gfdeng commented 1 year ago

Sorry, I think I understand now: setting the output layer to 5×3 positions and 5×4 orientations gives each object type its own dedicated output neurons, which can ensure better accuracy. (I don't know if my understanding is correct.)

Then I have another two questions:

1. Have you tried setting the output layer to just 3 positions and 4 orientations? Was the error very high, or did some other problem occur?
2. If I want to train on 10 objects (or even more), then I need to set 10×3 positions and 10×4 orientations as outputs. Although there are more neurons, each individual object still only uses 3 positions and 4 orientations. If so, will training become much more difficult (time per object, convergence, etc.)?

jimmyyhwu commented 1 year ago

Having 5 sets of outputs (one per object type) allows us to condition the output on the object type. You could try using just one set of outputs, but it might not work as well. For example, imagine the case where two objects have similar shapes but very different canonical poses.
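
The point about conditioning can be made concrete with a toy contrast (hypothetical code, not from the repo): a single shared output head must map identical masks to identical poses regardless of object type, while per-object heads can still disagree:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(1, 8)  # same features, e.g. two near-identical masks

# One shared set of outputs: 3 position + 4 orientation values
shared_head = nn.Linear(8, 7)
out_a = shared_head(features)  # "object A"
out_b = shared_head(features)  # "object B" -- identical by construction

# One set per object type (2 types here): the object id picks the row,
# so the two objects can receive different canonical poses
per_object_head = nn.Linear(8, 2 * 7)
outs = per_object_head(features).view(2, 7)
out_a2, out_b2 = outs[0], outs[1]
```

With the shared head, `out_a` and `out_b` are necessarily equal; with per-object rows, the network is free to learn a different canonical pose for each type.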

You should be able to train with 10 objects; it would just take longer to converge.

gfdeng commented 1 year ago

Thank you, this helps me a lot.