Open KamiCalcium opened 3 years ago
It depends on what your dataset looks like. Could you give me some example images? Does it contain motion capture studio images? In-the-wild images? Synthetic images?
Thanks for replying! It is from the Cityscapes dataset: https://www.cityscapes-dataset.com/ Some researchers annotated all person bounding boxes and released it as the CityPersons dataset: https://github.com/cvgroup-njust/CityPersons
All images were shot in real cities.
I see. I think option 1 would be the best one. By the way, does this dataset contain 3D pose annotations? If it contains only 2D, fine-tuning a model on this dataset alone would not work, as no 3D supervision can be applied.
It does not, but I built a kind of self-supervised pipeline: first I use the pre-trained model you provided to predict all the 3D poses in that dataset, and then I use an annotation tool to correct the bad/outlier 3D poses. Now I use those corrected 3D poses as the ground truth. And thank you for your suggestion.
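A minimal sketch of the outlier-filtering step in such a pseudo-labeling pipeline: reject predicted poses whose bone lengths are implausible before promoting them to ground truth. The function name, the bone-pair list, and the millimeter thresholds are all illustrative assumptions, not from the repo; tune them for your skeleton.

```python
import numpy as np

def filter_pseudo_labels(poses_3d, bone_pairs, min_mm=50.0, max_mm=600.0):
    """Keep only predicted 3D poses whose bone lengths are plausible.

    poses_3d:   (N, J, 3) array of predicted joint coordinates in millimeters.
    bone_pairs: list of (parent, child) joint index pairs (skeleton-dependent).
    Returns the indices of poses that pass the sanity check; the rest are
    candidates for manual correction or removal.
    """
    keep = []
    for i, pose in enumerate(poses_3d):
        lengths = [np.linalg.norm(pose[a] - pose[b]) for a, b in bone_pairs]
        if all(min_mm <= l <= max_mm for l in lengths):
            keep.append(i)
    return keep
```

Automated checks like this only catch gross failures; manual inspection of the surviving pseudo-labels is still worthwhile before training on them.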
By the way, I have some issues related to the code but I don't want to open another issue so I'm asking here.
I did option 1, retraining everything together. However, I sometimes get a NaN loss for some of the samples in my dataset. When debugging, I found that for those NaN losses, the corresponding joint_vis is not np.ones. Now I'm confused about what joint_vis is:
In https://github.com/mks0601/3DMPPE_POSENET_RELEASE/blob/3f92ebaef214a0eb1574b7265e836456fbf3508a/data/Human36M/Human36M.py#L127, all the joint_vis entries are set to np.ones (this is from Human36M, but I did the same for my dataset). However, in https://github.com/mks0601/3DMPPE_POSENET_RELEASE/blob/3f92ebaef214a0eb1574b7265e836456fbf3508a/data/dataset.py#L74, it is changed when the data is loaded. What does joint_vis really do?
I actually take a look in each epoch and find the problem:
For the NaN case, I found that not all of the losses are NaN; only some of the 128 samples in the batch are NaN (index 127 in this case). This loss_coord is computed after this line: https://github.com/mks0601/3DMPPE_POSENET_RELEASE/blob/3f92ebaef214a0eb1574b7265e836456fbf3508a/main/train.py#L50
The shape of loss_coord is (128, 18), where 128 is the batch size and 18 is the number of joints. Do you have any idea about this bug? What could be wrong with those samples? I cannot think of a good way to debug this. (I will just ignore the NaN rows for now and let it train; I don't know if that makes sense.)
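One way to implement the "ignore the NaN rows" workaround while also collecting the offending sample indices for inspection. This is a sketch, not repo code; it assumes loss_coord has shape (batch, num_joints) as described above.

```python
import torch

def masked_mean_loss(loss_coord):
    """Average a per-joint loss while dropping rows that contain NaNs.

    loss_coord: tensor of shape (batch, num_joints), e.g. the per-joint
    error before reduction. Rows with any NaN are excluded from the mean;
    their batch indices are returned so the bad samples can be inspected.
    """
    nan_rows = torch.isnan(loss_coord).any(dim=1)          # (batch,) bool mask
    bad_idx = nan_rows.nonzero(as_tuple=True)[0].tolist()  # indices to inspect
    valid = loss_coord[~nan_rows]
    loss = valid.mean() if valid.numel() > 0 else loss_coord.new_zeros(())
    return loss, bad_idx
```

Note that skipping NaN rows only hides the symptom; tracing the returned indices back to the underlying annotations (e.g. a bad joint_vis or corrupted GT) is the real fix.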
joint_vis represents whether a joint is valid or not. If the GT coordinates of a joint are not provided (or the joint is truncated), you can set its joint_vis to zero.
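In other words, joint_vis acts as a mask on the per-joint loss: invalid joints contribute zero, so missing or truncated GT produces no gradient for those entries. A minimal sketch in the spirit of the repo's coordinate loss (not the exact repo code):

```python
import torch

def coord_l1_loss(pred, target, joint_vis):
    """Per-joint L1 loss masked by visibility.

    pred, target: (batch, num_joints, 3) joint coordinates.
    joint_vis:    (batch, num_joints, 1) with 1 for valid joints, 0 otherwise.
    Multiplying by joint_vis zeroes out invalid joints before averaging.
    """
    return (torch.abs(pred - target) * joint_vis).mean()
```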
Hi, it's great work! I'm studying your code but am confused about the unit of the loss. Is it pixels?
x,y: pixel z: discretized meter
Thanks for your reply! I have another question. I found that the L1 distance between joints is used as the loss during training, but at test time the L2 (Euclidean) distance is used. Why do you use different metrics to measure the error between the predicted joint coordinates and the ground truth?
We empirically found that L1 loss works better than L2.
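A quick illustration (not from the repo) of why L1 can train more stably than L2: an outlier joint error grows the L1 loss linearly but the L2 (squared) loss quadratically, so a few bad joints dominate the gradient far less under L1.

```python
import torch

# Per-joint errors for one sample, with a single outlier joint.
err = torch.tensor([1.0, 1.0, 10.0])

l1 = err.abs().mean()   # (1 + 1 + 10) / 3   = 4.0
l2 = (err ** 2).mean()  # (1 + 1 + 100) / 3  = 34.0
```

Under L2 the outlier contributes ~98% of the loss, versus ~83% under L1, which is consistent with the empirical observation that L1 works better here.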
Are the units of the loss in training and testing the same? In training, the joints' x and y coordinates are in the pixel coordinate system. For the z coordinate, the README describing the quantities in "Human36M_subject_joint_3d.json" says the joint coordinates in the world coordinate system are in millimeters, and the z coordinate from the .json file is used directly as the z coordinate of joint_img when the network is trained. May I take it that the units of the loss are x, y in pixels and z in millimeters? In testing, the joints' coordinates are in the camera coordinate system when the error is calculated. May I take it that the unit of the test error is millimeters in x, y, and z? Looking forward to your reply! Thanks!
May I take it that the units of the loss are x, y in pixels and z in millimeters? -> z: I discretize millimeters into the 0~63 heatmap space.
May I take it that the unit of the test error is millimeters in x, y, and z? -> Yes.
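A sketch of how the millimeter-to-heatmap-bin discretization for z could look. The constants mirror typical config values for this kind of setup (a 2000 mm root-relative depth range and depth_dim = 64), but they are assumptions here; check your own cfg (e.g. bbox_3d_shape and depth_dim) for the actual numbers.

```python
def mm_to_depth_bin(z_mm, depth_range_mm=2000.0, depth_dim=64):
    """Map a root-relative depth in millimeters into [0, depth_dim] heatmap space.

    Assumes z_mm is already relative to the root joint and lies within
    +/- depth_range_mm / 2; values outside that range are clamped.
    0 mm (the root's depth) maps to the center bin, depth_dim / 2.
    """
    half = depth_range_mm / 2.0
    z = max(-half, min(half, z_mm))
    return z / half * (depth_dim / 2.0) + depth_dim / 2.0
```

So the training-time z loss is measured in these discretized bins, while the test-time error is converted back to millimeters.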
Thanks very much!
Hi,
I am trying to use PoseNet on my own dataset. More specifically, I used PoseNet trained on Human3.6M and MPII to test on my own dataset and got some preliminary results. I want to improve them further, and the first thing that comes to mind is fine-tuning the network on my dataset. Do you have any suggestions or experience with fine-tuning PoseNet (for example, how many epochs are good, or should I freeze any layers' weights)?
I have three ideas now:
Which do you think makes more sense? Thanks in advance for your time!