mysee1989 / TCAE

Self-supervised Representation Learning from Videos for Facial Action Unit Detection
162 stars 27 forks

The training result is weird #13

Open makpia opened 4 years ago

makpia commented 4 years ago

I trained the model using your code with the default settings and changed nothing except the batch_size (64 to 32, due to my GPU's memory). The result after 1900 epochs is really weird. Here are screenshots of the total error and an example of an AU-changed image.

image

image

I cannot tell where the problem is. Should I change the weights of the losses? I noticed that the weights in your code are different from those in your paper.

guman90203 commented 4 years ago

@makpia Sorry, I have a small question: how did you get these output images? I searched my folders and didn't find any output images, but I really need the pictures for a presentation.

makpia commented 4 years ago

@makpia Sorry, I have a small question: how did you get these output images? I searched my folders and didn't find any output images, but I really need the pictures for a presentation.

I generate the output images with torchvision.utils.save_image(). The writer.add_image() call in the original code didn't produce any images on my machine either, neither in a folder nor in TensorBoard. Once you manage to generate output images, I'd like to know whether you run into the same problem as I did.
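For reference, a minimal sketch of how I dump a batch of decoder outputs (the tensor name and file path are just placeholders, not the repo's actual variables):

```python
import torch
from torchvision.utils import save_image

# au_changed stands in for the AU-changed batch produced by the decoder,
# shape (N, 3, H, W); pass normalize=True if the values are not already in [0, 1].
au_changed = torch.rand(8, 3, 128, 128)
save_image(au_changed, "au_changed.png", nrow=4)
```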

mysee1989 commented 4 years ago

There is no problem with the loss weight setting in the original paper. The reconstruction loss weight is 10x the pixel/feature consistency loss weight, and 100x the L1 loss weight.

In the public code, the weight of each loss term is scaled up by 10x, which speeds up convergence.
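Roughly, the weighting described above amounts to something like the sketch below (the variable names are illustrative, not the ones used in the repository):

```python
import torch

# Dummy stand-ins for the individual loss terms; only the relative weights matter:
# reconstruction is 10x the pixel/feature consistency terms and 100x the L1 term.
recon_loss = torch.tensor(0.5)
pixel_consist_loss = torch.tensor(0.3)
feat_consist_loss = torch.tensor(0.2)
l1_loss = torch.tensor(1.0)

w_recon, w_consist, w_l1 = 10.0, 1.0, 0.1
total_loss = (w_recon * recon_loss
              + w_consist * (pixel_consist_loss + feat_consist_loss)
              + w_l1 * l1_loss)
```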

Please check your data processing step carefully. The training images should be detected, aligned, and cropped. The training images you show here look quite strange and have not been processed correctly.

makpia commented 4 years ago

There is no problem with the loss weight setting in the original paper. The reconstruction loss weight is 10x the pixel/feature consistency loss weight, and 100x the L1 loss weight.

In the public code, the weight of each loss term is scaled up by 10x, which speeds up convergence.

Please check your data processing step carefully. The training images should be detected, aligned, and cropped. The training images you show here look quite strange and have not been processed correctly.

Thanks for your reply. I cropped the VoxCeleb2 images according to the txt files from the VoxCeleb website and the instructions in 'https://github.com/cyrta/voxceleb/blob/master/data/v1/voxceleb1_readme.txt'. I didn't use another detector to locate the faces; I cropped the face regions directly from the txt files. Here are some examples of my training images.

image

image

image

VoxCeleb2's cropped region is larger than VoxCeleb1's, but the code will center-crop it, so I think the region is accurate. Can you describe in more detail what is "strange" about the images? I think my data are cropped correctly. About the loss weights: I have now changed the learning rate from 0.001 to 0.0001 and removed the two losses 'w7 * out_img_pose_loss_sort + w8 * out_img_exp_loss_sort', which are not mentioned in your paper. The early result (600 epochs) shows that the model is starting to learn AU and pose, although the output is still twisted, but I am not sure whether the final result will be as good as yours.

image

mysee1989 commented 4 years ago

The training images should be aligned & cropped according to facial landmarks. The training images you show here are not correctly processed.

There is no need to adjust the loss weight currently.
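As a rough illustration, a landmark-based crop could look like the sketch below (it assumes 68-point landmarks have already been detected with a tool such as dlib or face-alignment; the helper name and margin are illustrative guesses, not the actual preprocessing used for the paper):

```python
import numpy as np
from PIL import Image

def crop_by_landmarks(img: Image.Image, landmarks: np.ndarray,
                      margin: float = 0.3, size: int = 256) -> Image.Image:
    """Crop a square face region around 68-point landmarks and resize it."""
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)
    side = max(x_max - x_min, y_max - y_min) * (1.0 + margin)  # square box with some margin
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    box = (int(cx - side / 2), int(cy - side / 2),
           int(cx + side / 2), int(cy + side / 2))
    return img.crop(box).resize((size, size), Image.BILINEAR)
```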

makpia commented 4 years ago

The training images should be aligned & cropped according to facial landmarks. The training images you show here are not correctly processed.

There is no need to adjust the loss weight currently.

Thanks for your reply. After checking my cropping results and comparing your cropping code with FAb-Net's, I found that FAb-Net crops the upper-middle region of the pre-cropped image, which yields a tightly cropped face image, whereas your code does not narrow the face region unless an external face detector is used to crop the image. I guess the problem will be solved this time. Thanks for your advice.
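For anyone hitting the same issue, here is roughly what I mean by a "tighter" crop of the loosely pre-cropped VoxCeleb frame (the fractions below are illustrative guesses, not FAb-Net's exact values):

```python
from PIL import Image

def tight_face_crop(img: Image.Image, size: int = 256) -> Image.Image:
    """Keep the upper-middle region of a loosely pre-cropped frame."""
    w, h = img.size
    crop_w, crop_h = int(w * 0.6), int(h * 0.6)   # keep roughly the central 60% of width/height
    left = (w - crop_w) // 2                      # centred horizontally
    top = int(h * 0.05)                           # biased toward the top of the frame
    return img.crop((left, top, left + crop_w, top + crop_h)).resize((size, size), Image.BILINEAR)
```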

makpia commented 4 years ago

Here is the result after making the cropped region "tighter". The training parameters follow the paper exactly.

image

The AU classifier trained on this representation and evaluated on BP4D gets an average F1 of 56.2%. However, although the average is close to the result in the paper, the per-AU F1 scores are 0.582, 0.571, 0.566, 0.562, 0.560, 0.563, 0.575, 0.570, 0.560, 0.561, 0.563, 0.514, which looks nothing like previous work.

image

So, I want to ask several questions:

  1. Is my cropping process correct now?
  2. The visualization results show that only the pose movement is extracted; the expression movement does not appear in either the exp-changed image or the output image. What factors do you think might cause this problem? How can I fix it?
  3. Can you give me more details about the evaluation that might help me fix the problem?
  4. Can you share your trained model? I have been frustrated at not getting the expected experimental results.

mysee1989 commented 4 years ago

The model will be made public in the next few days.

anmolduainter commented 3 years ago

I was training with a different dataset, and my example pair looks like this. source target

Is this preprocessing okay? Using pairs like the above, I started the training process, and after some epochs, when I checked the results, the expression still seemed biased towards the source image rather than the target image. I also noticed this kind of behaviour in the image posted above by makpia. Am I thinking in the right direction, or am I missing something? Please clarify. Thanks!

zzx820302704 commented 3 years ago

@makpia Have you downloaded the VoxCeleb1/2 datasets? Can you share them? They're too big.

makpia commented 3 years ago

@makpia Have you downloaded the VoxCeleb1/2 datasets? Can you share them? They're too big.

Well, the dataset consists of more than 2*10^8 images (I only downloaded about 70%; it hit the maximum number of files), so I cannot compress or even copy it because of the massive processing time, and I don't have enough space for it either. You can follow the method described in https://github.com/cyrta/voxceleb/blob/9d0aa82e14a44465b3eaf818872cd74ef9edb42b/data/v1/voxceleb1_readme.txt to download the dataset. The size of the whole dataset is about 3~4 TB, I guess. It took me about a month to download, since I had to use a proxy server, which is unstable, slow, and time-consuming to set up and debug. With an unblocked network, it shouldn't take more than two weeks to finish downloading and clipping.

zzx820302704 commented 3 years ago

@makpia Thank you for your reply. Could you please leave your WeChat? I would like to ask you some questions

wwt0805 commented 3 years ago

@makpia Thank you for your reply. Could you please leave your WeChat? I would like to ask you some questions

I've downloaded the whole VoxCeleb2 and combined parts a to i into one zip file of about 280 GB. I tried to extract it, but by the time the process reached 50%, the image files already took at least 1 TB. My SSD ran out of storage! So if you want the full dataset, it may take up 2 TB of storage space.

makpia commented 3 years ago

@makpia Thank you for your reply. Could you please leave your WeChat? I would like to ask you some questions

Sure, but I haven't set up a WeChat ID. You can add me on QQ: 601936549.

makpia commented 3 years ago

@makpia Thank you for your reply. Could you please leave your WeChat? I would like to ask you some questions

I've downloaded the whole VoxCeleb2 and combined parts a to i into one zip file of about 280 GB. I tried to extract it, but by the time the process reached 50%, the image files already took at least 1 TB. My SSD ran out of storage! So if you want the full dataset, it may take up 2 TB of storage space.

It is really a difficult task. One way to zip and pack this dataset is to pack each clip first: replace the images in a clip's folder with a single file, which saves a lot of disk space, but then an extra unpacking step is needed while the code is running.
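A minimal sketch of that packing idea (the helper names and the pickle format are arbitrary choices; an LMDB or tar archive would serve the same purpose):

```python
import io
import os
import pickle
from PIL import Image

def pack_clip(folder: str, out_path: str) -> None:
    """Pack all frames of one clip folder into a single file of encoded JPEG bytes."""
    frames = {}
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), "rb") as f:
            frames[name] = f.read()               # keep the encoded bytes as-is
    with open(out_path, "wb") as f:
        pickle.dump(frames, f)

def load_frame(pack_path: str, name: str) -> Image.Image:
    """Decode one frame back from a packed clip, e.g. inside a Dataset's __getitem__."""
    with open(pack_path, "rb") as f:
        frames = pickle.load(f)
    return Image.open(io.BytesIO(frames[name])).convert("RGB")
```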