mpatacchiola / deepgaze

Computer Vision library for human-computer interaction. It implements Head Pose and Gaze Direction Estimation Using Convolutional Neural Networks, Skin Detection through Backprojection, Motion Detection and Tracking, Saliency Map.
MIT License

Is there any example for this library? #42

Closed Ostnie closed 6 years ago

Ostnie commented 6 years ago

Hello, I have done some work on head pose estimation. Your work is amazing and I would like to try it, but I can't find any example of how to use it. Could you give me some advice? Many thanks!

mpatacchiola commented 6 years ago

Hi @Ostnie,

All the examples are contained in the example folder of the repository. On the main page of the project you can find links to videos and code for the head pose estimation algorithms:

Hope that helped you...

Ostnie commented 6 years ago

Can I use it on images? I only found examples for video. It is unbelievable how fast you answered me! Thanks

mpatacchiola commented 6 years ago

Yes, of course, it is really easy to use it on images. The only thing you have to do is modify the OpenCV call. For videos you either access the camera directly or open a video file, like this:

video_capture = cv2.VideoCapture(file_path)

For images you simply have to add the path to your image and use the method 'imread()':

image = cv2.imread(file_name) #Read the image with OpenCV

The video call returns a stream of frames, whereas the image call returns a single image that you can process as you like.
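
The difference between the two calls can be sketched as a single helper that yields frames from either kind of file. This is only a minimal sketch: the names `is_image_file`, `iter_frames` and `IMAGE_EXTENSIONS` are hypothetical and not part of Deepgaze, and choosing between `cv2.imread` and `cv2.VideoCapture` by file extension is an assumption made for illustration.

```python
from pathlib import Path

# Assumption: these extensions are treated as still images, anything else as video.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def is_image_file(path):
    """Return True if the file extension looks like a still image."""
    return Path(path).suffix.lower() in IMAGE_EXTENSIONS


def iter_frames(path):
    """Yield frames from an image (exactly one) or from a video file (many)."""
    import cv2  # imported lazily so is_image_file works without OpenCV installed

    if is_image_file(path):
        image = cv2.imread(str(path))  # returns None if the file cannot be read
        if image is None:
            raise IOError("Could not read image: %s" % path)
        yield image
        return

    video_capture = cv2.VideoCapture(str(path))
    if not video_capture.isOpened():
        raise IOError("Could not open video: %s" % path)
    try:
        while True:
            ok, frame = video_capture.read()  # ok is False when the stream ends
            if not ok:
                break
            yield frame
    finally:
        video_capture.release()
```

With a helper like this, the same per-frame processing loop (`for frame in iter_frames(path): ...`) would work for both images and videos.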

Ostnie commented 6 years ago

Oh, I see, Many thanks!

Ostnie commented 6 years ago

@mpatacchiola Hi, I have run your code successfully on my computer and it has helped me a lot. However, I now have a problem: the pictures I want to use cover a wide range of head poses (for example, yaw between -90 and 90 degrees), and the angles I get from your code are relatively small. How can I solve this problem?

mpatacchiola commented 6 years ago

Hi @Ostnie

The accuracy of the CNN degrades for angles at the limits of the range. There is not much we can do, because this is due to the dataset used for training. In that dataset the number of images with such extreme poses was quite limited, and for this reason the ability of the network to generalize to those poses is poor. As a workaround, you can try to combine multiple methods to get a mixture-of-experts estimate.
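
One very simple way to combine several methods is a weighted average of their angle estimates. The reply does not say how the experts should be mixed, so the sketch below is only an assumption: `combine_estimates` is a hypothetical helper, the weights are fixed by hand rather than learned, and plain averaging is only reasonable because the angles stay within -90 to 90 degrees (no wrap-around).

```python
def combine_estimates(estimates, weights=None):
    """Weighted average of angle estimates (in degrees) from several methods.

    estimates: one angle per method, e.g. the CNN plus other estimators.
    weights:   optional non-negative trust values, one per method.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    if weights is None:
        weights = [1.0] * len(estimates)  # equal trust in every method
    if len(weights) != len(estimates):
        raise ValueError("estimates and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(e * w for e, w in zip(estimates, weights)) / total
```

For extreme poses, one could give a larger weight to a method that is known to behave better near the limits of the range.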

Ostnie commented 6 years ago

Hi @mpatacchiola, I plan to train a model for large poses, so I read your paper and I have some questions. Could you please help me?

1. On page 4 you said: "We trained different CNNs for each degree of freedom. This kind of strategy has the advantage of splitting the main problem into different sub-problems which are easier to manage. Having a specialised network for roll, pitch and yaw, permits fine tuning the network for a specific degree of freedom without losing the predictive power obtained on another one". I still don't know whether the final network is a single one for yaw, pitch and roll, or three CNNs, one for each degree of freedom. I guess you end up with a single network for all degrees of freedom, but I don't know how to combine three into one. You mentioned fine-tuning. Could you explain the details?

2. I have just started this research, so I have some doubts about how the final angle is obtained. In your paper I see that you divided each degree of freedom into groups in steps of 15. You seem to be turning it into a classification problem, but your program can accurately estimate the angles in between, such as 1 to 14 degrees. I don't know whether I am describing the question clearly. For example, assuming that my training angles are only 0 and 90, how do you estimate the intermediate value of 45?

Many thanks!

mpatacchiola commented 6 years ago

Hi @Ostnie

  1. Yes, there are three different CNNs, one for each degree of freedom. The problem with many datasets out there is that roll, pitch and yaw are not uniformly distributed. For instance, you can have a dataset with 20k images (like the AFLW) in which the roll distribution is strongly peaked around a mean value and the standard deviation is very small. On the other hand, the yaw angle in the same dataset has a large standard deviation. In the roll case it may be better to use a small network to avoid overfitting, whereas for the yaw angle it is possible to use a larger network because we have higher variance. With three different networks you can "fine tune" the architecture and hyperparameters of each one and obtain better results. Combining them is easy: once you get an input image, you run all of them separately.

  2. In my article the CNNs have a continuous output, and the problem is treated as a regression problem. However, you can also rethink the problem in classification terms: imagine dividing the continuous space into bins. If the continuous output of the network falls into one of those bins, you say that the network predicted that class. The discretization was necessary in order to compare the performance of the network with other methods.
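
The two points above can be sketched together: three independent regressors are run on the same image, and each continuous output can then be mapped to a bin for comparison with classification-style methods. This is a minimal sketch under stated assumptions. `estimate_pose`, `angle_to_bin` and `same_class` are hypothetical names, the models are stand-in callables rather than the Deepgaze networks, and the -90 to 90 degree range with 15-degree bins is assumed from the discussion, not taken from the paper's code.

```python
def estimate_pose(image, yaw_model, pitch_model, roll_model):
    """Run one independent regressor per degree of freedom on the same image."""
    return {
        "yaw": yaw_model(image),
        "pitch": pitch_model(image),
        "roll": roll_model(image),
    }


def angle_to_bin(angle, min_angle=-90.0, max_angle=90.0, step=15.0):
    """Map a continuous angle in degrees to the index of the bin containing it."""
    n_bins = int((max_angle - min_angle) / step)
    clipped = min(max(angle, min_angle), max_angle)  # keep the angle in range
    index = int((clipped - min_angle) // step)
    return min(index, n_bins - 1)  # max_angle itself belongs to the last bin


def same_class(predicted_angle, target_angle):
    """True if prediction and target fall into the same bin."""
    return angle_to_bin(predicted_angle) == angle_to_bin(target_angle)


# Stand-in models (assumption): each one just returns a fixed angle in degrees.
pose = estimate_pose(None, lambda img: 37.0, lambda img: -5.0, lambda img: 2.5)
yaw_bin = angle_to_bin(pose["yaw"])
```

Because the networks regress a continuous value, intermediate angles come directly from the output. The bins are only applied afterwards, when a class label is needed for evaluation.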

Ostnie commented 6 years ago

I'm sorry to bother you again. Reading your paper, I found that the CNNs you use are very simple: the most complex reference network is AlexNet, and most of the others are LeNet and its variants. Why didn't you try more complex networks such as VGG or ResNet? Besides this, I have another request. I haven't used a CNN for regression before, and I haven't found any tutorials in the past few days. What changes are needed in the network to go from classification to regression? Some people say that you can change the softmax function to a sigmoid function. Is that correct? If you could share your experience or recommend a web tutorial, that would be great. Thank you very much!

mpatacchiola commented 6 years ago

Hi @Ostnie

I did not use any deeper model for a series of reasons. When I started working on that article, ResNet was not yet a widely adopted architecture, whereas VGG was too large a model for the datasets I was working with. For sure, an extension of my work could be the use of a ResNet.

In regression you have a continuous output from the network, so instead of a softmax you can use a sigmoid or a tanh function (in my article I used a tanh). The loss function is generally the mean squared error between the target value and the output of the net. In the end this is not so different from a classification problem. In my article you can find all the details, and the code of the network is available in Deepgaze for you to study.
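
The regression setup described above can be sketched in plain Python: a tanh output bounded in [-1, 1] and a mean squared error loss. This is only an illustrative sketch. Scaling the target angles by a maximum of 90 degrees so that they match the tanh range is an assumption, and the function names are hypothetical, not the Deepgaze implementation.

```python
import math

MAX_ANGLE = 90.0  # assumption: angles are scaled to [-1, 1] to match tanh


def normalize(angle):
    """Scale an angle in degrees to the tanh output range."""
    return angle / MAX_ANGLE


def denormalize(output):
    """Convert a tanh output back to degrees."""
    return output * MAX_ANGLE


def regression_output(pre_activation):
    """Final layer for regression: tanh replaces the softmax of a classifier."""
    return math.tanh(pre_activation)


def mean_squared_error(predictions, targets):
    """Average squared difference between network outputs and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

During training the loss would be computed on normalized targets, for example `mean_squared_error([regression_output(z)], [normalize(45.0)])`, and `denormalize` would be applied to the network output at inference time.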