imatge-upc / egocentric-2016-saliency

Research on the prediction of visual saliency in egocentric vision.
http://imatge-upc.github.io/egocentric-2016-saliency/

State of the art on egocentric saliency prediction #2

Open xavigiro opened 8 years ago

xavigiro commented 8 years ago

You should identify and read a few (3-5) scientific papers or works that are similar to your research.

I think that in the egocentric vision literature the term "attention" is often used as a concept similar to saliency.

I have found some papers that I would like you to look at and write a short summary (one paragraph for each):

Yamada, Kentaro, Yusuke Sugano, Takahiro Okabe, Yoichi Sato, Akihiro Sugimoto, and Kazuo Hiraki. "Can saliency map models predict human egocentric visual attention?." In Computer Vision–ACCV 2010 Workshops, pp. 420-429. Springer Berlin Heidelberg, 2010.

Matsuo, Kenji, Kentaro Yamada, Satoshi Ueno, and Sei Naito. "An attention-based activity recognition for egocentric video." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 551-556. 2014.

Bettadapura, Vinay, Irfan Essa, and Caroline Pantofaru. "Egocentric field-of-view localization using first-person point-of-view devices." In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pp. 626-633. IEEE, 2015.

Fathi, Alireza, Yin Li, and James M. Rehg. "Learning to recognize daily actions using gaze." In Computer Vision–ECCV 2012, pp. 314-327. Springer Berlin Heidelberg, 2012.

In particular, I want you to answer these questions:

Reply to this issue with a paragraph each time you finish reading a paper. Make sure you answer the questions I posed.

monicachs commented 8 years ago

Can saliency map models predict human egocentric visual attention?

This experiment evaluates how well saliency map models can predict human egocentric visual attention. The dataset was recorded in a single environment, a room: four different subjects (one at a time) sat on a chair while another person walked randomly around them, and both subjects looked around the room, moving their heads freely, for one minute. The saliency model they used to compute saliency maps for the egocentric videos (based on Itti et al.'s model [2] and Harel et al.'s model [3]) works like a pyramid. The first stage performs a feature decomposition of the input image using simple linear filters, which separate the image into static features (intensity, colour and orientation) and dynamic features (motion and flicker). In the second stage, feature maps are computed from Gaussian pyramids, and in the last stage the final saliency map is obtained by combining the normalized feature maps. The results of the experiment show that saliency maps can predict human egocentric visual attention, but conventional saliency map models handle the static features better than the dynamic features, so many regions can be erroneously assigned. They also conclude that further study is needed so that the dynamic features do not degrade performance in egocentric vision, because these models cannot account for a person's ego motion.
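Not the paper's code, but a minimal sketch of the static part of that Itti-style pipeline (feature decomposition, Gaussian pyramids, centre-surround feature maps, normalized combination), assuming OpenCV (`cv2`) and NumPy. The opponency channels and scale pairs are simplifications, and the dynamic motion/flicker channels are omitted entirely.

```python
import cv2
import numpy as np


def gaussian_pyramid(channel, levels=6):
    """Build a Gaussian pyramid for one feature channel."""
    pyr = [channel]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr


def center_surround(pyr, c, s):
    """Centre-surround difference between a fine level c and a coarse level s."""
    coarse = cv2.resize(pyr[s], pyr[c].shape[::-1])
    return cv2.absdiff(pyr[c], coarse)


def itti_like_saliency(frame_bgr):
    """Simplified static saliency map (intensity + colour opponency only)."""
    frame = frame_bgr.astype(np.float32) / 255.0
    b, g, r = cv2.split(frame)
    intensity = (r + g + b) / 3.0
    rg = r - g                 # rough red-green opponency
    by = b - (r + g) / 2.0     # rough blue-yellow opponency

    saliency = np.zeros(intensity.shape, np.float32)
    for channel in (intensity, rg, by):
        pyr = gaussian_pyramid(channel)
        for c, s in [(1, 3), (1, 4), (2, 4)]:   # a few centre-surround scale pairs
            fm = center_surround(pyr, c, s)
            fm = cv2.resize(fm, intensity.shape[::-1])
            fm = cv2.normalize(fm, None, 0.0, 1.0, cv2.NORM_MINMAX)
            saliency += fm
    # Final map: normalized combination of all feature maps.
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)
```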

monicachs commented 8 years ago

An attention-based activity recognition for egocentric video.

In this paper, they propose a different method to improve activity recognition. A recent work presented a method that predicts key objects from hand manipulation, but this is problematic because not all objects can be manipulated by hand, and not all objects being manipulated are important for human visual attention. In their new method, visual attention is computed not only from static saliency but also from dynamic ego motion. They modify the block diagram of the conventional method as follows:

- Object detection: the first step detects objects in an egocentric video. Each detected object is classified as active or passive, and the method provides its region and likelihood.
- Attention quantification: this step extracts the user's interest in each frame and constructs the visual attention map. To construct the saliency map they use the approach proposed by Chen et al.; this step also measures the camera rotation, which is equivalent to the user's own motion (ego motion).
- Attention assignment: this step finds the objects on which the visual attention is principally focused. Each object is classified as salient or non-salient depending on whether its attention value is above or below a threshold.
- Descriptor generation: the output of this step is a four-dimensional histogram. Each detected object falls into one of four bins according to whether it is salient or non-salient and active or passive (see the sketch after this list).
- Temporal pyramid representation: this provides temporal robustness for activity recognition.
- SVM (Support Vector Machine) classification: the last step classifies the activity.
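As a rough, illustrative sketch of the attention assignment and descriptor generation steps (not the authors' code): assume each detected object comes with a mean attention value over its region and an active/passive flag; the threshold and the example values below are made up.

```python
import numpy as np

# Hypothetical detection records: (mean attention over the object's region, is_active flag)
detections = [
    (0.82, True),   # e.g. a cup being manipulated under high attention
    (0.15, False),  # e.g. a chair in the background
    (0.60, False),
    (0.45, True),
]

ATTENTION_THRESHOLD = 0.5  # illustrative value, not taken from the paper


def frame_descriptor(detections, threshold=ATTENTION_THRESHOLD):
    """4-bin histogram: (salient / non-salient) x (active / passive) object counts."""
    hist = np.zeros(4, dtype=np.float32)
    for attention, is_active in detections:
        salient = attention >= threshold          # attention assignment by thresholding
        bin_index = (0 if salient else 2) + (0 if is_active else 1)
        hist[bin_index] += 1
    # Normalize so frames with different numbers of detections are comparable.
    if hist.sum() > 0:
        hist /= hist.sum()
    return hist


print(frame_descriptor(detections))  # e.g. [0.25 0.25 0.25 0.25]
```

In the pipeline described above, these per-frame histograms would then be aggregated with the temporal pyramid representation and fed to the SVM classifier.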

The dataset for this work consisted of 20 different people recording egocentric videos in their own homes. They conclude that their method can predict salient regions better than the hand-based approach.