Open nicholast1985 opened 2 years ago

I found this work really fascinating and I was wondering if you could provide code for computing the gold standard? I tried computing it using pysaliency, but I must be doing something wrong because it takes a really long time to compute the gold standard for just 1000 frames.

How did you compute the gold standard for the LEDOV dataset?
Thank you for your interest in our work! I copied my gold standard implementation to the following gist: https://gist.github.com/mtangemann/231f26d9066ce1b63c724abe3fef935e
Computing the gold standard may take a while, however: due to the leave-one-out cross validation, a prediction has to be computed for every gaze position, which for 1000 frames amounts to roughly 10k-100k predictions. At the relatively high resolution of LEDOV, computing this on the CPU easily takes hours. The implementation above is optimized for computing leave-one-out predictions and can run on the GPU, so it should be faster. Let me know if you have further questions!
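For orientation, here is a minimal sketch of what the leave-one-out prediction for a single frame amounts to: the other subjects' fixations are blurred with a Gaussian and mixed with a uniform distribution (as discussed further down in this thread). The function, the `sigma_px` bandwidth and the mixture weight `eps` are illustrative placeholders, not the exact implementation from the gist:

```python
import torch
import torch.nn.functional as F

def loo_gold_standard(fixation_maps, sigma_px, eps):
    """Sketch: leave-one-out gold standard densities for one frame.

    fixation_maps: (N, H, W) binary tensor, one fixation map per subject.
    Returns an (N, H, W) tensor where map i is predicted from all subjects
    except subject i (Gaussian-blurred and mixed with a uniform distribution).
    """
    n, h, w = fixation_maps.shape
    # leave-one-out fixation maps: sum over all subjects minus the held-out one
    loo = fixation_maps.sum(dim=0, keepdim=True) - fixation_maps

    # separable Gaussian blur with a placeholder bandwidth (in pixels)
    radius = int(3 * sigma_px)
    grid = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-0.5 * (grid / sigma_px) ** 2)
    kernel = kernel / kernel.sum()
    blurred = loo.unsqueeze(1).float()  # (N, 1, H, W)
    blurred = F.conv2d(blurred, kernel.view(1, 1, 1, -1), padding=(0, radius))
    blurred = F.conv2d(blurred, kernel.view(1, 1, -1, 1), padding=(radius, 0))
    blurred = blurred.squeeze(1)

    # normalize each map to a probability distribution and mix with a uniform one
    density = blurred / blurred.sum(dim=(1, 2), keepdim=True).clamp(min=1e-12)
    return (1 - eps) * density + eps / (h * w)
```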
Wow! Thank you for uploading your implementation of the Gold Standard, I am very grateful for it. The sigma (or bandwidth) is 1 degree of visual angle and has to match the bandwidth used when creating the saliency maps, right?
Had a chance to look at the code:
# create single map for each gaze position
# (gaze_ is assumed to hold integer indices per fixation: [map index, row, column])
maps_single = torch.zeros(gaze.size(0), *self.size_tuple).to(gaze.device)
maps_single[gaze_[:, 0], gaze_[:, 1], gaze_[:, 2]] = 1

# create leave-one-out (loo) maps
map_all = maps_single.sum(dim=0)
maps = map_all.repeat(gaze.size(0), 1, 1) - maps_single
Concise and straight to the point! Doesn't this assume that every subject has one fixation per frame? Due to the mismatch in temporal resolution between the eye tracker and the movie, I end up with more than one fixation per subject on each frame. I was thinking of allowing multiple fixations per subject and having each maps_single entry represent a single subject.
Great that the code helps! And yes, you're right: it is assumed that there is at most one fixation per subject and frame. I preprocessed the gaze data so that this holds by rounding the fixation start and end times to the frame times: a fixation is assigned to a frame only if the event overlaps the frame by more than 50% of the frame duration, which can be true for at most one fixation.
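One way to implement that >50% overlap rule is sketched below; the function name and the input format (per-subject fixation events with start and end times) are my assumptions, not the original preprocessing code:

```python
def assign_fixations_to_frames(fixations, frame_duration, num_frames):
    """Assign at most one fixation per subject to each frame.

    fixations: list of (start_time, end_time, x, y) events for one subject.
    A fixation is assigned to a frame only if it covers more than 50% of that
    frame's duration, so each frame ends up with at most one fixation.
    """
    per_frame = [None] * num_frames
    for start, end, x, y in fixations:
        for frame in range(num_frames):
            frame_start = frame * frame_duration
            overlap = min(end, frame_start + frame_duration) - max(start, frame_start)
            if overlap > 0.5 * frame_duration:
                per_frame[frame] = (x, y)
    return per_frame
```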
Awesome, thank you for the very swift responses. Last, and hopefully final, question: I have N subjects and would like to compare my model (trained in leave-one-movie out fashion) with the gold standard.
Say my left-out movie holds X frames. If I understand correctly, for every frame I should:
1) compute the gold standard for all N subjects in leave-one-subject-out fashion (N-1 subjects to predict the Nth, i.e. maps_repeat - maps_single), then apply the Gaussian and mix with the uniform,
2) take my N maps and compute the average over N,
3) compute the CC (or any other metric) between the average of the N gold standards and the ground truth,
4) relate my leave-one-movie-out model to the computed CC for the averaged gold standards?
Yes, the gold standard is evaluated for every frame, but the predictions are not averaged: instead, for every subject, the prediction given all other subjects is evaluated individually. So you get one performance value for each of the N maps. Afterwards, these performances are averaged for that frame and compared to the model prediction.
This requires a metric that allows evaluating predictions for individual fixations, which is not possible with CC. In our paper, we evaluated the gold standard using information gain, AUC and NSS. In general, I'd recommend using information gain as the primary metric and computing the other metrics only for comparison with previous works (Kümmerer et al. 2018 analyze this in detail).
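A minimal sketch of that per-subject evaluation for a single frame, using NSS as an example of a metric that can be evaluated on individual fixations (the helper functions and the `loo_predictions` format are my assumptions, not the evaluation code from the paper):

```python
import torch

def nss(prediction, fixation):
    """Normalized Scanpath Saliency of one prediction at a single (row, col) fixation."""
    normalized = (prediction - prediction.mean()) / prediction.std()
    return normalized[fixation[0], fixation[1]].item()

def gold_standard_score_for_frame(loo_predictions, fixations):
    """Average per-subject leave-one-out performance for one frame.

    loo_predictions: (N, H, W) tensor; map i is predicted from all subjects but i.
    fixations: list of N (row, col) positions, or None for subjects without a
    fixation on this frame.
    """
    scores = [
        nss(loo_predictions[i], fixation)
        for i, fixation in enumerate(fixations)
        if fixation is not None
    ]
    return sum(scores) / len(scores) if scores else float("nan")
```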
Great! Thank you for all your help Matthias!
Part 1: To use Information Gain as the primary metric, I:
1) normalize the output using (out - out.min()) / (out.max() - out.min()),
2) divide by the sum to convert the output into a probability distribution,
3.i) compute the image-based KLD using the probability distribution,
3.ii) apply the log, and then compute the Log Likelihood and Information Gain.
Question(s):
Part 2: Also, in Kümmerer et al. 2015, they say: "to evaluate metrics described above on the probabilistic models, we used the log-probability maps as saliency maps."...
Question(s):
- Does that mean I should compute the CC, NSS, and any other non-probabilistic metric using the log-probability? Kümmerer et al. 2017 explain how to compute the saliency maps from the predicted probability or log-probability for each individual metric.
- I found this reference online which uses DeepGaze to get log-probability outputs. From there, they obtain the saliency maps (to be used for all other metrics) by exponentiating and dividing by the max (see the sketch after this list).
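For concreteness, a sketch of that conversion, assuming log_density is a 2D tensor of per-pixel log-probabilities (the function and variable names are mine):

```python
import torch

def saliency_map_from_log_density(log_density):
    """Exponentiate a log-probability map and divide by its maximum.

    Subtracting the max before exponentiating gives the same result as
    torch.exp(log_density) / torch.exp(log_density).max(), but is numerically safer.
    """
    return torch.exp(log_density - log_density.max())
```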
I'm computing the CC, image-based KLD, NLL and NSS as sanity checks on the training and validation sets during training (but I'm only using the NLL as the loss function when using option b).
Question(s):
Part 1: Normalization using (out - out.min()) / (out.max() - out.min()) and dividing by the sum is one way to get a probability distribution, but there could be different ways that result in better predictions. Fitting a point-wise nonlinearity helps with finding those. This is mainly a concern when you don't train the model itself but only the point-wise nonlinearity. If you train the entire model yourself, it should learn to work with whatever normalization you put at the end. The range normalization you proposed might, however, be difficult since the min and max vary between samples.
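As an illustration of a normalization that does not depend on the per-sample range, here is a rough sketch in the spirit of the finalizers used by DeepGaze-style models (blurred readout plus center bias, normalized with a log-softmax over all pixels); this is my own simplified version, not the exact finalizer from the paper:

```python
import torch

def finalize(raw_output, centerbias_log_density=None):
    """Sketch of a finalizer-style readout.

    raw_output: (H, W) tensor of unnormalized (already blurred) model outputs.
    Optionally adds a center-bias log density, then applies a log-softmax over
    all pixels, so the result is a proper log probability distribution that is
    independent of the raw output's min and max.
    """
    x = raw_output
    if centerbias_log_density is not None:
        x = x + centerbias_log_density
    # subtract logsumexp over all pixels: per-pixel log probabilities summing to 1
    return x - torch.logsumexp(x.flatten(), dim=0)
```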
The finalizer we use has some advantages in practice:
For the non-probabilistic metrics, there are ways to transform the log-probability into an optimal saliency map for that metric, as described in the mentioned paper (e.g. blurring with the same std as used for computing the CC metric).
Part 2: One difference is that with option (a) you get probabilities whereas you get log probabilities with option (b). This means you need to adapt the computation of the metrics. Could this be the reason for the differences you observed?
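For example, the per-fixation log-likelihood (the basis of information gain) has to be computed differently depending on whether the model outputs probabilities or log-probabilities; a small sketch, with hypothetical variable names:

```python
import torch

def avg_log_likelihood(output, fixations, output_is_log=True):
    """Average log-likelihood (nats per fixation) of the ground-truth fixations.

    output: (H, W) tensor of per-pixel probabilities or log-probabilities.
    If the output already holds log-probabilities, taking another log would be
    wrong; if it holds probabilities, the log still has to be applied.
    """
    values = torch.stack([output[row, col] for row, col in fixations])
    if not output_is_log:
        values = torch.log(values)
    return values.mean()

def information_gain(model_log_likelihood, baseline_log_likelihood):
    """Information gain in bits per fixation over a baseline (e.g. center bias) model."""
    return (model_log_likelihood - baseline_log_likelihood) / torch.log(torch.tensor(2.0))
```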