Open nicholast1985 opened 2 years ago

I found this work really fascinating and I was wondering if you could provide code for computing the gold standard? I tried computing it using pysaliency, but I must be doing something wrong because it takes a really long time to compute the gold standard for just 1000 frames.

How did you compute the gold standard for the LEDOV dataset?
Thank you for your interest in our work! I copied my gold standard implementation to the following gist: https://gist.github.com/mtangemann/231f26d9066ce1b63c724abe3fef935e
Computing the gold standard may take a while, however: due to the leave-one-out cross validation, a prediction has to be computed for every gaze position, which for 1000 frames amounts to roughly 10k-100k predictions. At the relatively high resolution of LEDOV, computing this on the CPU easily takes hours. The implementation above is optimized for computing leave-one-out predictions and can run on the GPU, so it should be faster. Let me know if you have further questions!
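For orientation, here is a minimal sketch of what the leave-one-out prediction for a single frame amounts to: the other subjects' fixations are blurred with a Gaussian and mixed with a uniform distribution (as discussed further down in this thread). The function, the `sigma_px` bandwidth and the mixture weight `eps` are illustrative placeholders, not the exact implementation from the gist:

```python
import torch
import torch.nn.functional as F

def loo_gold_standard(fixation_maps, sigma_px, eps):
    """Sketch: leave-one-out gold standard densities for one frame.

    fixation_maps: (N, H, W) binary tensor, one fixation map per subject.
    Returns an (N, H, W) tensor where map i is predicted from all subjects
    except subject i (Gaussian-blurred and mixed with a uniform distribution).
    """
    n, h, w = fixation_maps.shape
    # leave-one-out fixation maps: sum over all subjects minus the held-out one
    loo = fixation_maps.sum(dim=0, keepdim=True) - fixation_maps

    # separable Gaussian blur with a placeholder bandwidth (in pixels)
    radius = int(3 * sigma_px)
    grid = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-0.5 * (grid / sigma_px) ** 2)
    kernel = kernel / kernel.sum()
    blurred = loo.unsqueeze(1).float()  # (N, 1, H, W)
    blurred = F.conv2d(blurred, kernel.view(1, 1, 1, -1), padding=(0, radius))
    blurred = F.conv2d(blurred, kernel.view(1, 1, -1, 1), padding=(radius, 0))
    blurred = blurred.squeeze(1)

    # normalize each map to a probability distribution and mix with a uniform one
    density = blurred / blurred.sum(dim=(1, 2), keepdim=True).clamp(min=1e-12)
    return (1 - eps) * density + eps / (h * w)
```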
Wow! Thank you for uploading your implementation of the Gold Standard, I am very grateful for it. The sigma (or bandwidth) is 1 degree of visual angle and has to match the bandwidth used when creating the saliency maps, right?
Had a chance to look at the code:
# create single map for each gaze position
# (gaze_ is assumed to hold integer indices per fixation: [map index, row, column])
maps_single = torch.zeros(gaze.size(0), *self.size_tuple).to(gaze.device)
maps_single[gaze_[:, 0], gaze_[:, 1], gaze_[:, 2]] = 1

# create leave-one-out (loo) maps
map_all = maps_single.sum(dim=0)
maps = map_all.repeat(gaze.size(0), 1, 1) - maps_single
Concise and straight to the point! Doesn't this assume that every subject has one fixation per frame? Due to the mismatch in temporal resolution between the eye tracker and the movie, I end up with more than one fixation per subject on each frame. I was thinking of allowing multiple fixations per subject and having each maps_single entry represent a single subject.
Great that the code helps! And yes, you're right: it is assumed that there is at most one fixation per subject and frame. I preprocessed the gaze data so that this holds by rounding the fixation start and end times to the frame times: a fixation is assigned to a frame only if the event overlaps the frame by more than 50% of the frame duration, which can be true for at most one fixation.
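One way to implement that >50% overlap rule is sketched below; the function name and the input format (per-subject fixation events with start and end times) are my assumptions, not the original preprocessing code:

```python
def assign_fixations_to_frames(fixations, frame_duration, num_frames):
    """Assign at most one fixation per subject to each frame.

    fixations: list of (start_time, end_time, x, y) events for one subject.
    A fixation is assigned to a frame only if it covers more than 50% of that
    frame's duration, so each frame ends up with at most one fixation.
    """
    per_frame = [None] * num_frames
    for start, end, x, y in fixations:
        for frame in range(num_frames):
            frame_start = frame * frame_duration
            overlap = min(end, frame_start + frame_duration) - max(start, frame_start)
            if overlap > 0.5 * frame_duration:
                per_frame[frame] = (x, y)
    return per_frame
```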
Awesome, thank you for the very swift responses. Last, and hopefully final, question: I have N subjects and would like to compare my model (trained in leave-one-movie out fashion) with the gold standard.
Say my left-out movie holds X frames. If I understand correctly, for every frame I should:
1) compute the gold standard for all N subjects in leave-one-subject-out fashion (N-1 subjects to predict the Nth, i.e. maps_repeat - maps_single), then apply the Gaussian and mix with the uniform,
2) take my N maps and compute the average over N,
3) compute the CC (or any other metric) between the average of the N gold standards and the ground truth,
4) relate my leave-one-movie-out model to the computed CC for the averaged gold standards?
Yes, the gold standard is evaluated for every frame, but the predictions are not averaged: instead, for every subject, the prediction given all other subjects is evaluated individually. So you get one performance value for each of the N maps. Afterwards, these performances are averaged for that frame and compared to the model prediction.
This requires a metric that allows evaluating predictions for individual fixations, which is not possible with CC. In our paper, we evaluated the gold standard using information gain, AUC and NSS. In general, I'd recommend using information gain as the primary metric and computing the other metrics only for comparison with previous works (Kümmerer et al. 2018 analyze this in detail).
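A minimal sketch of that per-subject evaluation for a single frame, using NSS as an example of a metric that can be evaluated on individual fixations (the helper functions and the `loo_predictions` format are my assumptions, not the evaluation code from the paper):

```python
import torch

def nss(prediction, fixation):
    """Normalized Scanpath Saliency of one prediction at a single (row, col) fixation."""
    normalized = (prediction - prediction.mean()) / prediction.std()
    return normalized[fixation[0], fixation[1]].item()

def gold_standard_score_for_frame(loo_predictions, fixations):
    """Average per-subject leave-one-out performance for one frame.

    loo_predictions: (N, H, W) tensor; map i is predicted from all subjects but i.
    fixations: list of N (row, col) positions, or None for subjects without a
    fixation on this frame.
    """
    scores = [
        nss(loo_predictions[i], fixation)
        for i, fixation in enumerate(fixations)
        if fixation is not None
    ]
    return sum(scores) / len(scores) if scores else float("nan")
```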
Great! Thank you for all your help Matthias!
Part 1: To use Information Gain as the primary metric, I:
1) normalize the output using (out - out.min()) / (out.max() - out.min()),
2) divide by the sum to convert the output into a probability distribution,
3.i) compute the image-based KLD using the probability distribution,
3.ii) apply the log, and then compute the Log Likelihood and Information Gain.
Question(s):
Part 2: Also, in Kümmerer et al. 2015, they say: "to evaluate metrics described above on the probabilistic models, we used the log-probability maps as saliency maps."...
Question(s):
- Does that mean I should compute the CC, NSS, and any other non-probabilistic metric using the log-probability? Kümmerer et al. 2017 explain how to compute the saliency maps from the predicted probability or log-probability for each individual metric.
- I found this reference online which uses DeepGaze to get log-probability outputs. From there, they obtain the saliency maps (to be used for all other metrics) by exponentiating and dividing by the max (see the sketch after this list).
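For concreteness, a sketch of that conversion, assuming log_density is a 2D tensor of per-pixel log-probabilities (the function and variable names are mine):

```python
import torch

def saliency_map_from_log_density(log_density):
    """Exponentiate a log-probability map and divide by its maximum.

    Subtracting the max before exponentiating gives the same result as
    torch.exp(log_density) / torch.exp(log_density).max(), but is numerically safer.
    """
    return torch.exp(log_density - log_density.max())
```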
I'm computing the CC, image-based KLD, NLL and NSS as sanity checks on the training and validation sets during training (but I'm only using the NLL as the loss function when using option b).
Question(s):
Part 1: Normalization using (out - out.min()) / (out.max() - out.min()) and dividing by the sum is one way to get a probability distribution, but there could be different ways that result in better predictions. Fitting a point-wise nonlinearity helps with finding those. This is mainly a concern when you don't train the model itself but only the point-wise nonlinearity. If you train the entire model yourself, it should learn to work with whatever normalization you put at the end. The range normalization you proposed might, however, be difficult since the min and max vary between samples.
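As an illustration of a normalization that does not depend on the per-sample range, here is a rough sketch in the spirit of the finalizers used by DeepGaze-style models (blurred readout plus center bias, normalized with a log-softmax over all pixels); this is my own simplified version, not the exact finalizer from the paper:

```python
import torch

def finalize(raw_output, centerbias_log_density=None):
    """Sketch of a finalizer-style readout.

    raw_output: (H, W) tensor of unnormalized (already blurred) model outputs.
    Optionally adds a center-bias log density, then applies a log-softmax over
    all pixels, so the result is a proper log probability distribution that is
    independent of the raw output's min and max.
    """
    x = raw_output
    if centerbias_log_density is not None:
        x = x + centerbias_log_density
    # subtract logsumexp over all pixels: per-pixel log probabilities summing to 1
    return x - torch.logsumexp(x.flatten(), dim=0)
```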
The finalizer we use has some advantages in practice:
For the non-probabilistic metrics, there are ways to transform the log-probability into an optimal saliency map for that metric, as described in the mentioned paper (e.g. blurring with the same std as used for computing the CC metric).
Part 2: One difference is that with option (a) you get probabilities whereas you get log probabilities with option (b). This means you need to adapt the computation of the metrics. Could this be the reason for the differences you observed?
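For example, the per-fixation log-likelihood (the basis of information gain) has to be computed differently depending on whether the model outputs probabilities or log-probabilities; a small sketch, with hypothetical variable names:

```python
import torch

def avg_log_likelihood(output, fixations, output_is_log=True):
    """Average log-likelihood (nats per fixation) of the ground-truth fixations.

    output: (H, W) tensor of per-pixel probabilities or log-probabilities.
    If the output already holds log-probabilities, taking another log would be
    wrong; if it holds probabilities, the log still has to be applied.
    """
    values = torch.stack([output[row, col] for row, col in fixations])
    if not output_is_log:
        values = torch.log(values)
    return values.mean()

def information_gain(model_log_likelihood, baseline_log_likelihood):
    """Information gain in bits per fixation over a baseline (e.g. center bias) model."""
    return (model_log_likelihood - baseline_log_likelihood) / torch.log(torch.tensor(2.0))
```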