psychoinformatics-de / remodnav

Robust Eye Movement Detection for Natural Viewing

Script to reproduce basic stats reported in Anderson et al. / compare the algorithm against them #2

Closed mih closed 5 years ago

mih commented 6 years ago

ATM we cannot reproduce the published numbers (the reason is still under investigation). Here is a summary of the differences: duration stats for fixations, saccades, PSOs, and pursuits (means and SDs in ms, "No" is the number of events). The first value is ours, the one in parentheses is the one reported in the paper. Note that our values are not our own detection results, but stats computed from their released data. Subjectively substantial deviations are marked in bold, although there really should be no deviations at all, given that the code computing the stats is identical for all event types (see the PR).
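For reference, the stats computation amounts to run-length encoding the per-sample labels and summarizing the run durations per event type. This is a minimal sketch, not the actual eval code; the 500 Hz sampling rate and the label encoding are assumptions:

```python
import numpy as np

def duration_stats(labels, sr=500.0):
    """Mean and SD duration (in ms) plus event count, per event type,
    from a per-sample label sequence sampled at `sr` Hz."""
    labels = np.asarray(labels)
    # indices where the label changes, i.e. run boundaries
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    durations = {}
    for s, e in zip(starts, ends):
        durations.setdefault(labels[s], []).append((e - s) / sr * 1000)
    return {evt: (np.mean(d), np.std(d), len(d))
            for evt, d in durations.items()}

# e.g. one 3-sample fixation and one 2-sample saccade at 500 Hz
print(duration_stats(['FIX', 'FIX', 'FIX', 'SAC', 'SAC']))
```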

Conclusion: We get what is in the paper for saccades and PSOs, but something substantially different for the number of fixations.

Fixation durations

| Coder | IMG-Mean | IMG-SD | IMG-No | VID-Mean | VID-SD | VID-No |
|-------|----------|--------|--------|----------|--------|--------|
| MN | 252 (248) | 285 (271) | **403 (380)** | 304 (318) | 277 (289) | **82 (67)** |
| RA | 247 (242) | 288 (273) | **391 (369)** | 232 (240) | 177 (189) | **81 (67)** |

Saccade durations

| Coder | IMG-Mean | IMG-SD | IMG-No | VID-Mean | VID-SD | VID-No |
|-------|----------|--------|--------|----------|--------|--------|
| MN | 29 (30) | 17 (17) | 377 (376) | 26 (26) | 13 (13) | 117 (116) |
| RA | 31 (31) | 15 (15) | 374 (372) | 25 (25) | 12 (12) | 127 (126) |

PSO durations

| Coder | IMG-Mean | IMG-SD | IMG-No | VID-Mean | VID-SD | VID-No |
|-------|----------|--------|--------|----------|--------|--------|
| MN | 21 (21) | 11 (11) | 313 (312) | 20 (20) | 11 (11) | 97 (97) |
| RA | 21 (21) | 9 (9) | 310 (309) | 17 (17) | 8 (8) | 89 (89) |

Pursuit durations

| Coder | IMG-Mean | IMG-SD | IMG-No | VID-Mean | VID-SD | VID-No |
|-------|----------|--------|--------|----------|--------|--------|
| MN | 363 (363) | 153 (187) | 3 (3) | 528 (521) | 344 (347) | 51 (50) |
| RA | 299 (305) | 175 (184) | 17 (16) | 481 (472) | 317 (319) | 70 (68) |

Confusions

Assuming we make no mistakes extracting the Anderson labels, here is how our algorithm performs with respect to confusions.
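Conceptually, each matrix below is a timepoint-wise cross-tabulation of two label sequences, along these lines (a minimal sketch assuming equal-length, already-remapped label arrays; not the actual eval code):

```python
import numpy as np

EVENTS = ('FIX', 'SAC', 'PSO', 'PUR')  # assumed event labels

def confusion_matrix(labels_a, labels_b):
    """Timepoint-wise confusion matrix; rows index the labels of
    method A, columns those of method B."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    cm = np.zeros((len(EVENTS), len(EVENTS)), dtype=int)
    for i, ev_a in enumerate(EVENTS):
        for j, ev_b in enumerate(EVENTS):
            cm[i, j] = np.sum((labels_a == ev_a) & (labels_b == ev_b))
    return cm
```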

MN vs. RA

This gives us the baseline

[figure: confusion matrix, MN vs. RA]

algorithm vs. coder MN

[figure: confusion matrix, algorithm vs. coder MN]

algorithm vs. coder RA

[figure: confusion matrix, algorithm vs. coder RA]

Mis-classification summary stats

For all pairwise comparisons, this shows the overall misclassification rate (using timepoints as the unit of measure, and limited to timepoints labeled FIX, SAC, PSO, or PUR by either method, hence ignoring NaN/blinks and "undefined", which is rarely used), and the same misclassification rate computed while ignoring PUR events too. The remaining numbers are the percentages of labels occurring in the misclassified samples. In contrast to the paper, the misclassifying method is named explicitly (rather than "over" and "under", which I found confusing).
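In code, the misclassification rate described above corresponds roughly to the following sketch (not the exact implementation; dropping 'PUR' from `events` is one plausible reading of the w/oP variant):

```python
import numpy as np

def misclassification_rate(labels_a, labels_b,
                           events=('FIX', 'SAC', 'PSO', 'PUR')):
    """Fraction of timepoints on which two methods disagree,
    restricted to timepoints that either method labeled with one of
    `events` (NaN/blinks and 'undefined' are thereby ignored)."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    relevant = np.isin(labels_a, events) | np.isin(labels_b, events)
    return np.mean(labels_a[relevant] != labels_b[relevant])
```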

images

Analogous to Table 8 in the paper:

| Comparison | MCLF | MCLFw/oP | Method | Fix | Sacc | PSO | SP |
|------------|------|----------|--------|-----|------|-----|----|
| MN v RA | 6.2 | 3.1 | MN | 68 | 11 | 21 | 0 |
| -- | -- | -- | RA | 15 | 14 | 20 | 52 |
| MN v ALGO | 33.3 | 11.2 | MN | 88 | 1 | 10 | 1 |
| -- | -- | -- | ALGO | 2 | 16 | 8 | 74 |
| RA v ALGO | 33.6 | 10.4 | RA | 81 | 2 | 9 | 8 |
| -- | -- | -- | ALGO | 7 | 16 | 8 | 69 |

dots

Analogous to Table 9 in the paper:

| Comparison | MCLF | MCLFw/oP | Method | Fix | Sacc | PSO | SP |
|------------|------|----------|--------|-----|------|-----|----|
| MN v RA | 11.1 | 5.0 | MN | 11 | 9 | 9 | 71 |
| -- | -- | -- | RA | 64 | 7 | 6 | 23 |
| MN v ALGO | 23.9 | 9.7 | MN | 10 | 1 | 6 | 83 |
| -- | -- | -- | ALGO | 74 | 8 | 5 | 12 |
| RA v ALGO | 26.7 | 10.4 | RA | 25 | 2 | 4 | 69 |
| -- | -- | -- | ALGO | 60 | 10 | 5 | 25 |

videos

Analogous to Table 10 in the paper:

| Comparison | MCLF | MCLFw/oP | Method | Fix | Sacc | PSO | SP |
|------------|------|----------|--------|-----|------|-----|----|
| MN v RA | 18.5 | 4.0 | MN | 75 | 3 | 8 | 15 |
| -- | -- | -- | RA | 16 | 4 | 3 | 77 |
| MN v ALGO | 38.1 | 10.6 | MN | 37 | 1 | 5 | 58 |
| -- | -- | -- | ALGO | 54 | 9 | 6 | 31 |
| RA v ALGO | 38.6 | 11.7 | RA | 22 | 1 | 4 | 73 |
| -- | -- | -- | ALGO | 66 | 10 | 7 | 17 |

Interim conclusion

Performance looks good. Without pursuit, our numbers look better than the stats in the paper (although I am not 100% confident that we compute things in exactly the same way). Confusion patterns with and without pursuit look sensible.

Critical feedback appreciated!

codecov-io commented 6 years ago

Codecov Report

Merging #2 into master will increase coverage by 0.59%. The diff coverage is n/a.


```diff
@@            Coverage Diff             @@
##           master       #2      +/-   ##
==========================================
+ Coverage   88.92%   89.51%   +0.59%     
==========================================
  Files           8        8              
  Lines         677      677              
==========================================
+ Hits          602      606       +4     
+ Misses         75       71       -4
```
| Impacted Files | Coverage Δ |
|----------------|------------|
| remodnav/tests/test_detect.py | 98.57% <0%> (+5.71%) :arrow_up: |


adswa commented 6 years ago

These results look neat. Thanks a lot for all the work!

I have not succeeded in finding a definite explanation for the odd occasional differences in fixation and pursuit durations. However, their paper was published online/accepted in 2016, while the main analysis code was published on GitHub in August 2017 with the commit message 'Added code for processing data extracted from algorithms, including latest(?) version of extracted data. More code to follow.' Maybe these 'latest(?)' versions of the extracted data account for the differences between the results reported in the paper and their more recent code. At the very least, it suggests that data more recent than what the paper was based on was shared a year after the paper was accepted.

And to quote their readme:

"Some values in the scripts may have been interactively changed during the analysis, so it should not be a interpreted as run-once-for-complete-results code."

as well as

"Some of the matlab code used for the publication"

with an emphasis on some: I wasn't able to find several functions anywhere, e.g. their function simpleAgreement (used in mainDetection to calculate the proportion of correct classifications). In my opinion it therefore seems impossible, or at least infeasible, to figure out exactly how their computations were done. A guess at what that function might have looked like follows below.
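For what it's worth, one plausible reading of the missing simpleAgreement, and this is purely a guess since the original is unavailable, would be the proportion of identically labeled samples:

```python
import numpy as np

def simple_agreement(labels_a, labels_b):
    """Hypothetical stand-in for the unavailable simpleAgreement:
    proportion of timepoints on which two coders/methods assign the
    same label. Whether this matches the original is unverified."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    assert labels_a.shape == labels_b.shape
    return np.mean(labels_a == labels_b)
```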

The way you computed the results looks perfectly reasonable to me, regardless of whether it is a 100% replication of their analysis. It's cool that the performance without pursuits exceeds their stats. With pursuits included, the results still show good performance of remodnav, and it is understandable that adding a classification label to the confusion matrix is more likely to decrease overall classification accuracy compared to the paper's "only top 3 events" approach.

mih commented 6 years ago

@AdinaWagner Thx!

I have updated the top comment with plots that also compare MN to RA to give us a baseline reference. I also added the analog of tables 8-10 to the top comment. Still looking good.

mih commented 5 years ago

OK, the test passed -- I will merge this now to give us a starting point for the final stretch.

adswa commented 5 years ago

Richard Anderson has presumably uploaded all of the files now. I will try to rerun the script locally to see whether the results finally reproduce the article's once the previously missing file is included.

adswa commented 5 years ago

Maybe I'm missing something here, or maybe I don't fully understand the part of the script with the confusion matrices.

The most interesting question answered first: the missing file appears to be UH33_trial17_labelled_MN.mat. The new data directory is, luckily, more organized than before, and only paired data files are present.

I tried to feed the new data directory into @mih's eval/anderson.py script. This works fine for the print_duration stats and yields, unsurprisingly, the same results as in the first summary tables of this conversation (here is the terminal output, for anyone interested in checking):

```
images MN   FIX: 0.252 (0.285) [403]   SAC: 0.029 (0.017) [377]   PSO: 0.021 (0.011) [313]   PURS: 0.363 (0.153) [3]
images RA   FIX: 0.247 (0.288) [391]   SAC: 0.031 (0.015) [374]   PSO: 0.021 (0.009) [310]   PURS: 0.299 (0.175) [17]
dots MN     FIX: 0.191 (0.088) [12]    SAC: 0.023 (0.010) [47]    PSO: 0.015 (0.005) [33]    PURS: 0.363 (0.233) [48]
dots RA     FIX: 0.168 (0.090) [21]    SAC: 0.022 (0.011) [47]    PSO: 0.015 (0.008) [28]    PURS: 0.367 (0.329) [45]
videos MN   FIX: 0.304 (0.277) [82]    SAC: 0.026 (0.013) [117]   PSO: 0.020 (0.011) [97]    PURS: 0.528 (0.344) [51]
videos RA   FIX: 0.232 (0.177) [81]    SAC: 0.025 (0.012) [127]   PSO: 0.017 (0.008) [89]    PURS: 0.481 (0.317) [70]
```

However, the confusion() function fails for comparisons of the two human coders (i.e., when running confusion('RA', 'MN'), but not when running the comparison with the 'ALGO' option).

The error is due to a mismatch in shape, for example:

```
<ipython-input-99-317f50ae23b8> in confusion(refcoder, coder)
     70                     intersec = np.sum(np.logical_and(
     71                         labels[0] == anderson_remap[c1label],
---> 72                         labels[1] == anderson_remap[c2label]))
     73                     union = np.sum(np.logical_or(
     74                         labels[0] == anderson_remap[c1label],

ValueError: operands could not be broadcast together with shapes (4990,) (4988,)
```
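In other words, the MN and RA label arrays for the affected files no longer contain the same number of samples. Purely as a diagnostic workaround (assuming the discrepancy lies in a few trailing samples, which is unverified), one could trim both arrays to their common length before the element-wise comparison in confusion():

```python
# hypothetical workaround inside confusion(), not a verified fix:
# trim both label arrays to their common length before comparing
n = min(len(labels[0]), len(labels[1]))
l0, l1 = labels[0][:n], labels[1][:n]
intersec = np.sum(np.logical_and(
    l0 == anderson_remap[c1label],
    l1 == anderson_remap[c2label]))
union = np.sum(np.logical_or(
    l0 == anderson_remap[c1label],
    l1 == anderson_remap[c2label]))
```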

I've tried this for a bunch of input files, and the funny thing is, the error emerges for a couple of files but not for all of them. I ran diff -r between the old and the new data directories, and the error occurs consistently, and only, for those files where diff -r indicates that the file content has changed from the old directory to the new one:

```
╭─adina@odin ~/Repos/remodnav/remodnav/tests/data on new_anderson+!
╰─➤ diff -r anderson_etal/annotated_data/complete_data/images anderson_etal_old/annotated_data/images                                                                                                          2 ↵
Binary files anderson_etal/annotated_data/complete_data/images/TH34_img_vy_labelled_MN.mat and anderson_etal_old/annotated_data/images/TH34_img_vy_labelled_MN.mat differ
Only in anderson_etal_old/annotated_data/images: TH38_img_Europe_labelled_RA.mat
Only in anderson_etal_old/annotated_data/images: TH46_img_Rome_labelled_RA.mat
Only in anderson_etal_old/annotated_data/images: TH50_img_vy_labelled_RA.mat
Only in anderson_etal_old/annotated_data/images: TL44_img_konijntjes_labelled_RA.mat
Only in anderson_etal_old/annotated_data/images: TL48_img_Europe_labelled_RA.mat
Only in anderson_etal_old/annotated_data/images: TL48_img_Rome_labelled_RA.mat
╭─adina@odin ~/Repos/remodnav/remodnav/tests/data on new_anderson+!
╰─➤ diff -r anderson_etal/annotated_data/complete_data/videos anderson_etal_old/annotated_data/videos                                                                                                          1 ↵
Binary files anderson_etal/annotated_data/complete_data/videos/TH38_video_dolphin_fov_labelled_MN.mat and anderson_etal_old/annotated_data/videos/TH38_video_dolphin_fov_labelled_MN.mat differ
Only in anderson_etal_old/annotated_data/videos: TH46_video_BergoDalbana_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TH46_video_BiljardKlipp_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TH50_video_TrafikEhuset_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TL32_video_triple_jump_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TL40_video_BiljardKlipp_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TL44_video_triple_jump_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: TL48_video_TrafikEhuset_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: UH27_video_TrafikEhuset_labelled_RA.mat
Binary files anderson_etal/annotated_data/complete_data/videos/UL23_video_triple_jump_labelled_MN.mat and anderson_etal_old/annotated_data/videos/UL23_video_triple_jump_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/videos/UL23_video_triple_jump_labelled_RA.mat and anderson_etal_old/annotated_data/videos/UL23_video_triple_jump_labelled_RA.mat differ
Binary files anderson_etal/annotated_data/complete_data/videos/UL27_video_triple_jump_labelled_MN.mat and anderson_etal_old/annotated_data/videos/UL27_video_triple_jump_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/videos/UL27_video_triple_jump_labelled_RA.mat and anderson_etal_old/annotated_data/videos/UL27_video_triple_jump_labelled_RA.mat differ
Binary files anderson_etal/annotated_data/complete_data/videos/UL31_video_triple_jump_labelled_MN.mat and anderson_etal_old/annotated_data/videos/UL31_video_triple_jump_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/videos/UL31_video_triple_jump_labelled_RA.mat and anderson_etal_old/annotated_data/videos/UL31_video_triple_jump_labelled_RA.mat differ
Only in anderson_etal_old/annotated_data/videos: UL43_video_TrafikEhuset_labelled_RA.mat
Only in anderson_etal_old/annotated_data/videos: UL47_video_BiljardKlipp_labelled_RA.mat
╭─adina@odin ~/Repos/remodnav/remodnav/tests/data on new_anderson+!
╰─➤ diff -r anderson_etal/annotated_data/complete_data/dots anderson_etal_old/annotated_data/dots                                                                                                              1 ↵
Only in anderson_etal_old/annotated_data/dots: TH34_trial17_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TH36_trial17_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TH38_trial17_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TH50_trial1_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TL24_trial1_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TL32_trial17_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TL32_trial1_labelled_RA.mat
Only in anderson_etal_old/annotated_data/dots: TL44_trial1_labelled_RA.mat
Binary files anderson_etal/annotated_data/complete_data/dots/UH21_trial1_labelled_MN.mat and anderson_etal_old/annotated_data/dots/UH21_trial1_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/dots/UH21_trial1_labelled_RA.mat and anderson_etal_old/annotated_data/dots/UH21_trial1_labelled_RA.mat differ
Only in anderson_etal_old/annotated_data/dots: UH31_trial17_labelled_RA.mat
Only in anderson_etal/annotated_data/complete_data/dots: UH33_trial17_labelled_MN.mat
Binary files anderson_etal/annotated_data/complete_data/dots/UH33_trial17_labelled_RA.mat and anderson_etal_old/annotated_data/dots/UH33_trial17_labelled_RA.mat differ
Only in anderson_etal_old/annotated_data/dots: UH33_trial1_labelled_MN.mat
Only in anderson_etal_old/annotated_data/dots: UL25_trial17_labelled_RA.mat
Binary files anderson_etal/annotated_data/complete_data/dots/UL27_trial17_labelled_MN.mat and anderson_etal_old/annotated_data/dots/UL27_trial17_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/dots/UL27_trial17_labelled_RA.mat and anderson_etal_old/annotated_data/dots/UL27_trial17_labelled_RA.mat differ
Only in anderson_etal_old/annotated_data/dots: UL29_trial17_labelled_RA.mat
Binary files anderson_etal/annotated_data/complete_data/dots/UL39_trial1_labelled_MN.mat and anderson_etal_old/annotated_data/dots/UL39_trial1_labelled_MN.mat differ
Binary files anderson_etal/annotated_data/complete_data/dots/UL39_trial1_labelled_RA.mat and anderson_etal_old/annotated_data/dots/UL39_trial1_labelled_RA.mat differ
Only in anderson_etal_old/annotated_data/dots: UL47_trial1_labelled_RA.mat
╭─adina@odin ~/Repos/remodnav/remodnav/tests/data on new_anderson+!
```

If anyone has an idea what I am missing here, I'd be grateful for enlightenment -- assuming this confusion is worth pursuing at all. I was able to compute the confusion matrices between the human coders and the algorithm, and they show negligible differences (in the dots category).

[figures: algo_vs_mn, algo_vs_ra]

Cheers!

mih commented 5 years ago

Thx, will push a PR with the needed changes in a few min.

mih commented 5 years ago

https://github.com/psychoinformatics-de/remodnav/pull/3

adswa commented 5 years ago

Thx!