While the performance difference between the best and second-best systems in Fig. 1 and Fig. 2 of our paper is clear enough, the low number of samples (n=5) produces strange box plots (median values close to the top or bottom of a box, large variation in the height of the plots, and duplicate values), and most systems cannot be reliably ranked. Slightly increasing the number of parser training runs (each with a different initialisation of the parser parameters), e.g. to n=9 (adding 4 runs), should reduce this problem substantially.
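Any n = 5 + 4k (k = 0, 1, 2, ...) has a simple mapping of results to box plot boundaries: the quartiles fall exactly on observed results, so no interpolation between results is needed. For example, for n=9, with the results sorted in ascending order, the 1st, 3rd, 5th, 7th and 9th results give the bottom line, bottom of box, median, top of box, and top line, assuming simple box plots without outliers.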
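The mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example scores are made up for this sketch and do not come from the paper, results are ranked in ascending order, and the box plot is the simple kind without outliers, with quartiles placed at position q * (n - 1) in the sorted list.

```python
def five_number_ranks(n):
    """Return the 1-based ranks (ascending order) of the five box plot boundaries.

    Assumes n = 5 + 4k with k >= 0, so that the quartile positions
    q * (n - 1) for q = 0, 1/4, 1/2, 3/4, 1 are all whole numbers.
    """
    if n < 5 or (n - 1) % 4 != 0:
        raise ValueError("n must be of the form 5 + 4k with k >= 0")
    step = (n - 1) // 4  # number of ranks between neighbouring boundaries
    return [1 + i * step for i in range(5)]


def five_number_summary(scores):
    """Pick bottom line, bottom of box, median, top of box and top line."""
    ordered = sorted(scores)
    return [ordered[rank - 1] for rank in five_number_ranks(len(ordered))]


# Hypothetical scores from 9 training runs (invented for illustration only).
example_scores = [80.1, 79.4, 80.9, 79.9, 80.4, 80.6, 79.7, 80.2, 80.0]

print(five_number_ranks(9))
print(five_number_summary(example_scores))
```

For n=9 the ranks come out as 1, 3, 5, 7 and 9, and for n=13 as 1, 4, 7, 10 and 13. Plotting libraries may use a different quartile convention, so it is worth checking that the plotted boundaries coincide with these results.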
Related: