Some high-level comments. They can be addressed in a subsequent PR:
I think we should show fewer labels on the y-axis of each multi-line plot; it looks too crowded right now.
I tried it with around 5 different executions and the mean line gets buried behind the other lines. I think we can either force the mean line to be drawn last, or set a threshold of 5 executions above which each execution is drawn as dots instead of a line.
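
To make both suggestions concrete, here is a rough sketch. It assumes the plots are drawn with matplotlib, which is a guess on my part, so adapt it to whatever the plotting code actually uses. The names `execution_style`, `mean_series` and `plot_executions` are hypothetical, the switch to dots at exactly 5 or more executions is one reading of the threshold, and the cap of about 5 y-axis intervals is only a placeholder.

```python
def execution_style(n_executions, threshold=5):
    """Return plot kwargs for the individual executions (hypothetical helper).

    Assumption: at or above `threshold` executions we draw markers only.
    """
    if n_executions >= threshold:
        # Many executions: dots only, so the lines do not pile up.
        return {"linestyle": "none", "marker": ".", "alpha": 0.5, "zorder": 1}
    # Few executions: thin, slightly transparent lines.
    return {"linestyle": "-", "linewidth": 1, "alpha": 0.6, "zorder": 1}


def mean_series(runs):
    """Point-wise mean across executions; all runs must have equal length."""
    return [sum(column) / len(column) for column in zip(*runs)]


def plot_executions(ax, runs, threshold=5):
    """Draw every execution, then the mean on top, on a matplotlib Axes."""
    style = execution_style(len(runs), threshold)
    for run in runs:
        ax.plot(run, **style)

    # Draw the mean last and give it a higher zorder so it stays on top
    # regardless of the order in which the artists were added.
    ax.plot(mean_series(runs), color="black", linewidth=2, zorder=3, label="mean")

    # Ask the y-axis locator for at most ~5 intervals to thin out the labels.
    # This works with the default locator on linear axes.
    ax.locator_params(axis="y", nbins=5)
    ax.legend()
```

With `fig, ax = plt.subplots()`, calling `plot_executions(ax, runs)` on a list of equal-length runs would apply both changes at once. Setting `zorder` explicitly is probably more robust than relying on draw order alone, since it still holds if someone reorders the plotting calls later.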