Correct post-hoc test for significance?

Hello,

I noticed that the Conover post-hoc test yields p-values very close to zero, which seems unusual given that the box plots show quite a lot of overlap in SDR values between the methods. Could this be because the test is currently computed by taking a vector of all segment-wise observations from each method and comparing them, ignoring that segments belonging to the same song are correlated? This is how it looks to me at least:

sp.posthoc_conover(df_voc, val_col='score', group_col='estimate')

I don't know statistics very well, but could it be that we need a blocked design, in which the segment-wise observations from the same song are put into one block? I think the Conover method supports block assignments.
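
To make the blocking idea concrete, here is a rough, standard-library-only sketch. It is not the repository's code: the toy scores, the method names A/B/C and the song labels are all made up, and collapsing each song's segments to a median before ranking is just one assumed way to form blocks. It computes the Friedman rank statistic (without tie correction) over songs as blocks, which is the omnibus test that a blocked Conover-style post-hoc would follow.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy data, NOT real results: (method, song, segment-wise score).
records = [
    ("A", "song1", 5.0), ("A", "song1", 5.4),
    ("A", "song2", 3.0), ("A", "song2", 3.2),
    ("A", "song3", 7.0), ("A", "song3", 7.4),
    ("B", "song1", 4.0), ("B", "song1", 4.2),
    ("B", "song2", 2.4), ("B", "song2", 2.6),
    ("B", "song3", 5.9), ("B", "song3", 6.1),
    ("C", "song1", 4.4), ("C", "song1", 4.6),
    ("C", "song2", 1.9), ("C", "song2", 2.1),
    ("C", "song3", 6.4), ("C", "song3", 6.6),
]


def block_by_song(rows):
    """Collapse segment scores to one median per (song, method) pair."""
    grouped = defaultdict(lambda: defaultdict(list))
    for method, song, score in rows:
        grouped[song][method].append(score)
    return {
        song: {method: median(scores) for method, scores in per_method.items()}
        for song, per_method in grouped.items()
    }


def rank_within_block(scores):
    """Rank methods inside one block (1 = lowest), averaging tied ranks."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for method in ordered[i:j + 1]:
            ranks[method] = average_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Return the Friedman statistic (no tie correction) and per-method rank sums."""
    methods = sorted(next(iter(blocks.values())))
    n, k = len(blocks), len(methods)
    rank_sums = dict.fromkeys(methods, 0.0)
    for scores in blocks.values():
        for method, rank in rank_within_block(scores).items():
            rank_sums[method] += rank
    sum_of_squares = sum(r * r for r in rank_sums.values())
    q = 12.0 / (n * k * (k + 1)) * sum_of_squares - 3.0 * n * (k + 1)
    return q, rank_sums


blocks = block_by_song(records)
q, rank_sums = friedman_statistic(blocks)
print(rank_sums, q)
```

Because ranking happens inside each song, a song that is simply hard or easy for every method no longer inflates the apparent difference between methods. scikit-posthocs also seems to ship a blocked variant, `posthoc_conover_friedman`, but its exact arguments should be checked against the documentation before relying on it.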

sigsep / sigsep-mus-2018-analysis
