Thanks for the input. I was actually hoping to get some feedback on this ;-)
are you or do you plan on aggregating the scores over time first (the framewise measures)?
In this analysis I am not aggregating the frames first. However, I did this last year, and you are probably right about this, as the Friedman test is not effective for unequal sample sizes. I will investigate other tests that are not bound to this problem.
I'm also a little scared about those large negative scores I've seen at the start/end of a song.
Even worse, we have lots of nan values for some tracks and methods. I would propose to balance the track samples by removing the frames where all participants have a nan score, but keeping those where only some methods are nan.
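For what it's worth, here is a minimal sketch of that balancing step with pandas; the method names, track id and score values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical framewise scores: one row per (track, frame), one column per method.
scores = pd.DataFrame(
    {
        "METHOD_A": [5.1, np.nan, np.nan, 4.2],
        "METHOD_B": [4.8, np.nan, 3.9, 4.0],
    },
    index=pd.MultiIndex.from_product([["track_01"], range(4)], names=["track", "frame"]),
)

# Drop frames where *all* methods are nan; frames where only some methods
# are nan are kept, so no method loses valid measurements.
balanced = scores.dropna(how="all")
print(balanced)
```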
Hey,
Friedman test is not effective for unequal sample size
It doesn't matter if they were equal anyway: you have a nesting of time within song (I think that's the correct terminology), so the measurements aren't independent. The test you apply will see a larger sample size as you add more time points, but you still only have a fixed number of songs. This sort of issue isn't a problem if you can specify the structure of the data, but the classic approach is to average over one of them. I think this is how Andrew Simpson handled independence in his paper - section A.
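A rough sketch of that classic approach (collapse the time axis within each song first, then test over the per-song values); the CSV layout and column names below are assumptions, not the actual SiSEC data format:

```python
import pandas as pd
from scipy.stats import friedmanchisquare

# Assumed long-format table with columns: track, frame, method, score.
df = pd.read_csv("framewise_scores.csv")

# Average over time within each song so every track contributes exactly
# one value per method (songs become the repeated-measures blocks).
per_track = df.groupby(["track", "method"])["score"].mean().unstack("method")

# Omnibus Friedman test across methods, one observation per track per method
# (assumes no missing track/method combinations).
stat, p = friedmanchisquare(*[per_track[m] for m in per_track.columns])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```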
Even worse is that we have lots of nan values for some tracks and methods
Yeah, I spotted a few of those!
If the nan values tend to occur at the onset and offset of the song, then I would probably drop the same set of nan indices from all methods (even if there are valid measures for some). If someone has messed up their submission, then this isn't a sensible approach, so I would drop nans on a per-method basis. After doing this, I would compute a median (or some percentile) score for each stem (for a given song and method).
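If it helps, a sketch of that per-method variant with pandas (again, the file name and column names are assumptions):

```python
import pandas as pd

# Assumed long-format table with columns: track, method, stem, frame, score.
df = pd.read_csv("framewise_scores.csv")

# Drop nan frames independently for each method (a nan row simply vanishes),
# then reduce each (track, method, stem) to a single median score.
per_stem = (
    df.dropna(subset=["score"])
      .groupby(["track", "method", "stem"])["score"]
      .median()
      .reset_index()
)
print(per_stem.head())
```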
Hopefully, that makes some sense ;-)
What do you think?
It doesn't matter if they were equal anyway: you have a nesting of time within song (I think that's the correct terminology), so the measurements aren't independent. The test you apply will see a larger sample size as you add more time points, but you still only have a fixed number of songs. This sort of issue isn't a problem if you can specify the structure of the data, but the classic approach is to average over one of them. I think this is how Andrew Simpson handled independence in his paper - section A.
Yes, I read his paper a while back. I think I will do the trackwise mean then. Concerning the pairwise comparisons, I think I'll stick with the Conover-Iman test rather than Andrew's use of the Wilcoxon signed-rank test (which I used for SiSEC 2016 as well). See here for some arguments...
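In case it's useful for comparing the two options, a hedged sketch using the scikit-posthocs package for the Conover post-hoc and scipy for a single paired Wilcoxon signed-rank test; the file name and method columns are placeholders, and whether the independent-groups variant (posthoc_conover) or the repeated-measures variant (posthoc_conover_friedman) is the intended one here is my assumption:

```python
import pandas as pd
import scikit_posthocs as sp
from scipy.stats import wilcoxon

# Assumed wide table of per-track (already aggregated) scores:
# one row per track, one column per method.
per_track = pd.read_csv("per_track_scores.csv", index_col="track")

# Pairwise Conover post-hoc p-values for the repeated-measures (Friedman) setting.
pvals = sp.posthoc_conover_friedman(per_track, p_adjust="holm")
print(pvals)

# A single paired Wilcoxon signed-rank comparison between two methods, for reference.
stat, p = wilcoxon(per_track["METHOD_A"], per_track["METHOD_B"])
print(f"Wilcoxon: W = {stat:.1f}, p = {p:.4f}")
```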
if the nan values tend to occur at the onset and offset of the song, then I would probably drop the same set of nan indices from all methods (even if there are valid measures for some). If someone has messed up their submission, then this isn't a sensible approach, so I would drop nans on a per-method basis. After doing this, I would compute a median (or some percentile) score for each stem (for a given song and method).
yes
I will update the code today and move it over to Google Colab. Feel free to comment there.
Feel free to comment here: https://drive.google.com/file/d/1DoGm0WizK_jmgdo1lSVAQRTMESNr6IyO/view?usp=sharing
The stats are aggregated using the mean. The nan filtering is still missing though.
Hey,
Thanks again for all the great work on SiSEC 2018!
Quick one:
When computing summary statistics and comparing distributions of the BSS Eval measures across algorithms, e.g. as done here, are you or do you plan on aggregating the scores over time first (the framewise measures)?
I would not, for example, compare two algorithms using the pooled data (e.g. all SAR values over time and songs), but rather take the medians of the framewise measures to obtain a set of per-song median scores which I can then compare.
I'm also a little scared about those large negative scores I've seen at the start/end of a song.
Cheers