Post-hocs test for dataframes with different group / block / y column names break

sbuschjaeger commented 3 years ago

Hi,

I cannot use the post hoc tests on dataframes with melted=True when group_col != 'groups', block_col != 'blocks', and y_col != 'y'. Basically, anything that deviates from the example

sp.posthoc_nemenyi_friedman(data, y_col='y', block_col='blocks', group_col='groups', melted=True)

breaks the code. The error is likely due to __convert_to_block_df (https://github.com/maximtrp/scikit-posthocs/blob/master/scikit_posthocs/_posthocs.py), which returns the old y_col, group_col, and block_col values but names the columns of the returned dataframe "groups" / "blocks" / "y":

def __convert_to_block_df(a, y_col=None, group_col=None, block_col=None, melted=False):
    # ...
    elif isinstance(a, DataFrame) and melted:
        x = DataFrame.from_dict({'groups': a[group_col],
                                 'blocks': a[block_col],
                                 'y': a[y_col]})
    # ...
    return x, y_col, group_col, block_col
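
The mismatch described above can be reproduced and avoided in a small standalone sketch. This is not the library's code and not necessarily the fix that was later released: convert_to_block_df is a hypothetical stand-in for the melted branch, and the column names "classifier", "dataset" and "accuracy" as well as the values are made up. The idea is simply to return the names that the new dataframe actually carries.

```python
import pandas as pd


def convert_to_block_df(a, y_col, group_col, block_col):
    """Hypothetical stand-in for the melted branch of the helper."""
    x = pd.DataFrame({
        "groups": a[group_col].to_numpy(),
        "blocks": a[block_col].to_numpy(),
        "y": a[y_col].to_numpy(),
    })
    # Return the standardized names, because these are the columns x has.
    # Returning the caller's names here would make later lookups fail.
    return x, "y", "groups", "blocks"


# Made-up long-format data with non-default column names.
data = pd.DataFrame({
    "classifier": ["A", "B", "C", "A", "B", "C"],
    "dataset": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "accuracy": [0.90, 0.80, 0.70, 0.85, 0.75, 0.80],
})

x, y_col, group_col, block_col = convert_to_block_df(
    data, y_col="accuracy", group_col="classifier", block_col="dataset"
)

# Downstream code can now index x with the returned names.
wide = x.pivot(index=block_col, columns=group_col, values=y_col)
print(wide.shape)
```

With the original return statement, the pivot call would look for a column named "accuracy" in x and raise a KeyError.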

On a somewhat related note: I wanted to use these tests to plot CD diagrams as suggested in "J. Demsar (2006), Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, 7, 1-30.", which you also cite in the documentation. However, I have a hard time understanding what "blocks", "groups", and "y" mean in this context. More specifically, are blocks (or groups?) different classifiers or datasets, and is y the ranks or the accuracies? You don't happen to have some example code and/or an explanation of how to plot CD diagrams?

Thanks

maximtrp commented 3 years ago

Thank you for reporting! I am surprised nobody came across (or posted) this issue earlier. Perhaps everyone is happy with the default column names. I have already uploaded a fixed release.

> However, I have a hard time understanding what "blocks", "groups", and "y" mean in this context. More specifically, are blocks (or groups?) different classifiers or datasets, and is y the ranks or the accuracies?

You can find some explanation here. Is it what you are looking for?

> You don't happen to have some example code and/or an explanation of how to plot CD diagrams?

Unfortunately, no. But I have found this repo. I need some time to see how such a diagram is plotted. But I guess it can be adapted to the other tests (not only Wilcoxon as in this repo).
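
As a rough illustration of the question above, here is how the setting in Demsar (2006) is usually laid out: datasets play the role of blocks, classifiers are the groups being compared, and y holds the raw scores (for example accuracies), which are ranked within each dataset. That mapping is an assumption based on the paper's setup, not something confirmed in this thread. The sketch uses plain pandas rather than any scikit-posthocs function, and all names and numbers are made up.

```python
import pandas as pd

# Made-up accuracies: rows are datasets (blocks), columns are classifiers (groups).
wide = pd.DataFrame(
    {
        "clf_a": [0.90, 0.80, 0.70, 0.95],
        "clf_b": [0.85, 0.82, 0.65, 0.90],
        "clf_c": [0.80, 0.78, 0.60, 0.85],
    },
    index=["d1", "d2", "d3", "d4"],
)

# Long ("melted") form with non-default column names:
# one row per (dataset, classifier) pair.
melted = (
    wide.rename_axis("dataset")
    .reset_index()
    .melt(id_vars="dataset", var_name="classifier", value_name="accuracy")
)

# Rank the classifiers within each dataset (rank 1 = highest accuracy),
# then average over datasets. CD diagrams are drawn from these average ranks.
ranks = wide.rank(axis=1, ascending=False)
avg_ranks = ranks.mean(axis=0)

print(melted.head())
print(avg_ranks)
```

In this layout the melted frame would be passed with block_col='dataset', group_col='classifier' and y_col='accuracy', assuming the test ranks the scores itself.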

sbuschjaeger commented 3 years ago

Thanks for the fix. I guess most people just use NumPy arrays, which I now do as well. I also found the repo you mentioned, but the code is rather messy, so I decided to implement my own plotting and use your code for the statistical tests.

I saw that example in the README, and that is what causes my confusion. In your example, the columns correspond to primary factors (the yield) and the rows correspond to blocking factors (the field). You then run the Friedman test on the transposed data matrix, ss.friedmanchisquare(*data.T), to check whether there is a difference in the data. After that, however, you no longer transpose the data, which confuses me. For my use case I have a (19, 13) matrix for which I want to compute the pairwise posthoc_wilcoxon statistics. As expected, ss.friedmanchisquare(*data.T) works fine (with transposed data). However, posthoc_wilcoxon seems to iterate over rows rather than columns. Applying it to my data gives a (19, 19) output, but I expected pairwise tests across the other dimension, i.e. a (13, 13) output. I tried both with and without transposing the data, but judging from the p-values and the resulting dimensions, it only makes sense to me to transpose my data for both calls.

maximtrp commented 3 years ago

Please note the asterisk that I use before data.T. NumPy arrays are unpacked by rows, and we have the groups in columns, so we need to transpose the array first. In your case, you just need to transpose the array (so that the groups are in rows), pass it to ss.friedmanchisquare, and then to the posthoc_wilcoxon function. You should obtain the correct result in that case.
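
The unpacking behaviour mentioned above can be checked with a short sketch. It uses random made-up data of the same (19, 13) shape and assumes that the 13 columns are the groups. The pairwise loop is a plain SciPy stand-in for a column-wise post hoc comparison (uncorrected p-values), not the posthoc_wilcoxon implementation.

```python
import numpy as np
from scipy import stats as ss

rng = np.random.default_rng(0)
data = rng.random((19, 13))  # made-up: 19 blocks (rows) x 13 groups (columns)

# Unpacking a 2-D array with * yields its rows, so *data.T passes
# 13 separate arguments (one per column), each of length 19.
samples = [*data.T]
print(len(samples), samples[0].shape)

# Friedman test across the 13 groups.
stat, p = ss.friedmanchisquare(*data.T)

# Pairwise Wilcoxon signed-rank tests between columns give a (13, 13) matrix.
k = data.shape[1]
pvals = np.ones((k, k))
for i in range(k):
    for j in range(i + 1, k):
        pvals[i, j] = pvals[j, i] = ss.wilcoxon(data[:, i], data[:, j]).pvalue

print(pvals.shape)
```

Unpacking data without the transpose would instead pass 19 arguments of length 13, which matches the (19, 19) shape reported earlier in the thread.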

sbuschjaeger commented 3 years ago

Thanks for the clarification

maximtrp / scikit-posthocs

Post-hocs test for dataframes with different group / block / y column names break #46