raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0
1.63k stars 139 forks source link

Implementation of cld (compact letter display) on the output of pairwise comparisons. #205

Open dalensis opened 3 years ago

dalensis commented 3 years ago

First, I would like to thank for this exceptional library.

I wanted to use the R cld function on the output of the pg.pairwise comparisons ttest and gameshowell and I realized there is none in use in python.

It would be nice to implement it if possible. I prepared the following script that should work.

Thanks

# -*- coding: utf-8 -*-
import string
import pandas as pd

#launch the main function with the output of the pairwise comparison and the CI (for example 99 for alpha=0.01)

def main(df, CI):
    if len(df.index)<2: df = df.rename(columns = {"p-unc" : "pval"})    #the pval column  has different names based on test and numerosity
    else: df = df.rename(columns = {"p-corr" : "pval"})

    groups = sorted(set(df["A"].unique()).union((set(df["B"].unique())))) #take all the names from columns A and B 
    letters = list(string.ascii_uppercase)[:len(groups)]
    cldgroups = letters

    #the following algoritm is a semplification of the classical cld, 

    cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
    for row in df.itertuples():                                 
        if not type(df["pval"][row[0]]) is str and df["pval"][row[0]]>(1-CI/100):
            cld.iat[groups.index(df["A"][row[0]]), 2] +=  cld.iat[groups.index(df["B"][row[0]]), 1]
            cld.iat[groups.index(df["B"][row[0]]), 2] +=  cld.iat[groups.index(df["A"][row[0]]), 1]

    cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))

   #this part will reassign the final name to the group, for sure there are more elegant way of doing it
    cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
    cld["groups"]=""
    letters = list(string.ascii_uppercase)
    unique = []
    for item in cld[2]:

        for fitem in cld["groups"].unique():
            for c in range (0, len(fitem)):
                if not set(unique).issuperset(set(fitem[c])):
                    unique.append(fitem[c])
        g=len(unique)

        for kitem in cld[1]:
            if kitem in item:
                if cld["groups"].loc[cld[1]==kitem].iloc[0]=="": cld["groups"].loc[cld[1]==kitem]+=letters[g]
                if not len(set(cld["groups"].loc[cld[1]==kitem].iloc[0]).intersection(cld.loc[cld[2]==item,"groups"].iloc[0]))>0:
                    if letters[g] not in list(cld["groups"].loc[cld[1]==kitem].iloc[0]): cld["groups"].loc[cld[1]==kitem]+=letters[g]
                    if letters[g] not in list(cld["groups"].loc[cld[2]==item].iloc[0]): cld["groups"].loc[cld[2]==item]+=letters[g]

    cld = cld.sort_values("groups") 
    print(cld)
    return(cld) #return the df. In my base script i catch it, save to xls, and use the groups to tag the column of the plot.
dalensis commented 2 years ago

Updated the script, bests

briannhan commented 2 years ago

@dalensis you said that you updated the script, so can you tell me where I can access this updated version?

dalensis commented 2 years ago

It is the one in the first comment. I substituted it.

Bests

Il mer 2 mar 2022, 03:46 briannhan @.***> ha scritto:

@dalensis https://github.com/dalensis you said that you updated the script, so can you tell me where I can access this updated version?

— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1056086384, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMBE4D7272QWMY4TATU53I75ANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

briannhan commented 2 years ago

I see. I modified your script a little bit to improve readibility as well as to apply your script to statsmodels rather than pingouin (note that I changed columns "A", "B", and "pval" to, respectively, "group1", "group2", and "p-adj" to match the output from statsmodels Tukey's). Otherwise, the script is essentially the same

import string
import pandas as pd
import os
from pathlib import Path

def main(df, alpha=0.05):
    '''
    Creates a compact letter display. This creates a dataframe consisting of
    2 columns, a column containing the treatment groups and a column containing
    the letters that have been assigned to the treatment groups. These letters
    are part of what's called the compact letter display. Treatment groups that
    share at least 1 letter are similar to each other, while treatment groups
    that don't share any letters are significantly different from each other.

    Parameters
    ----------
    df : Pandas dataframe
        Pandas dataframe containing raw Tukey test results from statsmodels.
    alpha : float
        The alpha value. The default is 0.05.

    Returns
    -------
    A dataframe representing the compact letter display, created from the Tukey
    test results.

    '''
    df["p-adj"] = df["p-adj"].astype(float)

    # Creating a list of the different treatment groups from Tukey's
    group1 = set(df.group1.tolist())  # Dropping duplicates by creating a set
    group2 = set(df.group2.tolist())  # Dropping duplicates by creating a set
    groupSet = group1 | group2  # Set operation that creates a union of 2 sets
    groups = sorted(list(groupSet))

    # Creating lists of letters that will be assigned to treatment groups
    letters = list(string.ascii_lowercase)[:len(groups)]
    cldgroups = letters

    # the following algoritm is a simplification of the classical cld,

    cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
    for row in df.itertuples():
        if df["p-adj"][row[0]] > (alpha):
            cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
            cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]

    cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
    cld.rename(columns={0: "groups"}, inplace=True)

    # this part will reassign the final name to the group
    # for sure there are more elegant ways of doing this
    cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
    cld["labels"] = ""
    letters = list(string.ascii_lowercase)
    unique = []
    for item in cld[2]:

        for fitem in cld["labels"].unique():
            for c in range(0, len(fitem)):
                if not set(unique).issuperset(set(fitem[c])):
                    unique.append(fitem[c])
        g = len(unique)

        for kitem in cld[1]:
            if kitem in item:
                if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
                    cld["labels"].loc[cld[1] == kitem] += letters[g]

                # Checking if columns 1 & 2 of cld share at least 1 letter
                if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
                    if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
                        cld["labels"].loc[cld[1] == kitem] += letters[g]
                    if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
                        cld["labels"].loc[cld[2] == item] += letters[g]

    cld = cld.sort_values("labels")
    print(cld)
    print('\n')
    cld.drop(columns=[1, 2], inplace=True)
    print(cld)
    print('\n')
    print('\n')
    return(cld)

I've tested this script on the following data. The raw Tukey results are called "Vmax, timePoint x Vegetation". The compact letter display that this script created for these results are called "Vmax, timePoint x Vegetation, wrong". I also manually created a compact letter display called "Vmax, timePoint x Vegetation, correct". This script made a mistake. If you look at groups 5 x CSS and 6 x CSS, the raw Tukey results say that they are different, and my manual compact letter display gave 5 x CSS the letters "cd" and 6 x CSS the letters "ae". However, this script gives 5 x CSS the letters "bc" and 6 x CSS the letters "ab", indicating that these 2 groups are similar.

My modifications weren't major, so I believe that your script is imperfect

Edit: spelling

dalensis commented 2 years ago

Here a corrected version. Thank you

Creates a compact letter display. This creates a dataframe consisting of
2 columns, a column containing the treatment groups and a column containing
the letters that have been assigned to the treatment groups. These letters
are part of what's called the compact letter display. Treatment groups that
share at least 1 letter are similar to each other, while treatment groups
that don't share any letters are significantly different from each other.

Parameters
----------
df : Pandas dataframe
    Pandas dataframe containing raw Tukey test results from statsmodels.
alpha : float
    The alpha value. The default is 0.05.

Returns
-------
A dataframe representing the compact letter display, created from the Tukey
test results.

import string
import pandas as pd

def main(df, alpha=0.05):

    df["p-adj"] = df["p-adj"].astype(float)

    # Creating a list of the different treatment groups from Tukey's
    group1 = set(df.group1.tolist())  # Dropping duplicates by creating a set
    group2 = set(df.group2.tolist())  # Dropping duplicates by creating a set
    groupSet = group1 | group2  # Set operation that creates a union of 2 sets
    groups = sorted(list(groupSet))

    # Creating lists of letters that will be assigned to treatment groups
    letters = list(string.ascii_lowercase)[:len(groups)]
    cldgroups = letters

    # the following algoritm is a simplification of the classical cld,

    cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
    cld[3]=""

    for row in df.itertuples():
        if df["p-adj"][row[0]] > (alpha):
            cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
            cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]

        if df["p-adj"][row[0]] < (alpha):
                cld.iat[groups.index(df["group1"][row[0]]), 3] +=  cld.iat[groups.index(df["group2"][row[0]]), 1]
                cld.iat[groups.index(df["group2"][row[0]]), 3] +=  cld.iat[groups.index(df["group1"][row[0]]), 1]

    cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
    cld[3] = cld[3].apply(lambda x: "".join(sorted(x)))
    cld.rename(columns={0: "groups"}, inplace=True)

    # this part will reassign the final name to the group
    # for sure there are more elegant ways of doing this
    cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
    cld["labels"] = ""
    letters = list(string.ascii_lowercase)
    unique = []
    for item in cld[2]:

        for fitem in cld["labels"].unique():
            for c in range(0, len(fitem)):
                if not set(unique).issuperset(set(fitem[c])):
                    unique.append(fitem[c])
        g = len(unique)

        for kitem in cld[1]:
            if kitem in item:
                if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
                    cld["labels"].loc[cld[1] == kitem] += letters[g]

                #Checking if there are forbidden pairing (proposition of solution to the imperfect script)                
                if kitem in ' '.join(cld[3][cld["labels"]==letters[g]]): 
                    g=len(unique)+1

                # Checking if columns 1 & 2 of cld share at least 1 letter
                if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
                    if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
                        cld["labels"].loc[cld[1] == kitem] += letters[g]
                    if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
                        cld["labels"].loc[cld[2] == item] += letters[g]

    cld = cld.sort_values("labels")
    print(cld)
    print('\n')
    cld.drop(columns=[1, 2, 3], inplace=True)
    print(cld)
    print('\n')
    print('\n')
    return(cld)
briannhan commented 2 years ago

Thank you so much. I tested your updated script on my dataset and it works on this dataset this time, so thank you for your hard work.

I've also found this paper that discusses algorithms to create compact letter displays. Creating compact letter displays seems to be a topic in the field of computer science with a whole array of literature discussing it, so it's impressive that you created an algorithm that works. Thank you so much

dalensis commented 2 years ago

Excellent! Hope it will be useful also to others once implemented :)

Bests

Il gio 3 mar 2022, 20:18 briannhan @.***> ha scritto:

Thank you so much. I tested your updated script on my dataset and it works on this dataset this time, so thank you for your hard work.

I've also found this paper https://www.sciencedirect.com/science/article/pii/S0167947306003598 that discusses algorithms to create compact letter displays. Creating compact letter displays seems to be a topic in the field of computer science with a whole array of literature discussing it, so it's impressive that you created an algorithm that works. Thank you so much

— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1058401024, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMM64YZZ5CWBFI537LU6EGBPANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

briannhan commented 2 years ago

@dalensis I think that you should make a public repository just to hold this script. Perhaps that would increase the visibility of this script and facilitate further testing? It's up to you, though

dalensis commented 2 years ago

Now that you adapted it and made it more readable maybe it's possible. At the same time there is no homogeneity in test outputs so it might require further work to make it more adaptable. I will try.

Thanks

Il gio 3 mar 2022, 20:42 briannhan @.***> ha scritto:

@dalensis https://github.com/dalensis I think that you should make a public repository just to hold this script. Perhaps that would increase the visibility of this script and facilitate further testing? It's up to you, though

— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1058419366, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMK2SYIAWLG2WHYLDLU6EI3RANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

dalensis commented 2 years ago

Hi, I updated the scrypt and set the depository as you suggested.

https://github.com/dalensis/compact-letter-display-for-pingouin-python/blob/main/cld.py

The code takes the df output of pingouin posthoc analyses and return the compact letter display df.

montagne5641 commented 1 year ago

We have verified that the cld is done correctly using the dataset provided by pinguin and have noticed that it sometimes gives incorrect answers.

import pingouin as pg pgtest = pg.read_dataset('anova') pgtested = pg.pairwise_tukey(data=pgtest, dv='Pain threshold', between='Hair color') pgtested main(pd.DataFrame(pgtested), 95)

pinguin result image

cld result image

Light Brunette's letter is bcc, but it's not correct. Unfortunately I do not know how to respond to this matter. I will share it with those who can help.

dalensis commented 1 year ago

Here a corrected version, there was a problem with the order of the list. I removed the "sorted"

> Creates a compact letter display. This creates a dataframe consisting of
> 2 columns, a column containing the treatment groups and a column containing
> the letters that have been assigned to the treatment groups. These letters
> are part of what's called the compact letter display. Treatment groups that
> share at least 1 letter are similar to each other, while treatment groups
> that don't share any letters are significantly different from each other.
> 
> Parameters
> ----------
> df : Pandas dataframe
>     Pandas dataframe containing raw Tukey test results from statsmodels.
> alpha : float
>     The alpha value. The default is 0.05.
> 
> Returns
> -------
> A dataframe representing the compact letter display, created from the Tukey
> test results.

import string
import pandas as pd

def main(df, alpha=0.05):

    df["p-adj"] = df["p-adj"].astype(float)

    # Creating a list of the different treatment groups from Tukey's
    group1 = set(df.group1.tolist())  # Dropping duplicates by creating a set
    group2 = set(df.group2.tolist())  # Dropping duplicates by creating a set
    groupSet = group1 | group2  # Set operation that creates a union of 2 sets
    groups = list(groupSet) #removed sorted from here

    # Creating lists of letters that will be assigned to treatment groups
    letters = list(string.ascii_lowercase)[:len(groups)]
    cldgroups = letters

    # the following algoritm is a simplification of the classical cld,

    cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
    cld[3]=""

    for row in df.itertuples():
        if df["p-adj"][row[0]] > (alpha):
            cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
            cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]

        if df["p-adj"][row[0]] < (alpha):
                cld.iat[groups.index(df["group1"][row[0]]), 3] +=  cld.iat[groups.index(df["group2"][row[0]]), 1]
                cld.iat[groups.index(df["group2"][row[0]]), 3] +=  cld.iat[groups.index(df["group1"][row[0]]), 1]

    cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
    cld[3] = cld[3].apply(lambda x: "".join(sorted(x)))
    cld.rename(columns={0: "groups"}, inplace=True)

    # this part will reassign the final name to the group
    # for sure there are more elegant ways of doing this
    cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
    cld["labels"] = ""
    letters = list(string.ascii_lowercase)
    unique = []
    for item in cld[2]:

        for fitem in cld["labels"].unique():
            for c in range(0, len(fitem)):
                if not set(unique).issuperset(set(fitem[c])):
                    unique.append(fitem[c])
        g = len(unique)

        for kitem in cld[1]:
            if kitem in item:
                if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
                    cld["labels"].loc[cld[1] == kitem] += letters[g]

                #Checking if there are forbidden pairing (proposition of solution to the imperfect script)                
                if kitem in ' '.join(cld[3][cld["labels"]==letters[g]]): 
                    g=len(unique)+1

                # Checking if columns 1 & 2 of cld share at least 1 letter
                if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
                    if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
                        cld["labels"].loc[cld[1] == kitem] += letters[g]
                    if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
                        cld["labels"].loc[cld[2] == item] += letters[g]

    cld = cld.sort_values("labels")
    print(cld)
    print('\n')
    cld.drop(columns=[1, 2, 3], inplace=True)
    print(cld)
    print('\n')
    print('\n')
    return(cld)
CarloNicolini commented 9 months ago

Is this last answer the correct implementation of the CLD coming from output of the function imported as: from statsmodels.stats.multicomp import pairwise_tukeyhsd? There are a lot of pandas antipatterns (like copying on views) that are a bit worrying to me. Also the results are a bit unintuitive to me.

dalensis commented 9 months ago

This is an algorithm made by a python amateur. If you can improve it you will be more than welcome.

Could you post a database which gives unintuitive results with this script? The aim is to find the minimal number of groups which can distinguish the statistically different samples.