Open dalensis opened 3 years ago
Updated the script, bests
@dalensis you said that you updated the script, so can you tell me where I can access this updated version?
It is the one in the first comment. I substituted it.
Bests
Il mer 2 mar 2022, 03:46 briannhan @.***> ha scritto:
@dalensis https://github.com/dalensis you said that you updated the script, so can you tell me where I can access this updated version?
— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1056086384, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMBE4D7272QWMY4TATU53I75ANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
I see. I modified your script a little bit to improve readibility as well as to apply your script to statsmodels rather than pingouin (note that I changed columns "A", "B", and "pval" to, respectively, "group1", "group2", and "p-adj" to match the output from statsmodels Tukey's). Otherwise, the script is essentially the same
import string
import pandas as pd
import os
from pathlib import Path
def main(df, alpha=0.05):
'''
Creates a compact letter display. This creates a dataframe consisting of
2 columns, a column containing the treatment groups and a column containing
the letters that have been assigned to the treatment groups. These letters
are part of what's called the compact letter display. Treatment groups that
share at least 1 letter are similar to each other, while treatment groups
that don't share any letters are significantly different from each other.
Parameters
----------
df : Pandas dataframe
Pandas dataframe containing raw Tukey test results from statsmodels.
alpha : float
The alpha value. The default is 0.05.
Returns
-------
A dataframe representing the compact letter display, created from the Tukey
test results.
'''
df["p-adj"] = df["p-adj"].astype(float)
# Creating a list of the different treatment groups from Tukey's
group1 = set(df.group1.tolist()) # Dropping duplicates by creating a set
group2 = set(df.group2.tolist()) # Dropping duplicates by creating a set
groupSet = group1 | group2 # Set operation that creates a union of 2 sets
groups = sorted(list(groupSet))
# Creating lists of letters that will be assigned to treatment groups
letters = list(string.ascii_lowercase)[:len(groups)]
cldgroups = letters
# the following algoritm is a simplification of the classical cld,
cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
for row in df.itertuples():
if df["p-adj"][row[0]] > (alpha):
cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]
cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
cld.rename(columns={0: "groups"}, inplace=True)
# this part will reassign the final name to the group
# for sure there are more elegant ways of doing this
cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
cld["labels"] = ""
letters = list(string.ascii_lowercase)
unique = []
for item in cld[2]:
for fitem in cld["labels"].unique():
for c in range(0, len(fitem)):
if not set(unique).issuperset(set(fitem[c])):
unique.append(fitem[c])
g = len(unique)
for kitem in cld[1]:
if kitem in item:
if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
cld["labels"].loc[cld[1] == kitem] += letters[g]
# Checking if columns 1 & 2 of cld share at least 1 letter
if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
cld["labels"].loc[cld[1] == kitem] += letters[g]
if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
cld["labels"].loc[cld[2] == item] += letters[g]
cld = cld.sort_values("labels")
print(cld)
print('\n')
cld.drop(columns=[1, 2], inplace=True)
print(cld)
print('\n')
print('\n')
return(cld)
I've tested this script on the following data. The raw Tukey results are called "Vmax, timePoint x Vegetation". The compact letter display that this script created for these results are called "Vmax, timePoint x Vegetation, wrong". I also manually created a compact letter display called "Vmax, timePoint x Vegetation, correct". This script made a mistake. If you look at groups 5 x CSS and 6 x CSS, the raw Tukey results say that they are different, and my manual compact letter display gave 5 x CSS the letters "cd" and 6 x CSS the letters "ae". However, this script gives 5 x CSS the letters "bc" and 6 x CSS the letters "ab", indicating that these 2 groups are similar.
My modifications weren't major, so I believe that your script is imperfect
Edit: spelling
Here a corrected version. Thank you
Creates a compact letter display. This creates a dataframe consisting of
2 columns, a column containing the treatment groups and a column containing
the letters that have been assigned to the treatment groups. These letters
are part of what's called the compact letter display. Treatment groups that
share at least 1 letter are similar to each other, while treatment groups
that don't share any letters are significantly different from each other.
Parameters
----------
df : Pandas dataframe
Pandas dataframe containing raw Tukey test results from statsmodels.
alpha : float
The alpha value. The default is 0.05.
Returns
-------
A dataframe representing the compact letter display, created from the Tukey
test results.
import string
import pandas as pd
def main(df, alpha=0.05):
df["p-adj"] = df["p-adj"].astype(float)
# Creating a list of the different treatment groups from Tukey's
group1 = set(df.group1.tolist()) # Dropping duplicates by creating a set
group2 = set(df.group2.tolist()) # Dropping duplicates by creating a set
groupSet = group1 | group2 # Set operation that creates a union of 2 sets
groups = sorted(list(groupSet))
# Creating lists of letters that will be assigned to treatment groups
letters = list(string.ascii_lowercase)[:len(groups)]
cldgroups = letters
# the following algoritm is a simplification of the classical cld,
cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
cld[3]=""
for row in df.itertuples():
if df["p-adj"][row[0]] > (alpha):
cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]
if df["p-adj"][row[0]] < (alpha):
cld.iat[groups.index(df["group1"][row[0]]), 3] += cld.iat[groups.index(df["group2"][row[0]]), 1]
cld.iat[groups.index(df["group2"][row[0]]), 3] += cld.iat[groups.index(df["group1"][row[0]]), 1]
cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
cld[3] = cld[3].apply(lambda x: "".join(sorted(x)))
cld.rename(columns={0: "groups"}, inplace=True)
# this part will reassign the final name to the group
# for sure there are more elegant ways of doing this
cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
cld["labels"] = ""
letters = list(string.ascii_lowercase)
unique = []
for item in cld[2]:
for fitem in cld["labels"].unique():
for c in range(0, len(fitem)):
if not set(unique).issuperset(set(fitem[c])):
unique.append(fitem[c])
g = len(unique)
for kitem in cld[1]:
if kitem in item:
if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
cld["labels"].loc[cld[1] == kitem] += letters[g]
#Checking if there are forbidden pairing (proposition of solution to the imperfect script)
if kitem in ' '.join(cld[3][cld["labels"]==letters[g]]):
g=len(unique)+1
# Checking if columns 1 & 2 of cld share at least 1 letter
if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
cld["labels"].loc[cld[1] == kitem] += letters[g]
if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
cld["labels"].loc[cld[2] == item] += letters[g]
cld = cld.sort_values("labels")
print(cld)
print('\n')
cld.drop(columns=[1, 2, 3], inplace=True)
print(cld)
print('\n')
print('\n')
return(cld)
Thank you so much. I tested your updated script on my dataset and it works on this dataset this time, so thank you for your hard work.
I've also found this paper that discusses algorithms to create compact letter displays. Creating compact letter displays seems to be a topic in the field of computer science with a whole array of literature discussing it, so it's impressive that you created an algorithm that works. Thank you so much
Excellent! Hope it will be useful also to others once implemented :)
Bests
Il gio 3 mar 2022, 20:18 briannhan @.***> ha scritto:
Thank you so much. I tested your updated script on my dataset and it works on this dataset this time, so thank you for your hard work.
I've also found this paper https://www.sciencedirect.com/science/article/pii/S0167947306003598 that discusses algorithms to create compact letter displays. Creating compact letter displays seems to be a topic in the field of computer science with a whole array of literature discussing it, so it's impressive that you created an algorithm that works. Thank you so much
— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1058401024, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMM64YZZ5CWBFI537LU6EGBPANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
@dalensis I think that you should make a public repository just to hold this script. Perhaps that would increase the visibility of this script and facilitate further testing? It's up to you, though
Now that you adapted it and made it more readable maybe it's possible. At the same time there is no homogeneity in test outputs so it might require further work to make it more adaptable. I will try.
Thanks
Il gio 3 mar 2022, 20:42 briannhan @.***> ha scritto:
@dalensis https://github.com/dalensis I think that you should make a public repository just to hold this script. Perhaps that would increase the visibility of this script and facilitate further testing? It's up to you, though
— Reply to this email directly, view it on GitHub https://github.com/raphaelvallat/pingouin/issues/205#issuecomment-1058419366, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUSK5VMK2SYIAWLG2WHYLDLU6EI3RANCNFSM5GNPPFBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
Hi, I updated the scrypt and set the depository as you suggested.
https://github.com/dalensis/compact-letter-display-for-pingouin-python/blob/main/cld.py
The code takes the df output of pingouin posthoc analyses and return the compact letter display df.
We have verified that the cld is done correctly using the dataset provided by pinguin and have noticed that it sometimes gives incorrect answers.
import pingouin as pg pgtest = pg.read_dataset('anova') pgtested = pg.pairwise_tukey(data=pgtest, dv='Pain threshold', between='Hair color') pgtested main(pd.DataFrame(pgtested), 95)
pinguin result
cld result
Light Brunette's letter is bcc, but it's not correct. Unfortunately I do not know how to respond to this matter. I will share it with those who can help.
Here a corrected version, there was a problem with the order of the list. I removed the "sorted"
> Creates a compact letter display. This creates a dataframe consisting of
> 2 columns, a column containing the treatment groups and a column containing
> the letters that have been assigned to the treatment groups. These letters
> are part of what's called the compact letter display. Treatment groups that
> share at least 1 letter are similar to each other, while treatment groups
> that don't share any letters are significantly different from each other.
>
> Parameters
> ----------
> df : Pandas dataframe
> Pandas dataframe containing raw Tukey test results from statsmodels.
> alpha : float
> The alpha value. The default is 0.05.
>
> Returns
> -------
> A dataframe representing the compact letter display, created from the Tukey
> test results.
import string
import pandas as pd
def main(df, alpha=0.05):
df["p-adj"] = df["p-adj"].astype(float)
# Creating a list of the different treatment groups from Tukey's
group1 = set(df.group1.tolist()) # Dropping duplicates by creating a set
group2 = set(df.group2.tolist()) # Dropping duplicates by creating a set
groupSet = group1 | group2 # Set operation that creates a union of 2 sets
groups = list(groupSet) #removed sorted from here
# Creating lists of letters that will be assigned to treatment groups
letters = list(string.ascii_lowercase)[:len(groups)]
cldgroups = letters
# the following algoritm is a simplification of the classical cld,
cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
cld[3]=""
for row in df.itertuples():
if df["p-adj"][row[0]] > (alpha):
cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]
if df["p-adj"][row[0]] < (alpha):
cld.iat[groups.index(df["group1"][row[0]]), 3] += cld.iat[groups.index(df["group2"][row[0]]), 1]
cld.iat[groups.index(df["group2"][row[0]]), 3] += cld.iat[groups.index(df["group1"][row[0]]), 1]
cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
cld[3] = cld[3].apply(lambda x: "".join(sorted(x)))
cld.rename(columns={0: "groups"}, inplace=True)
# this part will reassign the final name to the group
# for sure there are more elegant ways of doing this
cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
cld["labels"] = ""
letters = list(string.ascii_lowercase)
unique = []
for item in cld[2]:
for fitem in cld["labels"].unique():
for c in range(0, len(fitem)):
if not set(unique).issuperset(set(fitem[c])):
unique.append(fitem[c])
g = len(unique)
for kitem in cld[1]:
if kitem in item:
if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
cld["labels"].loc[cld[1] == kitem] += letters[g]
#Checking if there are forbidden pairing (proposition of solution to the imperfect script)
if kitem in ' '.join(cld[3][cld["labels"]==letters[g]]):
g=len(unique)+1
# Checking if columns 1 & 2 of cld share at least 1 letter
if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
cld["labels"].loc[cld[1] == kitem] += letters[g]
if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
cld["labels"].loc[cld[2] == item] += letters[g]
cld = cld.sort_values("labels")
print(cld)
print('\n')
cld.drop(columns=[1, 2, 3], inplace=True)
print(cld)
print('\n')
print('\n')
return(cld)
Is this last answer the correct implementation of the CLD coming from output of the function imported as:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
?
There are a lot of pandas antipatterns (like copying on views) that are a bit worrying to me. Also the results are a bit unintuitive to me.
This is an algorithm made by a python amateur. If you can improve it you will be more than welcome.
Could you post a database which gives unintuitive results with this script? The aim is to find the minimal number of groups which can distinguish the statistically different samples.
First, I would like to thank for this exceptional library.
I wanted to use the R cld function on the output of the pg.pairwise comparisons ttest and gameshowell and I realized there is none in use in python.
It would be nice to implement it if possible. I prepared the following script that should work.
Thanks