weighted venn - GitHub Issues

vianmora commented 3 years ago

Hello, I just discovered this amazing tool, thank you very much !

I was wondering if there is a way to weight the data that we pass to the function?

Imagine I have 2 sets of houses that I would like to compare (A and B). Each set has the same size, but the houses in set B that are not in set A are far bigger than the others. So if I plot the Venn diagram of their surface areas, it should show that set B covers far more space than set A :)

I hope that was clear.

Have a nice day

konstantint commented 3 years ago

There is the (perhaps unfortunately named) venn<x>_unweighted function that does what you want. Used like its normal counterpart, it draws an unweighted diagram, but the extra subset_areas parameter lets you specify arbitrary subset areas (unrelated to the numbers drawn on them).

FYI, what it does internally is pretty simple - it first draws the usual diagram with subset_areas and then simply changes the numeric labels to values provided in subsets. I believe that if your diagram shows that one set is bigger, it would be reasonable to see it somehow in the numbers on the diagram as well. So perhaps you should consider either depicting the actual measures of size on the diagram (e.g. total number of rooms), or making explicit labels like "130 rooms (4 houses)". In this case you might want to do the area label renaming manually like it is done there.
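A minimal sketch of that manual relabeling, assuming the diagram is drawn with venn2 and the region labels are then overwritten through get_label_by_id. The helper names (describe, plot_with_custom_labels) and the room and house figures are hypothetical, chosen only for illustration.

```python
def describe(measure, count, unit="rooms", item="houses"):
    """Build a label such as '130 rooms (4 houses)'."""
    return f"{measure} {unit} ({count} {item})"


def plot_with_custom_labels(areas, counts, ax=None):
    """Draw a 2-set diagram sized by `areas` and relabel each region.

    `areas` and `counts` are (only A, only B, both) tuples; both are
    hypothetical inputs for this sketch.
    """
    # Imported here so that describe() stays usable without matplotlib.
    from matplotlib_venn import venn2

    v = venn2(subsets=areas, set_labels=("A", "B"), ax=ax)
    # Region ids: '10' = only A, '01' = only B, '11' = intersection.
    for region_id, area, count in zip(("10", "01", "11"), areas, counts):
        label = v.get_label_by_id(region_id)
        if label is not None:  # a region may have no label to rename
            label.set_text(describe(area, count))
    return v
```

Calling plot_with_custom_labels((130, 45, 60), (4, 2, 3)) would size the regions by the first tuple and print labels such as "130 rooms (4 houses)" on them.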

vianmora commented 3 years ago

Hello, Thank you very much !!

So, with all your information, I created the function I wanted, based on the venn<x>_unweighted function :D

Here's the code I used. If you want to integrate it into your package, or update it first, feel free to do so ;)

import numpy as np
import pandas as pd
from matplotlib_venn import venn2


def venn2_weighted(subset_dicts, set_labels=('A', 'B'), set_colors=('r', 'g'),
                   alpha=0.4, normalize_to=1.0, ax=None, subset_label_formatter=None):
    # Note: only set_labels is forwarded to venn2; the other keyword arguments are unused.

    # One row per key, summing the values in case a key appears more than once.
    A = pd.DataFrame(list(subset_dicts[0].items()), columns=['index', 'value_left'])
    A = A.groupby('index').sum()

    B = pd.DataFrame(list(subset_dicts[1].items()), columns=['index', 'value_right'])
    B = B.groupby('index').sum()

    # Outer merge: keys missing from one side get NaN in that side's column.
    df = A.reset_index().merge(B.reset_index(), on='index', how='outer').set_index('index')
    df['diff'] = df['value_left'] - df['value_right']
    df['min'] = np.where(df['diff'] > 0, df['value_right'], df['value_left'])

    in_left = df['value_left'].notna()
    in_right = df['value_right'].notna()

    # Keys present on one side only contribute their full value to that side.
    nb_gauche = df[in_left & ~in_right]['value_left'].sum()
    nb_droite = df[~in_left & in_right]['value_right'].sum()

    # Shared keys: the surplus goes to the larger side, the minimum to the middle.
    nb_gauche += df[in_left & in_right & (df['diff'] > 0)]['diff'].sum()
    nb_droite += -df[in_left & in_right & (df['diff'] < 0)]['diff'].sum()
    nb_mixte = df[in_left & in_right]['min'].sum()

    subset_areas = (nb_gauche, nb_droite, nb_mixte)

    v = venn2(subset_areas, set_labels=set_labels)

    return v, subset_areas

I had to use dicts and DataFrames instead of sets, but it works! Do you see a better, faster, or more efficient way to do it? :)

konstantint commented 3 years ago

I must admit I'm not sure I understand what you are plotting here. What are the contents of subset_dicts, and what do you visualize?

(In general, I'm happy you found a solution, but perhaps I could help you simplify it if you clarify what is happening).

vianmora commented 3 years ago

Yes, of course,

'subset_dicts' is a list containing 2 dictionaries. Here, for instance, each set contains some houses along with the surface area each one represents. The surface area of the same house can differ between the two sets.

dictA = {'house1':11, 'house2':20, 'house3':10, 'house4':60, 'house5':12, 'house6':19, }

dictB = {'house1':12, 'house3':10, 'house5':9, 'house12':19, 'house20':19, 'house19':19, }

If I only use the normal Venn diagram with the keys:

venn2(subsets=[set(dictA.keys()), set(dictB.keys())], set_labels=('setA', 'setB'))

I get this figure:

But now, if I want to weight each of my houses to show that set B covers less surface than set A, I can use this function:

venn_nlogsoc = venn2_weighted(subset_dicts = [dictA, dictB], set_labels= (f'surface setA ({sum(dictA.values())} m²)', f'surface setB ({sum(dictB.values())} m²)') )

I can also retrieve the areas through the subset_areas variable:

venn_nlogsoc, subset_areas = venn2_weighted(subset_dicts = [dictA, dictB], set_labels= (f'surface setA ({sum(dictA.values())} m²)', f'surface setB ({sum(dictB.values())} m²)') )

(102.0, 58.0, 30.0)

For the cases where the same house has 2 different sizes in the 2 dictionaries: for instance, house5 represents 12 m² in dictA and 9 m² in dictB. Then min(12, 9) = 9 goes to the center, and the difference |12 - 9| = 3 goes to the left because (12 - 9) > 0. It would go to the right if the difference were negative.
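The rule described here (the minimum goes to the center, the surplus goes to whichever side is larger) can also be sketched without pandas. This is only an illustration: the function name weighted_areas is hypothetical, and it assumes all surfaces are non-negative numbers.

```python
def weighted_areas(left, right):
    """Return (only left, only right, both) areas for two {key: size} dicts."""
    only_left = only_right = both = 0
    for key in left.keys() | right.keys():
        a = left.get(key, 0)   # a missing key counts as size 0
        b = right.get(key, 0)
        shared = min(a, b)     # the common part goes to the center
        both += shared
        only_left += a - shared    # surplus on the left side, if any
        only_right += b - shared   # surplus on the right side, if any
    return only_left, only_right, both


dictA = {'house1': 11, 'house2': 20, 'house3': 10,
         'house4': 60, 'house5': 12, 'house6': 19}
dictB = {'house1': 12, 'house3': 10, 'house5': 9,
         'house12': 19, 'house20': 19, 'house19': 19}

print(weighted_areas(dictA, dictB))  # (102, 58, 30)
```

The resulting tuple can be passed to venn2 as the subsets argument, in the same (only A, only B, both) order.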

konstantint commented 3 years ago

I see. Your problem can be solved in a more concise manner as follows:

from collections import Counter
venn2([Counter(dictA), Counter(dictB)])
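To see why this gives the same areas, it may help to look at Counter arithmetic on its own. The sketch below only demonstrates what the standard-library Counter operators return for the two example dictionaries; it makes no claim about how venn2 computes the regions internally, and it assumes the values are non-negative integers.

```python
from collections import Counter

dictA = {'house1': 11, 'house2': 20, 'house3': 10,
         'house4': 60, 'house5': 12, 'house6': 19}
dictB = {'house1': 12, 'house3': 10, 'house5': 9,
         'house12': 19, 'house20': 19, 'house19': 19}

A, B = Counter(dictA), Counter(dictB)

# Subtraction keeps only positive differences; & takes the per-key minimum.
only_a = sum((A - B).values())
only_b = sum((B - A).values())
both = sum((A & B).values())

print(only_a, only_b, both)  # 102 58 30
```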

Sorry I misunderstood your original question - I thought you wanted the numbers on the diagram to differ from the areas (which is the most common request).

vianmora commented 3 years ago

Haha, yes!! That was exactly what I needed, thank you very much!

konstantint / matplotlib-venn

weighted venn #62