paulbrodersen / matplotlib_venn_wordcloud

Venn diagrams with word clouds
MIT License

min and max font sizes #5

Open jemielniak opened 4 years ago

jemielniak commented 4 years ago

Hi,

I love your concept and I think it is a great tool! I'm struggling with setting minimum and maximum font sizes for the words - is there a way to set it (as in a regular wordcloud)?

paulbrodersen commented 4 years ago

Hi @jemielniak

There are maybe three pieces of code on my github that I had hoped never, ever to have to get back to. Figuring out the font sizes for wordcloud was one of them.

The issue is that fontsizes need to be consistent (i.e. reflect the overall word frequency) across the different patches (subsets of the venn diagram). However, you can only figure out the maximum fontsize for each patch by running the packing algorithm. So you need to run wordcloud twice, first without setting the maximum font size, then with setting the maximum fontsize such that the relative word frequencies appear consistent across patches.

Ditto for the minimum font size.
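To make the two-pass idea a bit more concrete, here is a minimal sketch. It is not the code in the package: the helper name consistent_max_font_sizes, the patch labels, and the numbers are made up for illustration, and it assumes that font size should scale linearly with word frequency. Given the maximum font size that the first pass reached in each patch, it derives one common scale and turns it into a per-patch maximum for the second pass.

```python
def consistent_max_font_sizes(first_pass_max, top_frequency):
    """Return a maximum font size per patch for the second pass.

    first_pass_max: patch -> largest font size the packer reached in pass one
    top_frequency:  patch -> frequency of the most frequent word in that patch
    (Both inputs are hypothetical; a real packer would supply them.)
    """
    # Font size per unit of frequency that each patch could afford on its own.
    size_per_frequency = {
        patch: first_pass_max[patch] / top_frequency[patch]
        for patch in first_pass_max
    }
    # The tightest patch sets the common scale, so every patch can honour it.
    common_scale = min(size_per_frequency.values())
    return {patch: common_scale * top_frequency[patch] for patch in first_pass_max}


# Illustrative numbers only.
first_pass_max = {"A only": 60.0, "A and B": 40.0, "B only": 45.0}
top_frequency = {"A only": 30, "A and B": 40, "B only": 15}

second_pass_max = consistent_max_font_sizes(first_pass_max, top_frequency)
print(second_pass_max)
```

In this toy example the "A and B" patch is the most crowded one, so it fixes the scale, and the other two patches get a smaller maximum than they could have used on their own.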

And then you come in and want to set those manually.

Nevertheless, I think I managed to find a solution that allows the user to specify the font range while keeping the scaling consistent across subsets.

Can you please test the current version on github (master branch), see if this behaves as you would expect, and report back?

As outlined in the docstrings, you can use the argument wordcloud_kwargs to pass a dict of keyword, value pairs to wordcloud. So your function call should look something like this:

from matplotlib_venn_wordcloud import venn2_wordcloud, _default_color_func
sets = ...
venn2_wordcloud(sets, wordcloud_kwargs=dict(max_font_size=50, min_font_size=30, color_func=_default_color_func))

paulbrodersen commented 4 years ago

1 sec, github authentication is not letting me push....

paulbrodersen commented 4 years ago

Changes successfully pushed now (commits 891a88c to 3f63864).

jemielniak commented 4 years ago

Many thanks for your reply! I'm amazed how you've been able to work around this.

The code works, in the sense that it does accept the min and max definition. For some combinations of minimum and maximum font size the wordclouds are visible, but for others they are not (I think it may be an issue with their proportion/ratio?).

When I add the word_to_frequency parameter, the code does not work. So

wordcloud_kwargs=dict(max_font_size=25, min_font_size=12)

is fine, but

wordcloud_kwargs=dict(max_font_size=25, min_font_size=12), word_to_frequency=dictionary

does not work.

It also does not really seem to differentiate sizes beyond the two given values (so it is not the case that the most frequent word gets the maximum size and the rest are distributed linearly according to frequency down to the minimum font size).

I understand that it is a very specific use case, and that obviously you don't have time to delve into every single issue. I really appreciate your amazingly useful tool! I'll just describe what I'm trying to achieve, as it may serve as an example of a situation that is more generalizable.

I'm trying to create a Venn diagram wordcloud of media sources used by two different groups of people (to show what sources they share, and what they don't). With the top source having a frequency of 91, fourth just 7, and 18th being already 1, the distribution is quite steep.

When I play around with just the top 10 and use max_font_size=30, min_font_size=10, the result looks like this: image

While if I just rely on word frequency, the result is like this: image

It appears basically as if when using the word frequency function, there is no minimum font size, and as a result all less frequent results disappear into oblivion.

In an ideal state, and I realize that it is not something you may be interested in doing or have time to do like, ever - the most frequent word would be quite large, filling the available breadth, and the least frequent word in a set would still be visible (hence the minimum font size).
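The "ideal state" described above could be sketched as a plain linear mapping from frequency to font size. This is only an illustration of the expectation, not something the package does: the function name linear_font_sizes and the source names are hypothetical, and only the three frequencies (91, 7, 1) and the font range (10 to 30) come from this comment.

```python
def linear_font_sizes(word_to_frequency, min_font_size, max_font_size):
    """Map frequencies linearly onto [min_font_size, max_font_size]."""
    lowest = min(word_to_frequency.values())
    highest = max(word_to_frequency.values())
    if highest == lowest:
        # All words are equally frequent: give them all the maximum size.
        return {word: float(max_font_size) for word in word_to_frequency}
    span = max_font_size - min_font_size
    return {
        word: min_font_size + span * (frequency - lowest) / (highest - lowest)
        for word, frequency in word_to_frequency.items()
    }


# Placeholder names; the frequencies are the ones mentioned above.
frequencies = {"source_01": 91, "source_04": 7, "source_18": 1}
sizes = linear_font_sizes(frequencies, min_font_size=10, max_font_size=30)
print(sizes)
```

With a distribution this steep, a linear mapping keeps the least frequent word readable, but it also puts the fourth source only slightly above the minimum, so nearly all of the size contrast sits between the top word and everything else.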

Anyhow, thanks so much for listening - I'm a beginner who learned a lot from your great module, and I'm super grateful for that (just as much as for the functionality you've provided). Stay safe!

paulbrodersen commented 4 years ago

Your results look roughly like what I would expect them to be, but maybe your intuition/expectation is the better choice.

At the moment, I am not forcing the text object with the smallest fontsize to have a fontsize of min_font_size. Only if that object has a fontsize smaller than min_font_size do I rescale all fontsizes such that the minimum is min_font_size. Basically, the primary objective is still to fill the area; a high signal-to-noise ratio with respect to the frequencies is secondary. The first plot hence doesn't show a wide range of font sizes because there is plenty of room for the words with smaller frequencies.
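As a rough sketch of that rule (not the actual code in the package; the function name rescale_to_minimum and the example sizes are made up, and a single multiplicative factor is my assumption for what "rescale" means here):

```python
def rescale_to_minimum(font_sizes, min_font_size):
    """Rescale all font sizes only if the smallest one is below min_font_size."""
    smallest = min(font_sizes.values())
    if smallest >= min_font_size:
        # Nothing falls below the minimum: leave the packed sizes untouched.
        return dict(font_sizes)
    # One common factor keeps the relative sizes of all words intact.
    factor = min_font_size / smallest
    return {word: size * factor for word, size in font_sizes.items()}


# Case 1: the smallest word is too small, so everything is scaled up.
scaled = rescale_to_minimum({"a": 40.0, "b": 10.0, "c": 5.0}, min_font_size=10)
print(scaled)

# Case 2: the smallest word is already large enough, so nothing changes.
unchanged = rescale_to_minimum({"a": 40.0, "b": 12.0}, min_font_size=10)
print(unchanged)
```

The second case is the situation in the first plot: the smallest word already exceeds min_font_size, so the minimum never comes into play and the range of sizes stays narrow.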

I would like to experiment a bit with this example. Can you

  1. export the word_to_frequency as a csv (as many words as possible, don't cull the list to the top 100, yet) and upload it here, e.g. via

import pandas as pd
# assuming word_to_frequency is a dict mapping each word to its frequency
df = pd.DataFrame(list(word_to_frequency.items()), columns=['word', 'frequency'])
df.to_csv('/path/to/file.csv', index=False)

and then

  2. post the exact commands that you used to create the figures (starting from the csv file)?

yzpan1 commented 1 year ago

I love this tool! I have two questions: how can I set the size of the diagram? And when I save the figure, the resolution is really low; how can I adjust the resolution so that it's clearer?

paulbrodersen commented 1 year ago

@yuanzhop Your questions don't seem to relate to this issue. Could you open new issues for your questions, please? Then I will try to address them there.

yzpan1 commented 1 year ago

Apologies, will do!