paulbrodersen / matplotlib_set_diagrams

Draw Euler diagrams and Venn diagrams with Matplotlib.
GNU General Public License v3.0

WordClouds by Word Frequency #1

Open PUW149 opened 3 months ago

PUW149 commented 3 months ago

Hi, I am trying to draw word cloud Venn diagrams for 2 sets, but altered so that the font size of each word is based on its frequency count. I can see how to do this in word_cloud, but I can't work out how to do it using matplotlib_set_diagrams, which calls word_cloud. Probably not too relevant, but I'm using Python 3.12. Thanks, Simon

paulbrodersen commented 3 months ago

Hi Simon,

Thanks for raising the issue. Unfortunately, this is not a feature that is properly supported, and likely never will be.

If you only care about the relative scaling of word sizes within each subset, then the feature can be easily monkey-patched in:

import numpy as np
import matplotlib.pyplot as plt

from matplotlib_set_diagrams import EulerDiagram
from wordcloud import WordCloud

word_frequency = {
    "a" : 1,
    "b" : 2,
    "c" : 3,
    "d" : 100,
    "e" : 0.1,
}

class CustomEulerDiagram(EulerDiagram):

    def _generate_subset_wordcloud(
            self,
            subset,
            mask,
            rgba,
            wordcloud_kwargs,
    ):

        mask = 255 * np.invert(mask).astype(np.uint8) # black is filled by WordCloud
        rgba_as_tuple = tuple(int(255 * channel) for channel in rgba)

        wc = WordCloud(
            mask             = mask,
            mode             = "RGBA",
            background_color = None,
            color_func       = lambda *args, **kwargs : rgba_as_tuple,
            **wordcloud_kwargs
        )

        # Look up each word of this subset in the module-level word_frequency dict.
        subset_word_frequency = {word : word_frequency[word] for word in subset}

        return wc.generate_from_frequencies(subset_word_frequency).to_array()

fig, (ax1, ax2) = plt.subplots(1, 2)

EulerDiagram.as_wordcloud(
    [
        {"a", "b", "c", "d"},
        {"d", "e"},
    ],
    wordcloud_kwargs=dict(relative_scaling=1),
    ax=ax1)
ax1.set_title("Original")

CustomEulerDiagram.as_wordcloud(
    [
        {"a", "b", "c", "d"},
        {"d", "e"},
    ],
    wordcloud_kwargs=dict(relative_scaling=1),
    ax=ax2)
ax2.set_title("Custom")

plt.show()

[Figure: two Euler diagram word clouds side by side, titled "Original" and "Custom"]

However, notice that while the set elements a, b, and c are displayed proportionally to their indicated frequencies, d and e are not. This is because the diagram above is generated using three independent WordCloud instances, one for each subset. Each WordCloud instance tries to fill the area given to it (this is what a word cloud does), and hence e is plotted much larger than its specified frequency would suggest. As a result, the relative sizes of the elements are not commensurate across subsets.
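To make the effect concrete, here is a minimal sketch in plain Python. It is a simplified model, not WordCloud's actual layout algorithm: it assumes each subset is scaled only relative to its own largest frequency, and it ignores how the available area affects the final font size. The helper names `exclusive_subsets` and `relative_sizes` are made up for this illustration and are not part of matplotlib_set_diagrams or word_cloud.

```python
word_frequency = {"a": 1, "b": 2, "c": 3, "d": 100, "e": 0.1}
sets = [{"a", "b", "c", "d"}, {"d", "e"}]


def exclusive_subsets(sets):
    """Group elements by the exact combination of sets they belong to."""
    subsets = {}
    for element in set().union(*sets):
        membership = tuple(element in s for s in sets)
        subsets.setdefault(membership, set()).add(element)
    return subsets


def relative_sizes(subset, frequencies):
    """Scale frequencies relative to the largest one *within this subset*.

    Assumption: this mimics one independent word cloud per subset,
    each of which only knows about its own words.
    """
    largest = max(frequencies[word] for word in subset)
    return {word: frequencies[word] / largest for word in subset}


subsets = exclusive_subsets(sets)
per_subset = {
    membership: relative_sizes(words, word_frequency)
    for membership, words in subsets.items()
}

for membership, sizes in sorted(per_subset.items()):
    print(membership, dict(sorted(sizes.items())))
```

Under this model, a, b, and c keep their 1:2:3 ratio because they share a subset, while d and e each end up as the largest word of their own subset, even though their frequencies differ by a factor of 1000.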

I have tried addressing this issue in the precursor to this library. However, even though this feature ended up being responsible for the majority of the code base, and resulted in ugly visualizations, it still caused issues that weren't easily fixed.

Ultimately, this is the reason why I decided not to support the feature in this library. However, you are more than welcome to try to implement the feature yourself. If you do manage to get it working properly, pull requests are always welcome.

Sorry for not bringing better news, Paul

PUW149 commented 3 months ago

Thanks Paul, your suggestion for a monkey-patch is very helpful (I'm not too worried about relative scaling between subsets), so it was good news after all! Best, Simon