paulbrodersen / matplotlib_set_diagrams

Draw Euler diagrams and Venn diagrams with Matplotlib.
GNU General Public License v3.0

Align circles in X/Y direction #5

Open moi90 opened 1 month ago

moi90 commented 1 month ago

Currently, set circles are free to move in any direction during layout optimization. However, sometimes this additional degree of freedom is not needed, e.g. for example 5 in the docs:

[Image: sphx_glr_plot_05_cost_function_objectives_001]

It would be nice to have the option to restrict the optimization so that the circle centers stay on the X axis (or the Y axis). In my opinion, this could look much cleaner in some cases.

I have played around with EulerDiagram._optimize_layout and it seems to be enough to just set the Y (or X) component of the origins array to zero (in the cost function and after the optimization). (I also tried to introduce a penalty for y values, but that does not lead to a complete axis alignment.)

[Image: grafik]

When enabling this, it would also be desirable to place the subset labels on the same axis (x in this case). (Unlike in my example.) And maybe place the set labels on the other axis (top or bottom in my example).

I hope I could get across what I mean...

Once more, it is a pleasure to work with your code!

paulbrodersen commented 1 month ago

> It would be nice to have the option to restrict the optimization so that the circle centers stay on the X axis (or the Y axis). In my opinion, this could look much cleaner in some cases.

My first reaction is to resist the temptation to support special cases such as the one that you outlined to ensure that the code base remains as simple as possible -- it makes maintenance easier and on-boarding of new contributors such as yourself possible. However, a write-up of your approach could make a great (advanced) example for how to tweak the layout in the documentation. Do you want to have a stab at that?

> When enabling this, it would also be desirable to place the subset labels on the same axis (x in this case). (Unlike in my example.)

The placement of the subset labels is currently bugged / inaccurate. They should be placed at the point of inaccessibility, but clearly aren't in your (last) example. In your example, this should result in the subset labels being placed on the x-axis. I suspect this is a precision issue in shapely, but I haven't had/made the time to look into it further.

> And maybe place the set labels on the other axis (top or bottom in my example).

I do dislike how the set labels are currently placed. Basically, I draw a line from the center of mass of the whole diagram through the origin of each set and then place the label on that line just outside the set that is being labelled. While being a decent heuristic that yields OK results in 80-90% of the cases, this leaves a bit to be desired. I have made notes on two-and-a-half other layout ideas.

  1. left-right: Place the set labels aligned with the y-coordinate of the corresponding circle centers either left or right of the diagram, whichever is closer.
  2. top-bottom: Place the set labels aligned with the x-coordinate of the corresponding circle centers either at the top or the bottom.
  3. least ambiguous: For sets that are not strict subsets of another set, find the center of the "outside" arc and place the label there just outside of the diagram. For sets that are strict subsets, combine the set label with the subset label and place the combined label at the point of inaccessibility.

The last idea is not quite sufficient yet, though, as it does not handle sets that aren't strict subsets of any single set but are covered by the union of two or more other sets (e.g. {a, b}, {b, c}, {c, d}).
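The current heuristic and the left-right alternative described above can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the function names, the `offset` parameter, and the use of a caller-supplied point as the diagram's center of mass are assumptions made for the example.

```python
import math


def radial_label_position(origin, radius, center_of_mass, offset=0.1):
    """Current heuristic: place the label on the ray from the diagram's
    center of mass through the circle origin, just outside the circle."""
    dx = origin[0] - center_of_mass[0]
    dy = origin[1] - center_of_mass[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        # Degenerate case (origin coincides with the center of mass):
        # fall back to placing the label straight above the circle.
        dx, dy, norm = 0.0, 1.0, 1.0
    scale = (radius + offset) / norm
    return origin[0] + dx * scale, origin[1] + dy * scale


def left_right_label_position(origin, origins, radii, offset=0.1):
    """Idea 1: keep the y-coordinate of the circle center and place the
    label just left or right of the whole diagram, whichever is closer."""
    left = min(x - r for (x, _), r in zip(origins, radii)) - offset
    right = max(x + r for (x, _), r in zip(origins, radii)) + offset
    x = left if origin[0] - left <= right - origin[0] else right
    return x, origin[1]


origins = [(-1.0, 0.0), (1.0, 0.0)]
radii = [1.5, 1.5]
for origin, radius in zip(origins, radii):
    print(
        radial_label_position(origin, radius, center_of_mass=(0.0, 0.0)),
        left_right_label_position(origin, origins, radii),
    )
```

The top-bottom variant is the same computation with the roles of x and y swapped.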

moi90 commented 1 month ago

> My first reaction is to resist the temptation to support special cases such as the one that you outlined to ensure that the code base remains as simple as possible -- it makes maintenance easier and on-boarding of new contributors such as yourself possible. However, a write-up of your approach could make a great (advanced) example for how to tweak the layout in the documentation. Do you want to have a stab at that?

I get that. However, please bear in mind: This would only be one additional parameter to a Diagram class and two small additions to the _optimize_layout method:

```python
import warnings
from typing import Mapping, Tuple

import numpy as np
from numpy.typing import NDArray
from scipy.optimize import NonlinearConstraint, minimize
from scipy.spatial.distance import pdist, squareform

from matplotlib_set_diagrams import EulerDiagram


class MyEulerDiagram(EulerDiagram):
    def _optimize_layout(
        self,
        subset_sizes: Mapping[Tuple[bool], int | float],
        origins: NDArray,
        radii: NDArray,
        objective: str,
        verbose: bool,
    ) -> Tuple[NDArray, NDArray]:
        """Optimize the placement of circle origins according to the
        given cost function objective.
        """
        desired_areas = np.array(list(subset_sizes.values()))

        def cost_function(flattened_origins):
            origins = flattened_origins.reshape(-1, 2)
            ## NOTE: Add this line:
            origins[:, 1] = 0  # clamp all circle centers to the x-axis
            subset_areas = np.array(
                [
                    geometry.area
                    for geometry in self._get_subset_geometries(
                        subset_sizes.keys(), origins, radii
                    ).values()
                ]
            )
            if objective == "simple":
                cost = subset_areas - desired_areas
            elif objective == "squared":
                cost = (subset_areas - desired_areas) ** 2
            elif objective == "relative":
                with warnings.catch_warnings():
                    warnings.filterwarnings(
                        "ignore", message="divide by zero encountered in scalar divide"
                    )
                    cost = [
                        1 - min(x / y, y / x) if x != y else 0.0
                        for x, y in zip(subset_areas, desired_areas)
                    ]
            elif objective == "logarithmic":
                cost = np.log(subset_areas + 1) - np.log(desired_areas + 1)
            elif objective == "inverse":
                eps = 1e-2 * np.sum(desired_areas)
                cost = 1 / (subset_areas + eps) - 1 / (desired_areas + eps)
            else:
                msg = f"The provided cost function objective is not implemented: {objective}."
                msg += "\nAvailable objectives are: 'simple', 'squared', 'logarithmic', 'relative', and 'inverse'."
                raise ValueError(msg)
            return np.sum(np.abs(cost))

        # constraints:
        eps = np.min(radii) * 0.01
        lower_bounds = np.abs(radii[np.newaxis, :] - radii[:, np.newaxis]) - eps
        lower_bounds[lower_bounds < 0] = 0
        lower_bounds = squareform(lower_bounds)
        upper_bounds = radii[np.newaxis, :] + radii[:, np.newaxis] + eps
        upper_bounds -= np.diag(
            np.diag(upper_bounds)
        )  # squareform requires zeros on diagonal
        upper_bounds = squareform(upper_bounds)

        def constraint_function(flattened_origins):
            origins = np.reshape(flattened_origins, (-1, 2))
            return pdist(origins)

        distance_between_origins = NonlinearConstraint(
            constraint_function, lb=lower_bounds, ub=upper_bounds
        )
        result = minimize(
            cost_function,
            origins.flatten(),
            method="SLSQP",
            constraints=[distance_between_origins],
            options=dict(disp=verbose, eps=eps),
        )
        if not result.success:
            feedback = "Could not optimise layout for the given subsets. Try a different cost function objective."
            warnings.warn(f"{result.message}. {feedback}")
        origins = result.x.reshape((-1, 2))
        ## NOTE: Add this line
        origins[:, 1] = 0  # clamp the final layout to the x-axis as well
        return origins, radii
```

> The placement of the subset labels is currently bugged / inaccurate. They should be placed at the point of inaccessibility, but clearly aren't in your (last) example. In your example, this should result in the subset labels being placed on the x-axis. I suspect this is a precision issue in shapely, but I haven't had/made the time to look into it further.

Yes, I stumbled upon the concept of the POI in the code. And you're right! In SetDiagram._draw_subset_labels, I have to decrease the tolerance of polylabel to 0.0001 to get the correct (horizontal) alignment. Do you think there is an automatic way to select the tolerance? Or is it OK to just always use a ridiculously small number?

```python
def _draw_subset_labels(
    self,
    subset_labels: Mapping[Tuple[bool], str],
    subset_geometries: Mapping[Tuple[bool], ShapelyPolygon],
    subset_colors: Mapping[Tuple[bool], NDArray],
    ax: plt.Axes,
) -> dict[Tuple[bool], plt.Text]:
    """Place subset labels centred on the point of inaccesibility (POI)
    of the corresponding polygon.
    """
    subset_label_artists = dict()
    tolerance = 0.0001
    for subset, label in subset_labels.items():
        geometry = subset_geometries[subset]
        if geometry.area > 0:
            if isinstance(geometry, ShapelyPolygon):
                poi = polylabel(geometry, tolerance)
            elif isinstance(geometry, ShapelyMultiPolygon):
                # use largest sub-geometry
                poi = polylabel(max(geometry.geoms, key=lambda x: x.area), tolerance)
            else:
                raise TypeError(
                    f"Shapely returned neither a Polygon or MultiPolygon but instead {type(geometry)} object!"
                )
            fontcolor = (
                "black" if rgba_to_grayscale(*subset_colors[subset]) > 0.5 else "white"
            )
            subset_label_artists[subset] = ax.text(
                poi.x, poi.y, label, color=fontcolor, va="center", ha="center"
            )
    return subset_label_artists
```

Here is my code to replicate this figure:

```python
import matplotlib.pyplot as plt

# MyEulerDiagram is the EulerDiagram subclass with the clamped layout.
fig, ax = plt.subplots()

subset_labels = {
    # A*, A, P
    (1, 1, 1): r"$P \wedge A$",
    (1, 0, 1): r"$P \wedge A*$",
    # (1, 1, 0): r"$A \wedge \neg P$",
    # (1, 0, 0): r"$A \wedge \neg P$",
    (0, 0, 1): r"$P \setminus A*$",
}

MyEulerDiagram(
    {
        # A*, A, P
        (1, 1, 1): 1,
        (1, 0, 1): 1,
        (1, 1, 0): 1,
        (1, 0, 0): 1,
        (0, 0, 1): 1,
    },
    set_labels=["A*", "A", "P"],
    subset_label_formatter=lambda subset, size: subset_labels.get(subset, ""),
    ax=ax,
)
plt.show()
```

> For sets that are strict subsets, combine the set label with the subset label and place the combined label at the point of inaccessibility.

Hmm. So in my example, the $A$ would move inside the reddish crescent? I would then mistake it as an annotation for $A \setminus P$... (Unless the above condition is extended: no other than the "parent" set overlap this set.) I would always place set labels outside of the set. Beyond that, is there any convention how set labels and subset labels can be distinguished (bold font, same color)? Currently, $A$ is placed inside the blue ring, so it could also be a subset label.

paulbrodersen commented 1 month ago

> This would only be one additional parameter to a Diagram class and two small additions to the _optimize_layout method:

Presumably, _initialize_layout would also have to be changed....

> Or is it OK to just always use a ridiculously small number?

I think we want to use the smallest number that doesn't cause a substantial increase in running time. I have run some preliminary tests using your example with different tolerance values (albeit without your changes to the optimization):

| tolerance | time | increase |
|-----------|------|----------|
|         1 | 1.66 | 0 %      |
|      0.01 | 1.73 | 4 %      |
|     0.001 | 1.88 | 13 %     |
|    0.0001 | 2.57 | 55 %     |
I think a 5% increase is negligible, a 13% increase is tolerable; a 55% increase seems too much to be a sensible default given that I can't see much improvement in any of my test cases beyond a tolerance of 0.01. However, to accommodate cases such as yours, we could expose the tolerance parameter as a global variable. Then you could set a lower value using the following syntax:

import matplotlib_set_diagrams as msd
msd._diagram_classes.POLYLABEL_TOLERANCE = 1e-4
msd.EulerDiagram(...)

Have a look at the commit I linked above.
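The percentages in the table, and the rule of picking the smallest tolerance that stays within an acceptable slowdown, can be reproduced with a few lines of Python. The helper names and the `budget_percent` parameter are hypothetical and exist only for this sketch; the timings are the ones reported in the table.

```python
def relative_increase(timings, baseline_tolerance=1):
    """Return the running-time increase in percent relative to the baseline."""
    baseline = timings[baseline_tolerance]
    return {tol: 100 * (t / baseline - 1) for tol, t in timings.items()}


def smallest_tolerance_within_budget(timings, budget_percent):
    """Pick the smallest tolerance whose slowdown does not exceed the budget."""
    increases = relative_increase(timings)
    candidates = [tol for tol, inc in increases.items() if inc <= budget_percent]
    return min(candidates)


# tolerance -> measured time, copied from the table
timings = {1: 1.66, 0.01: 1.73, 0.001: 1.88, 0.0001: 2.57}

for tol, inc in relative_increase(timings).items():
    print(f"tolerance={tol:g}: +{inc:.0f} %")

print(smallest_tolerance_within_budget(timings, budget_percent=5))
```

With a 5 % budget this selects 0.01; raising the budget to 15 % would select 0.001.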

paulbrodersen commented 1 month ago

> Hmm. So in my example, the $A$ would move inside the reddish crescent? I would then mistake it as an annotation for $A \setminus P$... (Unless the above condition is extended: no other than the "parent" set overlap this set.) I would always place set labels outside of the set. Beyond that, is there any convention how set labels and subset labels can be distinguished (bold font, same color)? Currently, $A$ is placed inside the blue ring, so it could also be a subset label.

Yeah, these are all valid points. I do like the idea of styling the subset and set labels differently.

paulbrodersen commented 1 month ago

> Beyond that, is there any convention how set labels and subset labels can be distinguished (bold font, same color)?

Matplotlib has a similar issue with axes labels and tick labels. They use font size as the distinguishing factor ("small" for tick labels; "large" for axis labels). I have copied their approach for the time being. Not perfect, but unobtrusive:

[Image: test_EulerDiagram]

moi90 commented 1 month ago

> Presumably, _initialize_layout would also have to be changed....

That's what I thought at first, too. I tried an initialization where all circles are placed on a line, but it didn't make a difference in my case. Also, when just clamping y=0, it makes no difference whether the origins were initially placed on a circle or on a line.

> I think we want to use the smallest number that doesn't cause a substantial increase in running time. [...] However, to accommodate cases such as yours, we could expose the tolerance parameter as a global variable. Then you could set a lower value using the following syntax:

I like that! Or would it make sense to have POLYLABEL_TOLERANCE as a class attribute of SetDiagram? Then, one would not have to import it separately from a private sub-module.

However, if we already know that all the subset labels must land on the x axis (because all origins are on the x axis as well), a simpler algorithm could be used: calculate the intersection of the subset geometry with the x axis and select the center of the resulting line segment. This would (drastically) increase speed in this special case and would not require increasing the precision in the general case.
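A minimal sketch of that idea, assuming all circle centers lie on the x axis: each circle then cuts the axis in the interval [x - r, x + r], so a subset's footprint on the axis is the intersection of the intervals of its member sets minus the intervals of all other sets. The function name `subset_label_x` and the choice of the longest remaining segment are assumptions for this example, not part of the library. Note that a subset region can exist without touching the axis at all, in which case the sketch returns None and a general method such as polylabel would still be needed.

```python
def subset_label_x(subset, origins_x, radii):
    """Return the x-coordinate of a label for `subset`, or None if the
    subset region does not intersect the x-axis.

    `subset` is a tuple of 0/1 flags (one per set); `origins_x` holds the
    x-coordinates of the circle centers, all of which have y = 0.
    """
    intervals = [(x - r, x + r) for x, r in zip(origins_x, radii)]
    included = [iv for flag, iv in zip(subset, intervals) if flag]
    excluded = sorted(iv for flag, iv in zip(subset, intervals) if not flag)
    if not included:
        return None

    # Intersection of the intervals of all sets the subset belongs to.
    lo = max(a for a, _ in included)
    hi = min(b for _, b in included)
    if lo >= hi:
        return None

    # Sweep left to right, cutting out the intervals of all other sets.
    segments = []
    cursor = lo
    for a, b in excluded:
        if b <= cursor:
            continue
        if a >= hi:
            break
        if a > cursor:
            segments.append((cursor, a))
        cursor = max(cursor, b)
    if cursor < hi:
        segments.append((cursor, hi))
    if not segments:
        return None

    # Use the midpoint of the longest remaining segment.
    a, b = max(segments, key=lambda s: s[1] - s[0])
    return (a + b) / 2


origins_x = [-1.0, 1.0]
radii = [1.5, 1.5]
for subset in [(1, 0), (1, 1), (0, 1)]:
    print(subset, subset_label_x(subset, origins_x, radii))
```

The label would then be drawn at (x, 0). The midpoint of the chord is not guaranteed to coincide with the point of inaccessibility of the full two-dimensional region, so this is an approximation that trades accuracy for speed.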

> I do like the idea of styling the subset and set labels differently. [...] Matplotlib has a similar issue with axes labels and tick labels. They use font size as the distinguishing factor

That looks good in your examples. In my example, there is the problem that the label for $A$ has to be placed inside $A^*$... A radically different approach would be to annotate the set colors in a figure legend.

For both colored set labels (suggested earlier) and legends, it is necessary to access the set colors. However, they are currently not stored in an attribute but only used in SetDiagram.__init__ and then forgotten. Would you be open to changing that? (That would also help in the case where the default colors are used.)

paulbrodersen commented 1 month ago

> In my example, there is the problem that the label for $A$ has to be placed inside $A^*$...

I haven't messed with the set label placement, yet. It obviously still needs work, despite the difference in styling to make set labels less similar to subset labels.

paulbrodersen commented 1 month ago

> Also, when just clamping y=0, it makes no difference whether the origins were initially placed on a circle or on a line.

I guess during the first iteration, the circles are moved onto the x-axis and then remain there.

paulbrodersen commented 1 month ago

> Or would it make sense to have POLYLABEL_TOLERANCE as a class attribute of SetDiagram? Then, one would not have to import it separately from a private sub-module.

I have made polylabel_tolerance an argument of _draw_subset_labels, with a default value of 0.01. The default can thus be changed by subclassing any of the diagram classes.

import matplotlib.pyplot as plt
from matplotlib_set_diagrams import EulerDiagram

class MyCustomEulerDiagram(EulerDiagram):
    def _draw_subset_labels(
        self, subset_labels, subset_geometries, subset_colors, ax, polylabel_tolerance=1.):
        return super()._draw_subset_labels(
            subset_labels, subset_geometries, subset_colors, ax, polylabel_tolerance)

subset_sizes = {
    (1, 0, 0) : 1,
    (1, 1, 0) : 1,
    (1, 1, 1) : 1,
    (0, 1, 0) : 1,
    (0, 1, 1) : 1,
    (0, 0, 1) : 0,
}

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_title("Strict polylabel tolerance")
EulerDiagram(subset_sizes, ax=ax1)
ax2.set_title("Relaxed polylabel tolerance")
MyCustomEulerDiagram(subset_sizes, ax=ax2)
plt.show()

[Image: Figure_1]

I am still hesitant to make it an argument in the class initialization, as I suspect 0.01 is good enough for most cases, while explaining to users what the parameter does and how to choose a good value would be very involved.

moi90 commented 4 weeks ago

This sounds very good to me!