matplotlib / matplotlib

matplotlib: plotting with Python
https://matplotlib.org/stable/

Autoscaling has fundamental problems #7413

Closed: efiring closed this issue 3 years ago

efiring commented 7 years ago

#6915 brings to light two problems with autoscaling:

1) It looks very inefficient: every plotting method in _axes adds an artist to the axes and then calls autoscale_view, occasionally with arguments. autoscale_view then does a complete autoscaling operation, going through all of the artists that have been added up to that point. Logically, it seems like the autoscaling should be done only before a draw operation, not every time an artist is added.

2) Beyond the apparent inefficiency, it doesn't work right for collections. add_collection calls self.update_datalim(collection.get_datalim(self.transData)) to get dataLim. This uses the present transData to calculate the size in data units of objects that have sizes and/or positions that may be specified in screen or axes units. Then the subsequent call to autoscale_view uses those positions to modify the view limits. But this changes transData so that the intended result cannot be achieved--when drawn, the sizes and locations in data units will not be what they were calculated to be when the view limits were set. The mismatch will grow as additional artists are added, each one potentially changing the data limits and the view limits. Usually we get away with this with no one noticing, but not always. #6915 shows that subsequently changing the scale of an axis, e.g. linear to log, can wreck the plot.
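For concreteness, a minimal sketch of the kind of failure described in #6915 (the exact values here are only illustrative):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Marker positions are in data units, but the marker areas (s) are in
# points^2, i.e. screen units, so get_datalim converts them using the
# *current* transData.
ax.scatter([1, 10, 100], [1, 10, 100], s=500)
# Changing the scale afterwards invalidates that conversion, so the
# autoscaled limits no longer enclose the markers as intended.
ax.set_yscale('log')
plt.show()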

tacaswell commented 7 years ago

The first could be fixed by adding (yet more) state to the Axes in the form of an 'autoscale the next time you draw' flag? I suspect that the 'autoscale in plot method' approach makes sense from the POV of interactive use, where almost every command is followed by a draw anyway.
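A toy sketch of that flag (the names are invented, not actual matplotlib attributes):

class ToyAxes:
    def __init__(self):
        self._artists = []
        self._autoscale_on_next_draw = False

    def add_artist(self, artist):
        # Just record the artist and defer the autoscale work.
        self._artists.append(artist)
        self._autoscale_on_next_draw = True

    def draw(self):
        if self._autoscale_on_next_draw:
            # One pass over all artists, instead of one pass per add.
            self._autoscale_view()
            self._autoscale_on_next_draw = False
        # ... rendering would happen here

    def _autoscale_view(self):
        print("autoscaling over", len(self._artists), "artists")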

The second sounds like we need to clear a cache whenever we change the scale (and maybe the view limits?).

efiring commented 7 years ago

No, I think the second is a much more fundamental problem--more along the lines of "you can't get there from here". A constraint engine might be needed to get it right. I need to spend more time thinking about it to be sure, though.

anntzer commented 7 years ago

Shouldn't the autoscaling "simply" be done at draw time (if autoscaling is on at that time, of course)? (In interactive mode, each command is followed by a draw, so it's just the same.) At that time, you know what your scale is, so you can 1) map each artist to axes coordinates -- including, say, collections whose offsets are in data coordinates but whose own shapes are effectively in axes coordinates -- and 2) find the limits and apply the margins (note that we also thus drop the need for hacks such as minpos (cf. #7410) and can properly support e.g. the logit scale (https://github.com/matplotlib/matplotlib/pull/3753#issuecomment-62016765)).

I think this should solve all the issues? (Not that it'd be that easy to implement such a large refactor.)

efiring commented 7 years ago

Yes, that's part of my point: draw time seems like the right time, so the strategy would be to accumulate all necessary information along the way, and then use all of it at once. I think this still requires an algorithm change, though; I'm not sure that the present update_datalim is doing the right thing, given that it is using the existing transData to calculate sizes in data coordinates, which are then used to modify the dataLim and transData, at which point the sizes in data coordinates are no longer what they were assumed to be. It seems like either an explicit inverse calculation or an iterative approach is needed. But I might be confused.

anntzer commented 7 years ago

No, my strategy is to not accumulate any information along the way. You simply wait until draw time when the correct transData is known, and do all the calculations at that point (this is assuming of course that the limits have not been explicitly set).

You will then need an explicit inverse when converting the limits-including-margins back to data space, but that's not a problem as scales are explicitly invertible.
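For example, for a log scale (g = log10, with inverse 10**x), applying a 5% margin in the transformed space and then inverting back to data space could look like this (the numbers are only illustrative):

import numpy as np

data_min, data_max = 1.0, 1000.0
margin = 0.05
# Forward transform: work in the scale's own (log) space.
lo, hi = np.log10(data_min), np.log10(data_max)
pad = margin * (hi - lo)
# Explicit inverse back to data space.
new_min, new_max = 10 ** (lo - pad), 10 ** (hi + pad)
print(new_min, new_max)   # roughly 0.708 and 1412.5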

efiring commented 7 years ago

I think you are misunderstanding what I am saying, and the nature of the problem. It is a nonlinear minimization problem. Mathematically, it cannot be solved with a simple single-pass algorithm like the present one. I don't have time to write this out in more detail right now, but we can return to it.

anntzer commented 7 years ago

OK, I'll wait for your writeup then.

efiring commented 7 years ago

Here is an attempt to show the structure of the problem for a single axis. It is a hybrid of math and code notation.
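Roughly (the exact symbol conventions here are only a sketch; M1 and M2 are the margins at the two ends, as fractions of the axes length):

    Given, for each of the N plot elements,
        x[i]   its position in data coordinates
        S[i]   the half-width of its symbol, as a fraction of the axes length
    and the data-to-axes transform F, which depends on the unknown view
    limits x0 and x1 through
        X[i] = F(x[i]; x0, x1)
    find x0 and x1 such that
        min over i of (X[i] - S[i]) = M1
        max over i of (X[i] + S[i]) = 1 - M2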

This problem needs to be solved simultaneously for all N plot elements involved in the autoscaling. I think that the only way to do it is iteratively. In the special case where all S[i] == 0, the problem is easier because both sides can be inverse-transformed (x[i] = F^-1(X[i]; x0, x1)), so the transform is only applied to the right-hand side (scalars), and the left-hand side can be evaluated to a scalar. Then the conditions collapse to two equations in two unknowns, and iteration is not required. Alternatively, we note that because the transform has an inverse, the extreme values of x[i] occur at the same two indices as the extreme values of X[i], again collapsing the min/max conditions down to two equations in two unknowns.

I think the solution to this problem for now is to stop trying to be so ambitious. Let's disable the present non-functional attempt to solve the optimization, and instead rely on the simple mechanism of specified margins to reduce the incidence of cases where autoscaling fails to include all parts of all symbols.

tacaswell commented 7 years ago

I am :+1: on doing margins completely in dataspace and accepting clipped symbols.

How much of this logic can we push down into the scales / is already pushed into the scales? If we could delegate all of the margin computation there, it would greatly reduce how badly we have to special-case log / symlog / etc., and would let third-party scales (attn @phobson) deal with margins (up to just ignoring them) in whatever way makes the most sense for that scale.
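A hypothetical sketch of what such a per-scale hook might look like (this method does not exist in matplotlib; the name and placement are invented):

import numpy as np
from matplotlib.scale import LogScale

class MarginAwareLogScale(LogScale):
    # Hypothetical hook: expand a data interval by a fractional margin
    # in the scale's own (log) space, so the padding looks even on screen.
    def expand_by_margin(self, vmin, vmax, margin):
        trans = self.get_transform()
        lo, hi = trans.transform(np.array([vmin, vmax]))
        pad = margin * (hi - lo)
        return tuple(trans.inverted().transform(np.array([lo - pad, hi + pad])))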

anntzer commented 7 years ago

I don't think the situation is that bad.

Indeed, the transform f is not arbitrary; unless I am mistaken, it is always of the form X[i] = (g(x[i]) - g(x0)) / (g(x1) - g(x0)) where g does not depend on the bounds (e.g. g(x) = log(x) for log scale).
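A quick numerical check of that claim for the log scale (the particular limits and points are only illustrative):

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_xlim(1, 1000)                       # x0, x1
x = np.array([1.0, 10.0, 100.0, 1000.0])
pts = np.column_stack([x, np.ones_like(x)])
# Axes-fraction x coordinate as matplotlib computes it...
X_mpl = ax.transAxes.inverted().transform(ax.transData.transform(pts))[:, 0]
# ...and via the formula with g = log.
g = np.log
X_formula = (g(x) - g(1.0)) / (g(1000.0) - g(1.0))
print(np.allclose(X_mpl, X_formula))       # expected: True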

Thus, let us write y[i] = g(x[i]), m = g(x0), t = 1 / (g(x1) - g(x0)), so that X[i] = t * y[i] - t * m. As we know the inverse of g, our problem is equivalent to finding m and t such that

    min (t * y[i] - t * m - S[i]) = M1          (1)
    max (t * y[i] - t * m + S[i]) = 1 - M2      (2)

Note that the t * m term is the same everywhere, so we can get rid of it by subtracting the equations, and first solve the resulting equation in t,

    (max (t * y[i] + S[i])) - (min (t * y[i] - S[i])) = 1 - M1 - M2    (3)

(we can later retrieve m by plugging t back into (1)).

To be honest, I had missed the fact that different elements could have different S; in my mind, equation (3) simplified further to

    max (t * y[i]) - min (t * y[i]) = 1 - M1 - M2

which can immediately be solved simply by finding the lowest and highest y's.

If we want to keep things simple, as suggested by @tacaswell, I would rather set S to be the largest of all sizes; this will ensure that symbols are never clipped, and will compute the margins correctly in the common case where all symbols have the same size.

Let's come back to the general case of eq. (3). The left-hand side, seen as a function of t, is a piecewise linear function; the max and the min terms each have at most N - 1 "break points"; their difference has at most 2(N - 1) break points. If we can find these break points (together with the knowledge of which affine function is extremal on each interval), we can then solve the linear equation on each interval and finally find the solution.

A quick googling suggests that these break points can be found in O(N log N) by a variant of merge sort (Handbook of Combinatorial Optimization, Du & Pardalos, p. 455) (this looks like it should be a "classical" optimization problem anyway).
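Not the O(N log N) break-point method, but just to illustrate that (3) is well posed whenever 1 - M1 - M2 > 2 * max(S): a brute-force sketch that solves (3) by bisection (for t >= 0 the left-hand side is nondecreasing in t) and then recovers m from (1). Everything here, including the example values, is illustrative only.

import numpy as np

def solve_autoscale(y, S, M1=0.05, M2=0.05, tol=1e-12):
    y = np.asarray(y, dtype=float)
    S = np.asarray(S, dtype=float)
    target = 1 - M1 - M2
    f = lambda t: (t * y + S).max() - (t * y - S).min()   # left-hand side of (3)
    if f(0.0) >= target:
        raise ValueError("symbols too large for the requested margins")
    hi = 1.0
    while f(hi) < target:        # bracket the root (assumes the y are not all equal)
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < target else (lo, mid)
    t = 0.5 * (lo + hi)
    m = ((t * y - S).min() - M1) / t                      # from (1)
    return t, m

# e.g. for a log10 scale, y = log10(x), so x0 = 10**m and x1 = 10**(m + 1/t):
t, m = solve_autoscale(np.log10([1, 10, 100]), [0.02, 0.05, 0.10])
print(10**m, 10**(m + 1/t))      # roughly 0.66 and 242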

efiring commented 7 years ago

1) Implementing any version of this, whether with variable S or with a maximal S, requires a single calculation done at draw time with all x and S (many of which may be zero) stored. (Yes, it can be broken up into two parts, one for the cases of non-zero S and another for the trivial case of zero S, but this doesn't change the need to at least gather all of the non-zero S cases for processing in one go.)

2) The "use the maximum S" strategy would give reasonable results in most uses of scatter, but would fail if the user included a large patch (via Patch or a Collection) in a data location not near the data limits.

3) Anything along these lines is much too complicated to try to get into v2.x. For 2.0 we need to either accept the present flawed algorithms, or back out their usage--everywhere, or perhaps only in some instances.

4) Even for later versions, we need to think in terms of cost/benefit, risk/reward, for introducing fancier algorithms. It's worth considering in the larger context of major refactoring, unification of scales and norms, use of a layout engine, etc. But it might be relatively low priority, and it might be an incremental improvement that should come after several others.

5) For now, all of this is a distraction from what we desperately need to do: release a "good enough" 2.0, and then some real improvements in a 2.1, again in a 2.2, etc.

anntzer commented 7 years ago

I agree with pushing out a "good enough" release. What about setting S = min(S)? This would, again, cover the case where all the markers have the same size (which is likely by far the most common case), with "relatively" little complexity.

efiring commented 7 years ago

That still requires introducing a new algorithm and strategy for handling the autoscaling, pushing it to the draw stage. (Probably this should be done anyway, but it's not an immediate priority.) And much of the point of scatter is to handle cases where size is variable. Same with quiver. Variable size is a common case, and it is the reason the present flawed approach was taken.

QuLogic commented 7 years ago

Indeed, the transform f is not arbitrary; unless I am mistaken, it is always of the form X[i] = (g(x[i]) - g(x0)) / (g(x1) - g(x0)) where g does not depend on the bounds (e.g. g(x) = log(x) for log scale).

Surely not true for polar plots or projection examples or Cartopy?

anntzer commented 7 years ago

I think our discussion is only applicable to scales and not projections, as per the definition of http://matplotlib.org/devel/add_new_projection.html:

Separable transformations, working on a single dimension, are called “scales”, and non-separable transformations, that handle data in two or more dimensions at a time, are called “projections”.

I'm not sure things can be done in a non-ad-hoc manner for arbitrary projections; for example, you'll probably generally want your polar plot to cover the entire range 0..2pi in theta even if all the data is in a smaller range (or, if you do want to restrict yourself to a smaller range, it is probably a range that you explicitly set).

efiring commented 7 years ago

Possible partial fix for 2.0:

This is just a quick thought at this point.

efiring commented 7 years ago

Looking again at this, I conclude we should do nothing about it for v2.0. Even the partial fix sketched above would be somewhat intrusive, requiring adding attributes to collections to track whether and how to handle them in relim. As far as I can see, the present algorithm has been in place for a very long time, so it is not a matter of breakage by 2.x. I will move the milestone; I think that even 2.1 is optimistic for coming up with a good, clean solution.

jklymak commented 5 years ago

I closed it, but https://github.com/matplotlib/matplotlib/issues/13639 and https://github.com/mwaskom/seaborn/issues/1641 are great examples of the second part of this problem.

jklymak commented 5 years ago

#13640 tries to fix one of the issues.

I'm a little befuddled about what the auto-lim behaviour of scatter is supposed to be; isn't it supposed to make the limits big enough to see the whole marker? Or is that part of the bug?

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 1)
axs[0].scatter([0, 1], [0, 1], s=10)
axs[1].scatter([0, 1], [0, 1], s=3000)

plt.show()

[Figure_3: output of the snippet above]

adeak commented 5 years ago

@jklymak I can't say I understand the full extent of the issue, but I think scatter should account for the marker size and it doesn't (or not fully). Quoting @efiring on a predecessor of this issue:

The fundamental problem with collections is that they operate with two or more transforms, so that in the scatter case, for example, they can draw objects of a given physical size at locations given in data coordinates. Autoscaling therefore requires figuring out the data range required for the objects to fit, and this depends on the data transform--and in a particularly tricky way when it is a log transform. [...] It is handled correctly by plot because that yields a Line2D object which is much simpler than a collection, and has a hook that triggers the necessary recalculations. The assumption is that marker sizes are small enough that they don't have to be taken into account in the autoscaling.

But if I understand correctly, a few Monte Carlo simulations show that enlarging the markers affects the choice of (wrong) default limits (compare one with the other).

Finally, I don't know how much has changed about all this in the past 2 years...
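One way to see the "two transforms" concretely (the comments state the expected output):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
pc = ax.scatter([0, 1], [0, 1], s=100)
# The offsets (marker positions) go through transData...
print(pc.get_offset_transform() is ax.transData)   # expected: True
# ...but the marker paths themselves do not; their size is in points,
# independent of the data limits.
print(pc.get_transform() is ax.transData)          # expected: False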

jklymak commented 5 years ago

Another way of expressing the problem: if I make a marker that is 10" wide and my figure is only 5" wide, this algorithm is doomed to failure, because no adjustment of the xlim of the axes will ever get that marker to fit in the axes. If you throw a constraint solver at this, it will just say you have incompatible constraints and die.

There are presently only four collection-creating methods that use autolims, so far as I can see: scatter, quiver, barbs, and broken_barh.

There are two separate cases here: 1) the collection offsets and shapes are both drawn in data coordinates (quiver, barbs, broken_barh); 2) the collection offsets are in data units, but the shapes are sized in physical units (scatter).

I think 1) is easy - you just find the min and max in data space and you have your lims. For 2) I think we should just assume the markers are normal sized, and allow big ones to clip (i.e. just use the offsets to get the auto limits).

I think Collection.get_datalim could check whether the transform (for the shapes) is transData or not and decide whether to use it. Similarly, if the offset transform is not transData, it should not try to autoscale on that object. That's a simple solution that will return consistent results, rather than attempting something magical which is often impossible and which just gives nonsense when it bails...
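A rough sketch of that rule as a standalone helper (hypothetical code, not the actual Collection.get_datalim implementation):

import numpy as np
from matplotlib import transforms

def rough_datalim(collection, ax):
    offset_trf = collection.get_offset_transform()
    if not offset_trf.contains_branch(ax.transData):
        # Offsets are not in data coordinates: nothing to autoscale on.
        return transforms.Bbox.null()
    if collection.get_transform().contains_branch(ax.transData):
        # The shapes are drawn in data coordinates too (quiver, broken_barh, ...):
        # their full extent is well defined, so use it.
        return collection.get_datalim(ax.transData)
    # Otherwise (scatter): autoscale on the offsets only, and let
    # oversized markers clip.
    offsets = np.asarray(collection.get_offsets())
    return transforms.Bbox.from_extents(
        offsets[:, 0].min(), offsets[:, 1].min(),
        offsets[:, 0].max(), offsets[:, 1].max())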

ImportanceOfBeingErnest commented 5 years ago

I think the complete marker is supposed to be part of the data limits. However, get_datalim only seems to work if you draw the figure first. That is, the following works better, because the figure is drawn before obtaining the data limits:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

sc = ax.scatter([0, 1], [0, 1], s=8000)

fig.canvas.draw()
ax.update_datalim(sc.get_datalim(ax.transData))
ax.autoscale()

plt.show()

[image: output of the snippet above]

However, even in this case the marker is only shown completely due to the margins adding some extra space.

jklymak commented 5 years ago

Right, and if you keep calling it, xlim keeps changing, converging on a solution if there is a solution...

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

sc = ax.scatter([0, 1], [0, 1], s=8000)

xmin = []
for i in range(20):
    fig.canvas.draw()
    ax.update_datalim(sc.get_datalim(ax.transData))
    ax.autoscale()
    print(ax.get_xlim())
    xmin += [ax.get_xlim()[0]]
(-0.2151878284989904, 1.2151878284989903)
(-0.24703510991539251, 1.247035109915392)
(-0.2558090743822613, 1.255809074382261)
(-0.2582263121822271, 1.2582263121822272)
(-0.2588922641134946, 1.258892264113495)
(-0.25907573467429973, 1.2590757346742996)
(-0.25912628103523305, 1.259126281035233)
(-0.259140206618675, 1.2591402066186748)
(-0.25914404313371997, 1.2591440431337202)

If there is no solution (s=400000) the limits never converge...

(-1.8256753925786404, 2.825675392578641)
(-4.619776234706715, 5.619776234706715)
(-10.109973046555703, 11.109973046555705)
(-20.89779450896202, 21.89779450896202)
(-42.095047702798034, 43.09504770279803)
(-83.74604689011363, 84.74604689011363)
...
(-8.246981079288803e+28, 8.246981079288803e+28)
(-1.620469402758045e+29, 1.620469402758045e+29)
(-3.184099805769735e+29, 3.184099805769735e+29)

anntzer commented 5 years ago

This is slowly being worked on (#13593 and #13642 will be in 3.2 and are already likely to break everyone's code :p); I'm sure more will happen in the 3.3 timeframe. Remilestoning.

jklymak commented 3 years ago

We've done substantial work adjacent to this, in particular #13593 and #13642, mentioned above. I'm going to close here, and if we feel there are still actionable issues, we should make a clean issue report with reproducible examples of the problems.