proplot-dev / proplot

🎨 A succinct matplotlib wrapper for making beautiful, publication-quality graphics
https://proplot.readthedocs.io
MIT License

Support for plotting distribution with different number of samples #426

Closed reemagit closed 1 year ago

reemagit commented 1 year ago

Hello,

I haven't found a way to produce boxplots where boxes describe populations with different sample sizes. As far as I understood, the problem is that numpy arrays do not support non-rectangular data.

E.g. with standard matplotlib it is possible to plot:

data = [ (2,3,4), (1,2), (1,2,3,4,5,6,7) ]
plt.boxplot(data)

As far as I know, there is no way to do the same with proplot.

My only workaround has been to define a function that converts a list of lists into a rectangular array of shape (N, M), where M is the number of boxes and N is the maximum length across the list elements. In each column, the leftover elements are filled with np.nan. This seems to work because proplot filters out the bad values when computing the statistics. This is a draft of the function:

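import numpy as np

def convert_to_proplot(list_of_lists):
    # Number of rows = length of the longest group
    maxval = max(len(elem) for elem in list_of_lists)
    # Start from an all-NaN array so that short groups stay NaN-padded
    out = np.full((maxval, len(list_of_lists)), np.nan)
    for i, elem in enumerate(list_of_lists):
        out[:len(elem), i] = elem  # one column per box
    return out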

Would it be possible to have proplot handle non-rectangular data? Alternatively, would it be possible to have within the proplot package a function similar to the one above to quickly convert non-rectangular data to rectangular data?

Thank you in advance. Enrico

lukelbd commented 1 year ago

Hi Enrico, this should already be supported in both proplot and matplotlib. It's kind of weird to wrap your head around, but basically numpy supports defining "ragged" nested arrays where each element is itself a 1D sequence. In your example, the idea would be to define a 3-element array, where each element contains one tuple from your list. It works as follows:

data = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)
fig, ax = pplt.subplots()
ax.boxplot(data)
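
To make the "ragged" idea concrete, here is a minimal sketch (NumPy only, no plotting) of what such an object array looks like. The variable names are illustrative and not part of proplot:

```python
import numpy as np

# With dtype=object, NumPy stores each tuple as a single element
# instead of trying to build a rectangular 2D array.
ragged = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)

print(ragged.shape)  # one entry per box
print(ragged.dtype)  # object

# Each element keeps its own length
lengths = [len(item) for item in ragged]
print(lengths)
```

The array is 1D: its length is the number of boxes, and each box keeps its own sample size.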


Also simply passing your nested list to ax.boxplot() without the np.array(..., dtype=object) should work, but may raise the following warning (may try to suppress this warning in future versions):

/Users/ldavis/mambaforge/lib/python3.10/site-packages/numpy/core/shape_base.py:65:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple
of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated.
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

reemagit commented 1 year ago

Thanks for the answer. I tried the code you provided:

data = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)
fig, ax = pplt.subplots()
ax.boxplot(data)

but it yields the error:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Is it a problem only on my side?
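
For reference, a minimal sketch that appears to reproduce the same TypeError with NumPy alone, assuming the error comes from applying np.isfinite to an object-dtype array (no proplot involved, names are illustrative):

```python
import numpy as np

ragged = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)

# np.isfinite has no implementation for object-dtype arrays,
# so calling it on the ragged array raises a TypeError.
try:
    np.isfinite(ragged)
    raised = False
    message = ""
except TypeError as err:
    raised = True
    message = str(err)

print(raised)
print(message)
```

If this also fails outside of proplot, the problem lies in how the object array is validated rather than in the plotting call itself.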

lukelbd commented 1 year ago

Hmm I think you might be using an older version? What does print(pplt.__version__) show? You can install the "dev" version (includes newest/unreleased changes) using pip install git+https://github.com/proplot-dev/proplot.git.

reemagit commented 1 year ago

My version was 0.9.7. I have now installed the dev version, which reports 0.9.5.post358 (is it expected that the version number is lower in a dev install?). When I run the same code, I now get an exception from numpy's function_base.py:

   4559 def _lerp(a, b, t, out=None):
   4560     """
   4561     Compute the linear interpolation weighted by gamma on each point of
   4562     two same shape array.
   (...)
   4571         Output array.
   4572     """
-> 4573     diff_b_a = subtract(b, a)
   4574     # asanyarray is a stop-gap until gh-13105
   4575     lerp_interpolation = asanyarray(add(a, diff_b_a * t, out=out))

TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
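
One reading of this traceback, stated as an assumption, is that the percentile interpolation is being applied to the tuples themselves rather than to the numbers inside each tuple. A minimal NumPy-only sketch of computing the quartiles one group at a time (names are illustrative, not proplot API):

```python
import numpy as np

ragged = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)

# Convert each group to a float array first, then compute its quartiles.
quartiles = [
    np.percentile(np.asarray(group, dtype=float), [25, 50, 75])
    for group in ragged
]

for group, q in zip(ragged, quartiles):
    print(group, q)
```

Computed this way, every box gets its own statistics regardless of sample size.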

lukelbd commented 1 year ago

Yeah in this case -- 0.9.7 is built from a separate branch off of the 0.9.5 commit, it doesn't live on the main branch. You still have the latest version -- 0.9.7 is equivalent to 0.9.5, but enforces a maximum matplotlib version when you install.

It looks like this is a bug in your function. Add a = np.asarray(a) and b = np.asarray(b) in the first two lines of your function. Let me know if you have any other proplot-specific issues.

reemagit commented 5 months ago

Hello, I am re-opening the issue because I tested your code:

data = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)
fig, ax = pplt.subplots()
ax.boxplot(data)

and I get an error that seems related to the masked_invalid function that proplot calls to mask invalid values.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[471], line 3
      1 data = np.array([(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)], dtype=object)
      2 fig, ax = pplt.subplots()
----> 3 ax.boxplot(data)

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/proplot/internals/process.py:284, in _preprocess_args.<locals>.decorator.<locals>._redirect_or_standardize(self, *args, **kwargs)
    281             ureg.setup_matplotlib(True)
    283 # Call main function
--> 284 return func(self, *args, **kwargs)

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/proplot/axes/plot.py:3618, in PlotAxes.boxplot(self, *args, **kwargs)
   3614 """
   3615 %(plot.boxplot)s
   3616 """
   3617 kwargs = _parse_vert(default_vert=True, **kwargs)
-> 3618 return self._apply_boxplot(*args, **kwargs)

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/proplot/axes/plot.py:3551, in PlotAxes._apply_boxplot(self, x, y, mean, means, vert, fill, filled, marker, markersize, **kwargs)
   3549 if means:
   3550     kw['showmeans'] = kw['meanline'] = True
-> 3551 y = process._dist_clean(y)
   3552 artists = self._plot_native('boxplot', y, vert=vert, **kw)
   3553 artists = artists or {}  # necessary?

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/proplot/internals/process.py:298, in _dist_clean(distribution)
    296 if distribution.ndim == 1:
    297     distribution = distribution[:, None]
--> 298 distribution, units = _to_masked_array(distribution)  # no copy needed
    299 distribution = tuple(
    300     distribution[..., i].compressed() for i in range(distribution.shape[-1])
    301 )
    302 if units is not None:

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/proplot/internals/process.py:143, in _to_masked_array(data, copy)
    141 if ndarray is not Quantity and isinstance(data, Quantity):
    142     data, units = data.magnitude, data.units
--> 143 data = ma.masked_invalid(data, copy=copy)
    144 if np.issubdtype(data.dtype, int):
    145     data = data.astype(float)

File ~/.conda/envs/polvcopd2/lib/python3.9/site-packages/numpy/ma/core.py:2360, in masked_invalid(a, copy)
   2333 """
   2334 Mask an array where invalid values occur (NaNs or infs).
   2335 
   (...)
   2357 
   2358 """
   2359 a = np.array(a, copy=False, subok=True)
-> 2360 res = masked_where(~(np.isfinite(a)), a, copy=copy)
   2361 # masked_invalid previously never returned nomask as a mask and doing so
   2362 # threw off matplotlib (gh-22842).  So use shrink=False:
   2363 if res._mask is nomask:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

While this wouldn't be a proplot issue per se, as far as I understand the problem comes from masked_invalid calling numpy's isfinite function, which expects a homogeneous numeric array. I found some threads where people hit this problem with np.isfinite(), and the suggestions were essentially to cast the array to float. But this can't be done with ragged arrays, and I've read that masking ragged arrays is not well supported because they are tricky.

Would there be a workaround for this issue? Alternatively, which version of numpy works as expected?

I am on the latest numpy (1.26) and proplot (0.9.7) versions.

Thank you very much.
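
A possible workaround, sketched with NumPy only: pad the groups with NaN into a rectangular float array, as in the first comment of this thread, so that masked_invalid receives a homogeneous dtype. The per-column compressed() step mirrors what _dist_clean does in the traceback above. Passing the padded array to ax.boxplot() is what the first comment reported as working; that call is not verified here.

```python
import numpy as np

data = [(2, 3, 4), (1, 2), (1, 2, 3, 4, 5, 6, 7)]

# Pad every group with NaN up to the length of the longest group.
n_rows = max(len(group) for group in data)
padded = np.full((n_rows, len(data)), np.nan)
for j, group in enumerate(data):
    padded[:len(group), j] = group

# On a float array, masked_invalid works and masks the NaN padding.
masked = np.ma.masked_invalid(padded)

# Dropping the masked entries column by column recovers the original groups.
groups = [masked[:, j].compressed() for j in range(masked.shape[1])]
print([g.tolist() for g in groups])

# The padded array (not the object array) would then be passed on, e.g.:
# ax.boxplot(padded)
```

The padding values never reach the statistics because they are masked before each column is compressed.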