Handle lists and arrays as `x`, `y` in `density_scatter` and siblings

janosh commented 1 year ago

This simple example

import numpy as np

from pymatviz import density_scatter

arr = np.arange(5)
lst = list(range(5))

ax = density_scatter(x=arr, y=arr)
ax = density_scatter(x=lst, y=lst)

embarrassingly raises

File ~/dev/pymatviz/pymatviz/parity.py:43, in hist_density(x, y, df, sort, bins)
     39 x, y = df_to_arrays(df, x, y)
     41 data, x_e, y_e = np.histogram2d(x, y, bins=bins)
---> 43 zs = scipy.interpolate.interpn(
     44     (0.5 * (x_e[1:] + x_e[:-1]), 0.5 * (y_e[1:] + y_e[:-1])),
     45     data,
     46     np.vstack([x, y]).T,
     47     method="splinef2d",
     48     bounds_error=False,
     49 )
...
    611 result[idx_valid] = interp.ev(xi[idx_valid, 0], xi[idx_valid, 1])
--> 612 result[np.logical_not(idx_valid)] = fill_value
    614 return result.reshape(xi_shape[:-1])

ValueError: cannot convert float NaN to integer

DanielYang59 commented 8 months ago

I'm working on this.

DanielYang59 commented 8 months ago

Hi @janosh sorry for the long silence.

I'm thinking about cleaning up the code a little bit (starting with density_scatter and siblings, and then others if I have the energy).

What I want to do (in terms of dataset handling) is still allowing users to pass various types (ArrayLike, pd.DataFrame and more) into the plotter function, but internally convert them to a single data type (pd.DataFrame maybe?), to get rid of data processing codes inside the plotter and potential incompatibility issues like this.

This should not be hard (just a simple utility script should do the job) nor breaking. Can I have the permission to implement this change? And do you have any suggestions on this matter? Thanks!

janosh commented 8 months ago

Hi @janosh sorry for the long silence.

no worries at all.

Can I have the permission to implement this change?

of course, that would be much appreciated! 🙏

And do you have any suggestions on this matter?

yes! have a look at

https://github.com/janosh/pymatviz/blob/a2d29e473feee7cb2a60c02cc061aa754071c2fb/pymatviz/utils.py#L486-L496

which is used in density_scatter and many other plot functions to standardize input data

https://github.com/janosh/pymatviz/blob/a2d29e473feee7cb2a60c02cc061aa754071c2fb/pymatviz/parity.py#L115

DanielYang59 commented 8 months ago

yes! have a look at

https://github.com/janosh/pymatviz/blob/a2d29e473feee7cb2a60c02cc061aa754071c2fb/pymatviz/utils.py#L486-L496

Thanks a lot @janosh. Maybe we would need to add a more general "to desired data type" utility function (not just pd.DataFrame to array). In this case, which do you prefer? pd.DataFrame or np.ndarray or both (I personally prefer np.ndarray because it's more explicit and flat, and has efficient data processing support)? What do you think?

On top of all these technical things, I noticed despite all these beautiful plots, pymatviz seems to lack its documentation site (maybe I failed to find it?). This API page seems more like an aggregation of docstrings to me? If so, I would be more than happy to help build one.

janosh commented 8 months ago

This API page seems more like an aggregation of docstrings to me? If so, I would be more than happy to help build one.

yes, that has been on my todo list for a long time. the current docs are terrible!

would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure. we could essentially copy all the cells in _generate_assets.py into separate files and write some docs/explanation around it.

i'd prefer if not to use Jupyter notebooks for this as i find them more annoying to work with than Python scripts. luckily VS Code Interactive Window can run Python scripts just like notebooks and also supports exporting cells and their output to HTML. we just need to find a way to script this functionality so that the HTML is updated whenever source files change.

Screenshot 2024-02-17 at 10 37 41

Maybe we would need to add a more general "to desired data type" utility function (not just pd.DataFrame to array)

i'm open to that. i'd prefer dataframes over arrays though as they have a more powerful API

DanielYang59 commented 8 months ago

would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure.

I'm thinking just building separate a docs site with sphinx templates like "Read the Docs", I have some experience with that and I will start working on this soon (would open a separate PR aside from the data preprocessing topic )😄 .

i'd prefer if not to use Jupyter notebooks for this

I also find very long jupyter (like _generate_assets.py) could be hard to navigate through. Would be much better to separate assert/example generation scripts for each section (histogram/ptable and such). I would push a new PR once I finish the draft and iterate together. Thanks!

janosh commented 8 months ago

i don't think we need to start from scratch and i'm not a huge fan of sphinx tbh.

there are some preliminary example notebooks already up on the current docs page. e.g. see https://pymatviz.janosh.dev/notebooks/mp_bimodal_e_form.

you can get a list of them by hitting cmd + k to bring up the nav palette on the current docs. i think just converting those notebooks to python scripts and adding more of them would be great!

Screenshot 2024-02-17 at 11 07 14

janosh / pymatviz

Handle lists and arrays as `x`, `y` in `density_scatter` and siblings #81