matplotlib / data-prototype

https://matplotlib.org/data-prototype
BSD 3-Clause "New" or "Revised" License

DOC: write out a prose narrative of the proposed design #14

Open tacaswell opened 1 year ago

tacaswell commented 1 year ago

I also discovered that mathjax does not work with singlehtml (it builds cleanly and adds the math tags, but does not add mathjax to the output file), and I was getting local build issues due to the mpl-sphinx-theme.

We should decide whether we want to:

  1. not use any math in the docs
  2. switch to normal html (the docs might be getting long enough that we need to do that anyway)
  3. fix the mathjax extension to work with singlehtml

tacaswell commented 1 year ago

Other relevant links:

jklymak commented 1 year ago

This is super helpful for me.

A lot of this design has to do with holding numpy arrays, and being able to subsample them on the fly, and carry some meta information around about the numpy arrays. I think that is great and will be a huge step forward for us. Where I'm still unclear is our interface with other data types and units.

How do we tell downstream or adjacent libraries how we will decode their objects, both just to get the numpy array and to do unit conversion? Specifying this seems pretty key to me, and may be beyond just our remit. Pandas and xarray have things like obj.to_numpy() and obj.values, and I think they have __array__ representations. I think we largely expect np.asarray(obj) to work. But that kills units.
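To make the "that kills units" point concrete, here is a minimal sketch. `UnitArray` is a hypothetical stand-in for a downstream array-like and is not taken from pandas, xarray, or pint; the only real API used is NumPy's `__array__` hook and `np.asarray`.

```python
import numpy as np


class UnitArray:
    """Hypothetical array-like that carries one unit for the whole array."""

    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=float)
        self.unit = unit

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook from np.asarray(); only the raw numbers
        # are handed back, so the unit has nowhere to go.
        if dtype is None:
            return self.values
        return self.values.astype(dtype)


distance = UnitArray([1.0, 2.5, 4.0], unit="km")
plain = np.asarray(distance)

print(type(plain).__name__)    # ndarray
print(hasattr(plain, "unit"))  # False
```

The result is a plain ndarray, so any unit handling has to happen before this call or through a separate channel.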

Unit support is confusing as well. I think the confusion comes from whether elements of arrays should have units and that is where we detect the dtype, or whether the array itself should have the units. I feel we should put our foot down: I think arrays should have units, not array elements. I think that leaves lists of pint objects out in the cold, and I'm not even sure it works with the jpl_units interface, but I pretty strongly feel there is no reason to have arrays with mixed units, and that parsing unit information from either the array or its elements is far too confusing.

However, that has the same issue, in that we need a standard for how arrays hold their unit information, or how we can access it. Or we just natively support object arrays (categorical) and np.datetime64 arrays (dates) and let downstream libraries write their own converters like we do now. However, if we do that, do we also make the conversion interface the way the downstream libraries extract the numpy array? Then we don't need to guess how they turn their object into an array. Note that we don't do this presently.
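One way to picture a conversion interface that also owns array extraction is sketched below. Everything in it is hypothetical: `Quantity`, `QuantityConverter`, `to_values`, `registry`, and `extract` are illustrative names, not the existing `matplotlib.units` API, and plain lists stand in for numpy arrays to keep the sketch dependency-free.

```python
from dataclasses import dataclass


@dataclass
class Quantity:
    """Hypothetical downstream type: values that all share one unit."""
    magnitudes: list
    unit: str


class QuantityConverter:
    """Hypothetical converter that both unpacks values and converts units."""

    # Conversion factors to metres, for illustration only.
    _to_metres = {"m": 1.0, "km": 1000.0}

    def default_units(self, obj):
        return obj.unit

    def to_values(self, obj, unit):
        # The library that owns the type decides how to unpack it, so the
        # plotting side never guesses between .values, .to_numpy(), etc.
        factor = self._to_metres[obj.unit] / self._to_metres[unit]
        return [m * factor for m in obj.magnitudes]


# Hypothetical registry keyed by type, filled in by downstream libraries.
registry = {Quantity: QuantityConverter()}


def extract(obj, unit=None):
    """Look up the converter for obj and return plain floats in `unit`."""
    converter = registry[type(obj)]
    if unit is None:
        unit = converter.default_units(obj)
    return converter.to_values(obj, unit)


print(extract(Quantity([1.0, 2.5], "km"), unit="m"))  # [1000.0, 2500.0]
```

In this arrangement the unit is read once from the array-level object, which matches the "arrays have units, not elements" position above.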

You have probably thought through some of the above, but I think it's hard because unlike much of the rest of this proposal, it involves the interface with other projects, and probably the conversation should start soon.

tacaswell commented 1 year ago

I think that is great and will be a huge step forward for us. Where I'm still unclear is our interface with other data types and units.

Everything goes under the DataContainer wrapper so it has a consistent API. We have not yet gotten to the design of the helper functions that make fabricating these objects easy. I think one of the persistent issues with Matplotlib as it is now is that too much of the "let's make it easy to use" leaked down into the core of the library, which makes it overly complicated and harder both to use and to maintain in the long term.

Unit support is confusing as well. I think the confusion comes from whether elements of arrays should have units and that is where we detect the dtype, or whether the array itself should have the units.

I think this is primarily a problem with the auto-detection of units. Once we know how to handle a given input, we should not care how the units are actually carried. That said, I have hopes that the numpy dtype work will flatten much of the diversity in this space from our point of view.

jklymak commented 1 year ago

Everything goes under the DataContainer wrapper so it has a consistent API. We have not gotten to the design of the helper-functions to make fabricating these objects easy yet.

I guess that's what I'm asking: who makes the DataContainer? I guess the way we do things now is that we ask the converter to do it, so I assume this would be the same, but now the downstream library would provide the DataContainer implementation? Are they going to need to make a different Container for each plot type they want to support?
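For readers following the question, this is a rough sketch of what a downstream-provided container might look like. It is an assumption for illustration, not the prototype's actual API: the `describe`/`query` method names, the `max_points` parameter, and `DownstreamSeriesContainer` are all hypothetical, and lists stand in for numpy arrays.

```python
from typing import Any, Dict, Protocol, Sequence


class DataContainer(Protocol):
    """Hypothetical minimal container interface (not the real prototype API)."""

    def describe(self) -> Dict[str, str]: ...

    def query(self, max_points: int) -> Dict[str, Sequence[Any]]: ...


class DownstreamSeriesContainer:
    """Hypothetical container a downstream library could ship for its type."""

    def __init__(self, x, y, unit):
        self._x, self._y, self._unit = list(x), list(y), unit

    def describe(self):
        # Metadata travels with the container, not with each element.
        return {"x": "float", "y": f"float [{self._unit}]"}

    def query(self, max_points):
        # Subsample on the fly by striding; assumes max_points >= 1.
        step = max(1, -(-len(self._x) // max_points))  # ceiling division
        return {"x": self._x[::step], "y": self._y[::step]}


def draw_line(container: DataContainer, max_points: int = 4):
    """Stand-in for an artist: it only asks for the fields "x" and "y"."""
    data = container.query(max_points)
    return list(zip(data["x"], data["y"]))


series = DownstreamSeriesContainer(range(10), [2 * v for v in range(10)], "km")
print(draw_line(series))  # [(0, 0), (3, 6), (6, 12), (9, 18)]
```

Under this assumed design, any consumer that asks for the same field names could reuse the same container; whether that holds across plot types in the real proposal is exactly the open question here.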