hgrecco / pint-pandas

Pandas support for pint

docs #191

Closed andrewgsavage closed 1 year ago

andrewgsavage commented 1 year ago

https://pint-pandas.readthedocs.io/en/docs/

MichaelTiemannOSC commented 1 year ago

Would it be useful to document the general implications and idioms of using ExtensionArrays? I'm gradually learning that when converting an ndarray into a PandasArray, np.nan becomes NA (and vice-versa in the other direction). Helping people understand how NA and np.nan play inside of Quantities, and the most efficient idioms for dealing with them correctly (pd.isna vs. np.isnan) could be very helpful. I could help write it if you tell me where you think it belongs.
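A minimal sketch of the round trip described above. It assumes a nullable `Float64` array as a stand-in for the extension array, and it is plain pandas, not pint-pandas code:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])

# Assumption: a nullable "Float64" array stands in for the extension array.
# On construction, the np.nan entry is stored as pd.NA.
nullable = pd.array(values, dtype="Float64")

# pd.isna understands pd.NA (and np.nan), so it is safe on either form.
na_mask = pd.isna(nullable)

# Going back to a plain float ndarray turns pd.NA into np.nan again.
back = nullable.to_numpy(dtype="float64", na_value=np.nan)

# np.isnan is only applied here to a plain float64 ndarray.
nan_mask = np.isnan(back)
```

In this sketch `pd.isna` is the check that works on both representations, while `np.isnan` is kept for plain float ndarrays.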

andrewgsavage commented 1 year ago

I don't see many issues relating to nans so I'm wondering if you're encountering this because you're doing less typical workflows. It would be worth making an issue with your findings to understand where they're coming from.

I expect it to be something to do with PintArrays either having PandasArrays or some form of np.array holding the values. I wonder if a better way for uncertainties would be to create an UncertaintyArray that the PintArray can use for values?

MichaelTiemannOSC commented 1 year ago

You are one step ahead of me. Last night I put my finger on what seems to be the last problem in my own test cases (the pint_pandas test cases don't trip it). When pd.merge needs to fill unmatched values with NaNs, it was creating invalid ndarrays due to the NaN value I've created. I'll write up findings when I have more to report, but I think I have a handle on a way forward. Thanks!
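For illustration, this is the fill behaviour in plain pandas. The frame and column names are made up, and no units or uncertainties are involved:

```python
import pandas as pd

# Hypothetical frames: "c" has no match on the right-hand side.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"key": ["a", "b"], "y": [10.0, 20.0]})

# A left merge keeps every row of `left`; the unmatched row gets a
# missing value in the "y" column.
merged = left.merge(right, on="key", how="left")

# pd.isna / Series.isna reports which entries were filled in.
filled = merged["y"].isna()
```

The filled entry is where a custom value type has to cope with whatever missing-value marker pandas chooses for that column.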

MichaelTiemannOSC commented 1 year ago

I've made a lot of progress working with pd.NA and reading through dtype("O") and validating values as UFloat. I think it might be more elegant to create and use an UncertaintyArray, but I want to try to finish what I've almost got working, then discuss how to possibly make it more elegant with an UncertaintyArray type.

The test cases I'm looking at right now are the complex128 test cases, which, because they are actually EAs, and not ComplexArray types, are tickling what I've done in unexpected ways. Which is a good way to ensure the robustness of what I'm doing, rather than hiding behind a fresh type (I think).

andrewgsavage commented 1 year ago

Has anyone had a chance to look at this? You can view the docs here: https://pint-pandas.readthedocs.io/en/docs/

MichaelTiemannOSC commented 1 year ago

I just submitted some fresh changes to enable testing of complex128 for Pandas 2.1.0rc0+96. Over the past few weeks the pandas team have progressively improved underlying code so that as of today, essentially no special adaptations are required.

I still need to see if I can similarly simplify uncertainties, but I think that when 2.1 comes out things are going to be a lot simpler (both to document and implement).

MichaelTiemannOSC commented 1 year ago

Is there a convenient way I can leave comments inline? Like comments on a pull request?

andrewgsavage commented 1 year ago

> Is there a convenient way I can leave comments inline? Like comments on a pull request?

In the .rst files that have been added.

MichaelTiemannOSC commented 1 year ago

OK, so I cloned the repo and made a change to getting started, which in my version reads:

The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.

Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data. A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary

The reason I'm telling you this in a comment and not with a PR is because I DON'T UNDERSTAND GITHUB!!! I really thought I did the right things in terms of cloning, forking, editing, etc., but GitHub insists on doing things most unintuitive to me. If I can get some help sorting out how to put my carefully placed andrewgsavage/pint-pandas repo into a properly described and defined git place that doesn't make it look like the twin of MichaelTiemannOSC/pint-pandas, I'd appreciate it. I do have hgrecco/pint and hgrecco/pint-pandas properly separated. I just somehow didn't say all the right magic when I tried to make a change relative to your repo as my upstream source.

andrewgsavage commented 1 year ago

You can add comments inline by going to Files changed, clicking a file, then clicking the blue + that appears when you hover over a line. Adding comments like that is fine.

MichaelTiemannOSC commented 1 year ago

That's a good solution...

andrewgsavage commented 1 year ago

If you want to make changes in a PR, make a branch under MichaelTiemannOSC/pint-pandas that tracks andrewgsavage/pint-pandas:docs, then open a PR against andrewgsavage/pint-pandas:docs (i.e. go to https://github.com/andrewgsavage/pint-pandas/pulls).

andrewgsavage commented 1 year ago

I'll add this bit:

The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.

Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data.

I think this bit is too in detail for the getting started section, but could fit elsewhere

A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary

MichaelTiemannOSC commented 1 year ago

Please pass through a spell-check first. I notice I misspelled mistakes!

andrewgsavage commented 1 year ago

> I think this bit is too in detail for the getting started section, but could fit elsewhere

An example would make this clearer and could go under common issues?

MichaelTiemannOSC commented 1 year ago

Plot twist: the next version of Pandas (2.1.1? 2.2.0?) will allow EAs to support 2d values, which means that the one-dimensional explanations I've given above will no longer be quite correct. Of course pint-pandas could make the decision that PintArrays are only ever one-dimensional, and we can clean up the text to say that. But we could also allow for the possibility that a whole 2-dimensional DataFrame has quantities, so that rows and columns alike can not only be shown as quantified, but can also have values set within them via .loc and .iloc while retaining their EA nature.

andrewgsavage commented 1 year ago

bors r+

bors[bot] commented 1 year ago

Build succeeded!
