Closed andrewgsavage closed 1 year ago
Would it be useful to document the general implications and idioms of using ExtensionArrays? I'm gradually learning that when converting an ndarray into a PandasArray, np.nan becomes NA (and vice-versa in the other direction). Helping people understand how NA and np.nan play inside of Quantities, and the most efficient idioms for dealing with them correctly (pd.isna vs. np.isnan) could be very helpful. I could help write it if you tell me where you think it belongs.
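For anyone following along, here is a minimal sketch of the NaN/NA round trip described above, using plain pandas rather than PintArray. It assumes the nullable "Float64" dtype stands in for the extension-array case and that pandas keeps its default behaviour of treating NaN as missing on conversion; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# A plain float ndarray marks missing data with np.nan.
raw = np.array([1.0, np.nan, 3.0])

# Converting to the nullable "Float64" extension dtype turns NaN into pd.NA
# (assumption: default pandas settings).
masked = pd.array(raw, dtype="Float64")

# Going back to a float ndarray, ask for NaN explicitly as the fill value.
back = masked.to_numpy(dtype="float64", na_value=np.nan)

# pd.isna understands both representations; np.isnan is used here only on
# the plain float ndarray, where it is known to be safe.
mask_raw = pd.isna(raw)
mask_masked = pd.isna(masked)
mask_back = np.isnan(back)
```

The point of the sketch is that pd.isna gives the same mask on either side of the conversion, so it is the safer idiom when you don't know which kind of array you are holding.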
I don't see many issues relating to nans so I'm wondering if you're encountering this because you're doing less typical workflows. It would be worth making an issue with your findings to understand where they're coming from.
I expect it to be something to do with PintArrays either having PandasArrays or some form of np.array holding the values. I wonder if a better way for uncertainties would be to create an UncertaintyArray that the PintArray can use for values?
You are one step ahead of me. Last night I put my finger on what seems to be the last problem in my own test cases (the pint_pandas test cases don't trip it). When pd.merge needs to fill unmatched values with NaNs, it was creating invalid ndarrays due to the NaN value I've created. I'll write up findings when I have more to report, but I think I have a handle on a way forward. Thanks!
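As a rough illustration of where those fill values come from (plain pandas with ordinary float columns, not the uncertainties case itself; the frames and column names are made up for the example), a left merge fills the unmatched rows with a missing value:

```python
import pandas as pd

# Hypothetical frames: key "c" has no partner on the right-hand side.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"key": ["a", "b"], "y": [10.0, 20.0]})

# how="left" keeps every row of `left`; pandas has to invent a value
# for "y" in the unmatched row, and it uses the column's missing value.
merged = pd.merge(left, right, on="key", how="left")

# True exactly where the merge had to fill in a missing value.
unmatched = merged["y"].isna()
```

Any array type sitting in the "y" column has to be able to hold whatever pandas picks as that fill value, which is presumably where a custom NaN can go wrong.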
I've made a lot of progress working with pd.NA and reading through dtype("O") and validating values as UFloat. I think it might be more elegant to create and use an UncertaintyArray, but I want to try to finish what I've almost got working, then discuss how to possibly make it more elegant with an UncertaintyArray type.
The test cases I'm looking at right now are the complex128 test cases, which, because they are actually EAs, and not ComplexArray types, are tickling what I've done in unexpected ways. Which is a good way to ensure the robustness of what I'm doing, rather than hiding behind a fresh type (I think).
Has anyone had a chance to look at this? You can view the docs here: https://pint-pandas.readthedocs.io/en/docs/
I just submitted some fresh changes to enable testing of complex128 for Pandas 2.1.0rc0+96. Over the past few weeks the pandas team have progressively improved underlying code so that as of today, essentially no special adaptations are required.
I still need to see if I can similarly simplify uncertainties, but I think that when 2.1 comes out things are going to be a lot simpler (both to document and implement).
Is there a convenient way I can leave comments inline? Like comments on a pull request?
in the .rst files that have been added
OK, so I cloned the repo and made a change to getting started, which in my version reads:
The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.
Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data. A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary
The reason I'm telling you this in a comment and not with a PR is because I DON'T UNDERSTAND GITHUB!!! I really thought I did the right things in terms of cloning, forking, editing, etc., but GitHub insists on doing things most unintuitive to me. If I can get some help sorting out how to put my carefully placed andrewgsavage/pint-pandas repo into a properly described and defined git place that doesn't make it look like the twin of MichaelTiemannOSC/pint-pandas, I'd appreciate it. I do have hgrecco/pint and hgrecco/pint-pandas properly separated. I just somehow didn't say all the right magic when I tried to make a change relative to your repo as my upstream source.
You can add comments inline by going to Files changed, clicking a file, then clicking the blue + after hovering over a line. Adding comments like that is fine.
That's a good solution...
If you want to make changes in a PR, make a branch under MichaelTiemannOSC/pint-pandas that tracks andrewgsavage/pint-pandas:docs, then open a PR against andrewgsavage/pint-pandas:docs (i.e. go to https://github.com/andrewgsavage/pint-pandas/pulls).
I'll add this bit:
The Pandas package provides powerful DataFrame and Series abstractions for dealing with numerical, temporal, categorical, string-based, and even user-defined data (using its ExtensionArray feature). The Pint package provides a rich and extensible vocabulary of units for constructing Quantities and an equally rich and extensible range of unit conversions to make it easy to perform unit-safe calculations using Quantities. Pint-pandas provides PintArray, a Pandas ExtensionArray that efficiently implements Pandas DataFrame and Series functionality as unit-aware operations where appropriate.
Those who have used Pint know well that good units discipline often catches not only simple mistkaes, but sometimes more fundamental errors as well. Pint-pandas can reveal similar errors when it comes to slicing and dicing Pandas data.
I think this bit is too in detail for the getting started section, but could fit elsewhere
A 1-dimensional Pandas Series can use a PintArray to hold its values. Columns in 2-dimensional Pandas DataFrame can contain PintArrays--with all the efficiency the ExtensionArray APIs provide, but rows are a special case. If all elements of the row have the same units, the row will be returned as Series backed by a PintArray with those units. But if the units are heterogeneous, the row will be returned as a Series consisting of discrete Quantities (or raw data if the column values don't have units). All Quantity data within such Series will follow Pint rules of unit conversions and will give error messages when units are not compatible, but some error messages may lose information as Pandas tries to align two incompatible Quantities to non-unitized magnitude values. To get the greatest benefit from Pint-pandas (and Pandas in general), make your columns from homogeneous data and let your rows contain the heterogeneous data when necessary
Please pass through a spell-check first. I notice I misspelled mistakes!
I think this bit is too in detail for the getting started section, but could fit elsewhere
An example would make this clearer and could go under common issues?
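One possible starting point for such an example is sketched here in plain pandas only, as an analogy; it does not use PintArray, and the frame and column names are invented. It shows the column/row asymmetry the paragraph relies on: a column keeps its homogeneous dtype, while a row that mixes types falls back to an object Series.

```python
import pandas as pd

# Hypothetical frame: one numeric column, one string column.
df = pd.DataFrame({"length": [1.0, 2.0], "label": ["a", "b"]})

# A column is homogeneous, so it keeps an efficient typed array.
column = df["length"]

# A row spans columns of different types, so pandas has to fall back
# to a generic object Series holding individual Python values.
row = df.iloc[0]
```

A pint-pandas version would presumably swap the columns for ones with different units, but that behaviour should be checked against the library before it goes in the docs.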
Plot twist: the next version of Pandas (2.1.1? 2.2.0?) will allow EAs to support 2d values, which means that the one-dimensional explanations I've given above will no longer be quite correct. Of course pint-pandas could make the decision that PintArrays are only ever one-dimensional, and we can clean up the text to say that. But we could also allow for the possibility that a whole 2-dimensional DataFrame has quantities, and that rows and columns alike can not only be shown as quantified, but can also have values set within them via .loc and .iloc while retaining their EA nature.
bors r+
Build succeeded!
pre-commit run --all-files with no errors
https://pint-pandas.readthedocs.io/en/docs/