pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.59k stars 17.9k forks source link

DOC: Categorical sort_values and sort Documentation #12785

Closed gfyoung closed 8 years ago

gfyoung commented 8 years ago

In categorical.py, we enforce the fact that self must be ordered when calling min or max. However, self can be unordered when calling sort_values. This doesn't make sense in my mind, for if you can sort the values for unordered self, then I can then find a minimum value of self. The same comment applies to argsort as well.

jreback commented 8 years ago

prob an overight, should be using check_for_ordered(..) internally there.

gfyoung commented 8 years ago

Given how straightforward the PR is, would it be for v0.18.1 or v0.19.0?

jreback commented 8 years ago

its a bug, why would it not be 0.18.1?

gfyoung commented 8 years ago

Just wanted to double check. :smile:

jorisvandenbossche commented 8 years ago

Are we sure this is a bug? We have had a lot of discussions about the API, and certainly also about the ordered implication, min() raising etc. But I can't really remember if we discussed this specific issue.

In any case, I think it can be quite annoying if you cannot sort values (eg to just group them). And as a comparison, R does allow to sort, but raises for min() in case of unordered factors.

jreback commented 8 years ago

you could be right @jorisvandenbossche, I don't really remember.

cc @janschulz

gfyoung commented 8 years ago

@jorisvandenbossche : Is calling .as_ordered() that problematic? How about my explanation above?

@Everyone : Regardless of whether this is a bug or not, something will have to give. The documentation insists that argsort and sort can only be called on ordered Categorical objects, but the internals do not match.

jankatins commented 8 years ago

That was a consious descision sometime after implementing categoricals (and I just saw that the docstring was not changed: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L205)

commit is here: https://github.com/pydata/pandas/commit/87fec5b527dcbb2dd99592e2d7e298cda5ac225d

jreback commented 8 years ago

https://github.com/pydata/pandas/pull/9347 and #9611

jankatins commented 8 years ago

Yep, I think it started here: https://github.com/pydata/pandas/pull/9611#issuecomment-77928273

jorisvandenbossche commented 8 years ago

But in the end, it was changed here: https://github.com/pydata/pandas/pull/9622 (@janschulz you can see that in the commit (below the commit message))

gfyoung commented 8 years ago

Perhaps refactoring groupby would be too difficult, but IMHO don't agree that the convenience is worth compromising the overall logical consistency of the interface. I suspect there should exist some way to refactor groupby so that the categorical columns can be ordered internally.

gfyoung commented 8 years ago

What are people's thoughts on this? Should we stick with the current functionality, which is consistent with R but seems to be logically inconsistent, or should it be enforced?

jorisvandenbossche commented 8 years ago

Personally, I would leave it as is. I also think it makes 'some' sense to be able to sort categoricals. To make the analogy, if you have colored cubes, you can 'sort' these to form groups of all cubes of the same color, but you cannot give an 'order' to them. Of course we use the sort method for both 'sort' and 'order' meaning in other cases, which makes it a bit confusing.

Anyway, if we do the above, the docs of sort_values should be updated. And argsort does currently not even work on a categorical series:

In [2]: s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

In [5]: s.argsort()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bb845bcfbc4f> in <module>()
----> 1 s.argsort()

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.py in argsort(self, axis
, kind, order)
   1848         else:
   1849             return self._constructor(
-> 1850                 np.argsort(values, kind=kind), index=self.index,
   1851                 dtype='int64').__finalize__(self)
   1852

C:\Anaconda\envs\devel\lib\site-packages\numpy\core\fromnumeric.pyc in argsort(a
, axis, kind, order)
    906     except AttributeError:
    907         return _wrapit(a, 'argsort', axis, kind, order)
--> 908     return argsort(axis, kind, order)
    909
    910

TypeError: argsort() takes at most 2 arguments (4 given)

Another numpy compat issue .. :-)

gfyoung commented 8 years ago

@jorisvandenbossche : Don't worry about the numpy compat issue. I've got that one covered. :smile: But regarding the semantics of what sort means, okay, that is beginning to make more sense in my head. I agree at the very least that the documentation should make that clearer because that meaning wasn't the first one that came to my head.

gfyoung commented 8 years ago

b>@Everyone</b: I think if it is just a semantics thing, it's just a DOC-change PR then (not an API one as I had originally titled it)?

jreback commented 8 years ago

yes why don't u update doc string and docs as needed to be more clear

gfyoung commented 8 years ago

I have generalized this issue to just questions about Categorical.sort and Categorical.sort_values since this is another question about semantics and wording. In the documentation, the wording does not make any mention of na_position being dependent on ascending = False (i.e. it is only respected provided ascending = False). Just curious, why is that the case?

jorisvandenbossche commented 8 years ago

@gfyoung Because having good docs is hard and needs many eyes. However, for me the question in this case is more: why is na_position only implemented for ascending=False?

gfyoung commented 8 years ago

@jorisvandenbossche : Ah, that's what I actually meant. Sorry, the "it" was a little vague! My question is: why is na_position even dependent on ascending at all?

jreback commented 8 years ago

@gfyoung na_position has to do with the nan sorting, IOW, are nans first or last. So it is tied to ascending by-definition. This is all doced for Series & Index I believe, many not for Categorical. this is why we use shared_docs as much as possible (though not in this case because it wasn't done, though it should be done)

gfyoung commented 8 years ago

@jreback : Does first and last mean that nan is the "largest" or "smallest" element in the _codes? In that case, I can understand the dependence. Nevertheless, it's not being respected when ascending = True, so something has to give here. The documentation at the very least has to be updated to reflect that dependence.

>>> c = pd.Categorical([np.nan, 2, 2, np.nan, 5])
>>> c.sort_values(ascending=True, na_position='first')
# expected : [2.0, 2.0, 5.0, NaN, NaN]
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]
>>> c.sort_values(ascending=True, na_position='last')
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]
jreback commented 8 years ago
In [6]: s = Series([np.nan, 2, 2, np.nan, 5])

In [7]: s.sort_values(ascending=True, na_position='first')
Out[7]: 
0    NaN
3    NaN
1    2.0
2    2.0
4    5.0
dtype: float64

In [8]: s.sort_values(ascending=True, na_position='last')
Out[8]: 
1    2.0
2    2.0
4    5.0
0    NaN
3    NaN
dtype: float64

In [9]: s.sort_values(ascending=False, na_position='first')
Out[9]: 
0    NaN
3    NaN
4    5.0
2    2.0
1    2.0
dtype: float64

In [10]: s.sort_values(ascending=False, na_position='last')
Out[10]: 
4    5.0
2    2.0
1    2.0
0    NaN
3    NaN
dtype: float64
gfyoung commented 8 years ago

@jreback : Ah, okay, so it makes sense to align with the Series API, but observe that na_position is completely independent of ascending.

jreback commented 8 years ago

yep, misspoke, its independent, but would like to tie Series,Index,Categorical doc-strings at the least together

gfyoung commented 8 years ago

@jreback : IMO the Categorical documentation should be separate because the meaning of sort is not the same as the other two per the discussion above.