Closed gfyoung closed 8 years ago
prob an overight, should be using check_for_ordered(..)
internally there.
Given how straightforward the PR is, would it be for v0.18.1
or v0.19.0
?
its a bug, why would it not be 0.18.1?
Just wanted to double check. :smile:
Are we sure this is a bug? We have had a lot of discussions about the API, and certainly also about the ordered implication, min() raising etc. But I can't really remember if we discussed this specific issue.
In any case, I think it can be quite annoying if you cannot sort values (eg to just group them).
And as a comparison, R does allow to sort, but raises for min()
in case of unordered factors.
you could be right @jorisvandenbossche, I don't really remember.
cc @janschulz
@jorisvandenbossche : Is calling .as_ordered()
that problematic? How about my explanation above?
@Everyone : Regardless of whether this is a bug or not, something will have to give. The documentation insists that argsort
and sort
can only be called on ordered Categorical
objects, but the internals do not match.
That was a consious descision sometime after implementing categoricals (and I just saw that the docstring was not changed: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L205)
commit is here: https://github.com/pydata/pandas/commit/87fec5b527dcbb2dd99592e2d7e298cda5ac225d
Yep, I think it started here: https://github.com/pydata/pandas/pull/9611#issuecomment-77928273
But in the end, it was changed here: https://github.com/pydata/pandas/pull/9622 (@janschulz you can see that in the commit (below the commit message))
Perhaps refactoring groupby
would be too difficult, but IMHO don't agree that the convenience is worth compromising the overall logical consistency of the interface. I suspect there should exist some way to refactor groupby so that the categorical columns can be ordered internally.
What are people's thoughts on this? Should we stick with the current functionality, which is consistent with R but seems to be logically inconsistent, or should it be enforced?
Personally, I would leave it as is.
I also think it makes 'some' sense to be able to sort categoricals. To make the analogy, if you have colored cubes, you can 'sort' these to form groups of all cubes of the same color, but you cannot give an 'order' to them. Of course we use the sort
method for both 'sort' and 'order' meaning in other cases, which makes it a bit confusing.
Anyway, if we do the above, the docs of sort_values
should be updated.
And argsort
does currently not even work on a categorical series:
In [2]: s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')
In [5]: s.argsort()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-bb845bcfbc4f> in <module>()
----> 1 s.argsort()
c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.py in argsort(self, axis
, kind, order)
1848 else:
1849 return self._constructor(
-> 1850 np.argsort(values, kind=kind), index=self.index,
1851 dtype='int64').__finalize__(self)
1852
C:\Anaconda\envs\devel\lib\site-packages\numpy\core\fromnumeric.pyc in argsort(a
, axis, kind, order)
906 except AttributeError:
907 return _wrapit(a, 'argsort', axis, kind, order)
--> 908 return argsort(axis, kind, order)
909
910
TypeError: argsort() takes at most 2 arguments (4 given)
Another numpy compat issue .. :-)
@jorisvandenbossche : Don't worry about the numpy
compat issue. I've got that one covered. :smile: But regarding the semantics of what sort
means, okay, that is beginning to make more sense in my head. I agree at the very least that the documentation should make that clearer because that meaning wasn't the first one that came to my head.
b>@Everyone</b: I think if it is just a semantics thing, it's just a DOC-change PR then (not an API one as I had originally titled it)?
yes why don't u update doc string and docs as needed to be more clear
I have generalized this issue to just questions about Categorical.sort
and Categorical.sort_values
since this is another question about semantics and wording. In the documentation, the wording does not make any mention of na_position
being dependent on ascending = False
(i.e. it is only respected provided ascending = False
). Just curious, why is that the case?
@gfyoung Because having good docs is hard and needs many eyes.
However, for me the question in this case is more: why is na_position
only implemented for ascending=False
?
@jorisvandenbossche : Ah, that's what I actually meant. Sorry, the "it" was a little vague! My question is: why is na_position
even dependent on ascending
at all?
@gfyoung na_position
has to do with the nan sorting, IOW, are nans
first or last. So it is tied to ascending
by-definition. This is all doced for Series & Index I believe, many not for Categorical
. this is why we use shared_docs
as much as possible (though not in this case because it wasn't done, though it should be done)
@jreback : Does first
and last
mean that nan
is the "largest" or "smallest" element in the _codes
? In that case, I can understand the dependence. Nevertheless, it's not being respected when ascending = True
, so something has to give here. The documentation at the very least has to be updated to reflect that dependence.
>>> c = pd.Categorical([np.nan, 2, 2, np.nan, 5])
>>> c.sort_values(ascending=True, na_position='first')
# expected : [2.0, 2.0, 5.0, NaN, NaN]
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]
>>> c.sort_values(ascending=True, na_position='last')
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]
In [6]: s = Series([np.nan, 2, 2, np.nan, 5])
In [7]: s.sort_values(ascending=True, na_position='first')
Out[7]:
0 NaN
3 NaN
1 2.0
2 2.0
4 5.0
dtype: float64
In [8]: s.sort_values(ascending=True, na_position='last')
Out[8]:
1 2.0
2 2.0
4 5.0
0 NaN
3 NaN
dtype: float64
In [9]: s.sort_values(ascending=False, na_position='first')
Out[9]:
0 NaN
3 NaN
4 5.0
2 2.0
1 2.0
dtype: float64
In [10]: s.sort_values(ascending=False, na_position='last')
Out[10]:
4 5.0
2 2.0
1 2.0
0 NaN
3 NaN
dtype: float64
@jreback : Ah, okay, so it makes sense to align with the Series
API, but observe that na_position
is completely independent of ascending
.
yep, misspoke, its independent, but would like to tie Series,Index,Categorical
doc-strings at the least together
@jreback : IMO the Categorical
documentation should be separate because the meaning of sort
is not the same as the other two per the discussion above.
In
categorical.py
, we enforce the fact thatself
must be ordered when callingmin
ormax
. However,self
can be unordered when callingsort_values
. This doesn't make sense in my mind, for if you can sort the values for unorderedself
, then I can then find a minimum value ofself
. The same comment applies toargsort
as well.