Open ResidentMario opened 7 years ago
this is kind of like _maybe_promote
, see here: https://github.com/pandas-dev/pandas/blob/master/pandas/types/cast.py#L230, though in this case its a validator, but same idea.
a lot of the validation is prob occuring now but at a lower level and with no consistency of error messages. A lot of the routines expect certain types for filling, IOW filling floats needs a compatible float/int (or would raise).
So a friendly high level check would be nice. The hard part about this issue is not the code changes, but the tests :>
Also collecting these tests into a standard place would be fine as well (this is tricky because we like to keep the with the types, e.g. in pandas/tests/series
, but for example routines in pandas/tests/missing
would be nice as well.
I'm hopeful I can figure out how to implement. Why are the tests the hard part? I assume you mean figuring out where to put them, which does sound challenging...might collect them in a separate file instead, just for the meantime.
_maybe_upcast
doesn't have guards against upcasting "weird" stuff. So for example the following is legal:
_maybe_upcast(np.array([np]))
When a fill_value
parameter is passed, _maybe_upcast
is the first and only validation step that parameter has to go through. So since the above is legal, so is whatever garbage you pass it, e.g. Series.shift(fill_value=<class 'garbage_type'>)
.
In other cases (e.g. fillna
) there is additional validation that prevents this from happening.
Should _maybe_upcast
(continue to) allow this behavior? This is deep in the internals, so I suspect not touching it would be best, but it does seem like an odd thing to allow, to me.
It wouldn't be too hard to add a separate check to prevent this sort of input from reaching _maybe_upcast
at all.
(sorry about the close/open, fat-fingered the wrong button there)
you might be able to add a check here though i suspect have a _maybe_cast_fill which does validation might be easier
internal routines just have implicit (or better yet explicit guarantees that are in the docstring)
FYI, this is legal in fillna
right now:
pd.Series([1, 2, np.nan]).fillna(lambda f: f)
Which is counter-factual w.r.t a (separate) TypeError
statement in the method body:
if isinstance(value, (list, tuple)):
raise TypeError('"value" parameter must be a scalar or dict, but '
'you passed a "{0}"'.format(type(value).__name__))
(to fix this you could do if isinstance(value, (list, tuple)) or callable(value)
)
yeah you prob have to check inclusion rather than exclusion
e.g. is_scalar, is_dict_like, is_list_like
Yeah. Funny little bug with that:
>>> import import pandas.core.common as com
>>> com.is_string_dtype(type)
True
In com.is_string_type
:
dtype.kind in ('O', 'S', 'U')
The type checker naively assumes that if you passed it an object, it must have been a string!
is_string_dtype
is not strict. It really can't be w/o a lot of code inference (which is not cheap). You can certainly add a comment to it if you'd like. If you really need inference then you can do lib.infer_type
which IS strict. (but again is not free, though not too bad as it short-circuits).
@ResidentMario note that imports from pandas.core.common
are almost all deprecated, use pandas.types.common
@ResidentMario Is the description at the top of this issue still up to date with how you are trying to implement things in #15587 ?
Hmm. This list is incomplete, and I think there's been a couple of changes there:
Period
is O
dtype, the implementation there uses object
rules for Period
columns because of that. (probably just need to investigate this further?)datetime64[ns]
columns, not numerical types (so no int
, float
, etc.).A big question right now is whether or not in the case of a DataFrame
we want to validate column-by-column or consolidate the dtype
(probably into object
) and use the rules for filling that instead.
Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)
yes this is a special case atm, you you can simply use is_period_arraylike
on an object column to check, and if true, then restrict the fill value.
I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).
yes
A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.
I think a reasonable way to do this is to:
errors='ignore'|'raise'|'force'
kw to .fillna*
routineserrors='ignore'
(default), then allow filling of a column only if the types match (IOW don't fill datetimes with ints, just skip them)errors='raise'
then raise on anything that is not compat with the fillingerrors='force'
then coerce the columns as needed (even to object
).this would give nice behavior by default of filling things that can take that value and providing error checking otherwise (with an option for force
filling, but that's user selected).
The current situation is effectively errors='force'
.
In [2]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})
In [3]: df
Out[3]:
A B
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [4]: df.iloc[1] = np.nan
In [5]: df
Out[5]:
A B
0 1.0 2013-01-01
1 NaN NaT
2 3.0 2013-01-03
In [6]: df.fillna(0)
Out[6]:
A B
0 1.0 2013-01-01
1 0.0 1970-01-01
2 3.0 2013-01-03
In [7]: df.fillna(pd.Timestamp('20130110'))
Out[7]:
A B
0 1 2013-01-01
1 2013-01-10 00:00:00 2013-01-10
2 3 2013-01-03
I suppose we could also make the default errors='raise'
, though not back-compat . This would be more obvious. errors='ignore'
is more convenient.
Ok so then:
errors
param for fillna
via a new validator func (#11953).fill_value
routines with the default validator func behavior (#15533; PR#15587).errors
param to shift
(#15486; PR#15527).I suggest also adding a new pd.set_option
param for letting the user pick their error coercion mode if they want.
Problem
In the PR for #15486, I found that type validation for the
fill_value
parameters strewn across a large number ofpandas
API methods is done ad-hoc. This results in a wide variety of possible accepted inputs. I think it would be good to standardize this so that all of these methods use the same behavior, the one currently used byfillna
.Implementation Details
Partially the point of providing a
fill_value
is to avoid having to do a slow-down type conversion otherwise (using.fillna().astype()
). However, specifying other formats is nevertheless a useful convenience to have. Implementation would roughly be:Before executing the rest of the method body, check whether or not the
fill_value
is valid (using a centralizedmaybe_fill
method). If it is not, throw aValueError
. If it is, check whether or not incorporating thefill_value
would result in an upcast in the columndtype.
If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like afilla
operation at the end before returning.Target Implementation
The same as what
fillna
currently does. Which follows.Invalid:
categorical
fill for a category not in the categories will raise aValueError
.sparse
matrices refuse upcasting.Valid, upcast:
int
fill will promotebool
dtypes toint
.float
fill will promoteint
andbool
dtypes tofloat
(this is what happens withnp.nan
already).object
(str
) fill would promote lesser dtypes toobject
.int
,float
, andbool
fill to adatetime
dtype will be treated as a UNIX-like timestamp and promoted todatetime
.object
fill will promotedatetime
dtype toobject
.Valid, no-cast:
Current Implementation
...is ad-hoc. The following are the methods which currently provide a
fill_value
input, as well as where they deviate from the model above.Series.combine
,DataFrame.combine
,Series.to_sparse
: These are unique usages offill_value
which aren't compatible with the rest of them.Series.unstack
,DataFrame.unstack
: anyfill_value
is allowed. You can pass an object if you'd like, or even anotherDataFrame
(yo dawg...).DataFrame.align
: Anyfill_value
is allowed.DataFrame.reindex_axis
: Lists and dicts are allowed, objects are not.DataFrame.asfreq
,Series.asfreq
: anyfill_value
is allowed.pd.pivot_table
: ...Series.add
,DataFrame.add
: ...Series.subtract
,DataFrame.substract
: ...Probably others, there's a lot of these.