ENH: standardize fill_value behavior across the API

ResidentMario commented 7 years ago

Problem

In the PR for #15486, I found that type validation for the fill_value parameters strewn across a large number of pandas API methods is done ad-hoc. This results in a wide variety of possible accepted inputs. I think it would be good to standardize this so that all of these methods use the same behavior, the one currently used by fillna.

Implementation Details

Partially the point of providing a fill_value is to avoid having to do a slow-down type conversion otherwise (using .fillna().astype()). However, specifying other formats is nevertheless a useful convenience to have. Implementation would roughly be:

Before executing the rest of the method body, check whether or not the fill_value is valid (using a centralized maybe_fill method). If it is not, throw a ValueError. If it is, check whether or not incorporating the fill_value would result in an upcast in the column dtype. If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like a filla operation at the end before returning.

Target Implementation

The same as what fillna currently does. Which follows.

Invalid:

categorical fill for a category not in the categories will raise a ValueError.
sparse matrices refuse upcasting.
Passing an object or list or other non-coercable "thing" as a fill.

Valid, upcast:

int fill will promote bool dtypes to int.
float fill will promote int and bool dtypes to float (this is what happens with np.nan already).
object (str) fill would promote lesser dtypes to object.
int, float, and bool fill to a datetime dtype will be treated as a UNIX-like timestamp and promoted to datetime.
object fill will promote datetime dtype to object.

Valid, no-cast:

Everything else.

Current Implementation

...is ad-hoc. The following are the methods which currently provide a fill_value input, as well as where they deviate from the model above.

Series.combine, DataFrame.combine, Series.to_sparse: These are unique usages of fill_value which aren't compatible with the rest of them.
Series.unstack, DataFrame.unstack: any fill_value is allowed. You can pass an object if you'd like, or even another DataFrame (yo dawg...).
DataFrame.align: Any fill_value is allowed.
DataFrame.reindex_axis: Lists and dicts are allowed, objects are not.
DataFrame.asfreq, Series.asfreq: any fill_value is allowed.
pd.pivot_table: ...
Series.add, DataFrame.add: ...
Series.subtract, DataFrame.substract: ...
Probably others, there's a lot of these.

jreback commented 7 years ago

this is kind of like _maybe_promote, see here: https://github.com/pandas-dev/pandas/blob/master/pandas/types/cast.py#L230, though in this case its a validator, but same idea.

a lot of the validation is prob occuring now but at a lower level and with no consistency of error messages. A lot of the routines expect certain types for filling, IOW filling floats needs a compatible float/int (or would raise).

So a friendly high level check would be nice. The hard part about this issue is not the code changes, but the tests :>

Also collecting these tests into a standard place would be fine as well (this is tricky because we like to keep the with the types, e.g. in pandas/tests/series, but for example routines in pandas/tests/missing would be nice as well.

ResidentMario commented 7 years ago

I'm hopeful I can figure out how to implement. Why are the tests the hard part? I assume you mean figuring out where to put them, which does sound challenging...might collect them in a separate file instead, just for the meantime.

ResidentMario commented 7 years ago

_maybe_upcast doesn't have guards against upcasting "weird" stuff. So for example the following is legal:

_maybe_upcast(np.array([np]))

When a fill_value parameter is passed, _maybe_upcast is the first and only validation step that parameter has to go through. So since the above is legal, so is whatever garbage you pass it, e.g. Series.shift(fill_value=<class 'garbage_type'>).

In other cases (e.g. fillna) there is additional validation that prevents this from happening.

Should _maybe_upcast (continue to) allow this behavior? This is deep in the internals, so I suspect not touching it would be best, but it does seem like an odd thing to allow, to me.

It wouldn't be too hard to add a separate check to prevent this sort of input from reaching _maybe_upcast at all.

ResidentMario commented 7 years ago

(sorry about the close/open, fat-fingered the wrong button there)

jreback commented 7 years ago

you might be able to add a check here though i suspect have a _maybe_cast_fill which does validation might be easier

internal routines just have implicit (or better yet explicit guarantees that are in the docstring)

ResidentMario commented 7 years ago

FYI, this is legal in fillna right now:

pd.Series([1, 2, np.nan]).fillna(lambda f: f)

Which is counter-factual w.r.t a (separate) TypeError statement in the method body:

    if isinstance(value, (list, tuple)):
        raise TypeError('"value" parameter must be a scalar or dict, but '
                        'you passed a "{0}"'.format(type(value).__name__))

(to fix this you could do if isinstance(value, (list, tuple)) or callable(value))

jreback commented 7 years ago

yeah you prob have to check inclusion rather than exclusion

e.g. is_scalar, is_dict_like, is_list_like

ResidentMario commented 7 years ago

Yeah. Funny little bug with that:

 >>> import import pandas.core.common as com
 >>> com.is_string_dtype(type)
 True

ResidentMario commented 7 years ago

In com.is_string_type:

dtype.kind in ('O', 'S', 'U')

The type checker naively assumes that if you passed it an object, it must have been a string!

jreback commented 7 years ago

is_string_dtype is not strict. It really can't be w/o a lot of code inference (which is not cheap). You can certainly add a comment to it if you'd like. If you really need inference then you can do lib.infer_type which IS strict. (but again is not free, though not too bad as it short-circuits).

jreback commented 7 years ago

@ResidentMario note that imports from pandas.core.common are almost all deprecated, use pandas.types.common

jorisvandenbossche commented 7 years ago

@ResidentMario Is the description at the top of this issue still up to date with how you are trying to implement things in #15587 ?

ResidentMario commented 7 years ago

Hmm. This list is incomplete, and I think there's been a couple of changes there:

Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)
I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

jreback commented 7 years ago

Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)

yes this is a special case atm, you you can simply use is_period_arraylike on an object column to check, and if true, then restrict the fill value.

I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

yes

jreback commented 7 years ago

A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

I think a reasonable way to do this is to:

add a errors='ignore'|'raise'|'force' kw to .fillna* routines
if errors='ignore' (default), then allow filling of a column only if the types match (IOW don't fill datetimes with ints, just skip them)
if errors='raise' then raise on anything that is not compat with the filling
if errors='force' then coerce the columns as needed (even to object).

this would give nice behavior by default of filling things that can take that value and providing error checking otherwise (with an option for force filling, but that's user selected).

The current situation is effectively errors='force'.

In [2]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})

In [3]: df
Out[3]: 
   A          B
0  1 2013-01-01
1  2 2013-01-02
2  3 2013-01-03

In [4]: df.iloc[1] = np.nan

In [5]: df
Out[5]: 
     A          B
0  1.0 2013-01-01
1  NaN        NaT
2  3.0 2013-01-03

In [6]: df.fillna(0)
Out[6]: 
     A          B
0  1.0 2013-01-01
1  0.0 1970-01-01
2  3.0 2013-01-03

In [7]: df.fillna(pd.Timestamp('20130110'))
Out[7]: 
                     A          B
0                    1 2013-01-01
1  2013-01-10 00:00:00 2013-01-10
2                    3 2013-01-03

I suppose we could also make the default errors='raise', though not back-compat . This would be more obvious. errors='ignore' is more convenient.

ResidentMario commented 7 years ago

Ok so then:

New PR implementing an errors param for fillna via a new validator func (#11953).
PR implementing that validator func in the various fill_value routines with the default validator func behavior (#15533; PR#15587).
PR adding the errors param to shift (#15486; PR#15527).

I suggest also adding a new pd.set_option param for letting the user pick their error coercion mode if they want.

pandas-dev / pandas