
API: astype mechanism for extension arrays #22384

Open TomAugspurger opened 5 years ago

TomAugspurger commented 5 years ago

In https://github.com/pandas-dev/pandas/pull/22343, we're bumping into a few problems around casting to / from extension arrays.

  1. Our current default implementation for ExtensionArray.astype doesn't handle target dtypes that are extension types (so extension_array.astype('category') fails; see the sketch after this list). At the moment, each EA has to implement its own astyping, which is error-prone and difficult.
  2. Some EAs may be more opinionated than others about how astyping should be done. There may exist fastpaths between certain EAs, but only one of the source and destination types may know about that fast path. This (I think) calls for a kind of dispatch mechanism, where each side gets to say whether it knows how to handle the astyping.
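
For reference, the base class default at the time was roughly the following (a paraphrase, not the exact pandas source), which shows why extension target dtypes fall through:

import numpy as np

class ExtensionArray:
    def astype(self, dtype, copy=True):
        # Round-trip through a plain numpy array: np.array(...) can only
        # produce numpy dtypes, so e.g. astype('category') cannot work here
        # unless the subclass overrides this method.
        return np.array(self, dtype=dtype, copy=copy)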

I'm not sure how far down this road we want to go. Should we instead encourage users to use explicit constructors like .assign(x=pd.Categorical(...)) rather than .astype('category')?

TomAugspurger commented 5 years ago

To be clear, there are a couple levels we could take on here.

At a minimum, I think .astype('dtype_name'), where dtype_name is in our registry, should work, even if it's inefficient. That can be solved before going down the road of a full dispatch mechanism.

jorisvandenbossche commented 5 years ago

Let's try to summarise the (ideal) requirements:

Does that sound right?


In the PR I mentioned the idea of a "negotiation protocol" (https://github.com/pandas-dev/pandas/pull/22343#issuecomment-412975823) similar to how operators work in Python:

I used here __astype instead of astype, because we don't want the public EA.astype to return NotImplemented (one of the concerns raised in the PR). As Tom noted, it's a bit messy, but not too much I think.
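
A minimal sketch of such a negotiation flow (the hook names _astype_to / _astype_from below are hypothetical stand-ins for the __astype idea, not existing pandas API):

def astype_with_negotiation(values, dtype):
    # 1. ask the source array first (mirroring how binary operators try the
    #    left operand first)
    result = values._astype_to(dtype)
    if result is not NotImplemented:
        return result
    # 2. then ask the target dtype whether it can construct itself from values
    result = dtype._astype_from(values)
    if result is not NotImplemented:
        return result
    raise TypeError(f"cannot cast {values.dtype} to {dtype}")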

The other concern raised is the divergence between Series.astype(dtype) and Series.values.astype(dtype) (where the first would be able to handle more dtypes, e.g. Series.values.astype('category') would not work if the EA didn't implement that itself).

However, I think it should be possible (if the EA author wants that) to have this more consistent, and we could implement a base class astype that does this by calling dtype.construct_array_type()._from_sequence(..), just as we do in pandas.


Another possibility would be that we provide some helper function for EA authors (and for internal use) that basically does the above, but simply inside the astype method. And then just always dispatch to EA.astype.

For example, the EA.astype method could look like:

def astype(self, dtype):
    if dtype ... [check if I know what to do for this dtype]:
        ...
        return values
    else:
        return pandas_astype(self, dtype)

with

def pandas_astype(values, dtype):
    if is_extension_dtype(dtype):
        return dtype.construct_array_type()._from_sequence(values)
    else:
        return np.array(values, dtype=dtype)

But that then counts on the EAs actually doing this.

TomAugspurger commented 4 years ago

Sorry I missed your comment earlier Joris. Your list of ideal requirements sounds correct. Haven't thought much about the implementation yet.

Regardless, not something we can solve for 1.0 so pushing.

Dr-Irv commented 4 years ago

@jorisvandenbossche Based on something I wrote here (https://github.com/pandas-dev/pandas/issues/31204#issuecomment-578253063) with respect to StringDtype, I'm wondering if the rules you wrote above should be reversed.

You wrote:

When doing an astype operation (eg Series(EA).astype(dtype)), we first check if the values (the array backing the series) know how to do this conversion:

- if values is a numpy array, we call astype if dtype is a numpy dtype, otherwise go further with the next step below
- if values is an EA, we call EA.__astype(dtype) -> if the EA knows how to handle the specific dtype, it returns a new EA/array; if it does not know how to convert itself to dtype, it returns NotImplemented and we continue with the next step

Next, we check if the target dtype knows how to convert the values to its own dtype -> if it knows, it returns a new EA/array; otherwise it returns NotImplemented

I think we ought to consider doing the reverse, namely:

When doing an astype operation (eg Series(EA).astype(dtype)), we first check if the target dtype knows how to convert the values to its own dtype -> if it knows, it returns a new EA/array; otherwise it returns NotImplemented

Then check if the values (the array backing the series) know how to do this conversion:

In the case of StringDtype, because the elements of an EA should have __str__ implemented, the conversion is straightforward. Asking each EA to "know" about StringDtype seems like overkill.

As I wrote there, I would think that the behavior of pd.Series(some_list, dtype=EAtype) and s.astype(EAtype) would be similar, namely if the EAtype knows how to convert some_list, then it just does it, so we prioritize letting the target dtype do the work, not the other way around.

giuliobeseghi commented 4 years ago

Just a few comments on this, not sure if they're helpful.

Input parameters

pd.Categorical and astype are not equivalent. You can't pass any "input" argument to astype ("input" defined as an argument that influences the conversion, like categories or ordered in pd.Categorical).

Should these arguments be allowed with astype (possibly as args and kwargs)? Probably not, because this is in conflict with how astype is designed. But then you don't have the functionality of pd.Categorical. So +1 on pd.Categorical, which is more versatile.

Return

pd.Categorical and astype return a different object.

import pandas as pd

series = pd.Series(["a", "c", "b", "c", "a", "b", "a", "c"])

explicit = pd.Categorical(series)  # result is a Categorical array
implicit = series.astype("category")  # result is a Series with category dtype

pd.Categorical converts the input to a categorical array. astype preserves the input type, which I think is more in line with how pandas is designed, as it can be chained properly. .assign(x=pd.Categorical(...)) would convert the categorical array to a series, so it could be chained too, but it obviously works only with dataframes and not series. So +1 on astype.

Consistency with pd.to_datetime

I've always found type conversion with pandas types a bit confusing. For example, with datetimes:

series = pd.Series([str(y) for y in range(2000, 2020)])
series.astype('datetime64[ns]')  # this works
pd.to_datetime(series)  # this works too

Which one should I use? I guess they're interchangeable, because no warning is raised. But it seems to me that pd.to_datetime is somehow preferred.

By the way, why not a pd.to_categorical like pd.to_datetime? You could pass the args of pd.Categorical and you would always get the right type as output. Also, it unifies the convention of using a "to_pandasdtype" method to convert pandas objects to the right pandas type. I'm kind of +3 on this, before hearing your thoughts :)

TomAugspurger commented 4 years ago

You would astype to a CategoricalDtype, not Categorical.
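
For example, the "input" arguments mentioned above (categories, ordered) are carried by the dtype object itself, so they are available through astype after all:

import pandas as pd

series = pd.Series(["a", "c", "b", "c"])

# the dtype instance holds the parameters that pd.Categorical takes as arguments
dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
result = series.astype(dtype)  # still a Series, now with an ordered categorical dtype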

giuliobeseghi commented 4 years ago

Ok, I got confused with pd.Categorical. Sorry for that!

jbrockmendel commented 3 years ago

@jorisvandenbossche some suggestions to add to the list of desired behavior:

jbrockmendel commented 3 years ago

@jorisvandenbossche @TomAugspurger on the last nullable call we discussed this briefly, including the need to have a mechanism analogous to ndarray.astype's casting kwarg (in some places we have an errors keyword that fills a similar role)

In #40110 I'm leaning towards becoming (with a deprecation cycle) stricter in the constructors, disallowing certain Series(values, dtype=foo) usages and directing users to use Series(values).astype(foo) instead. This breaks an equivalency we usually like, but ATM I think it is the least-bad alternative.

I don't want to go down that road until I'm reasonably confident it will be consistent with what we're doing with EA.astype/_from_sequence. Do we have consensus on making astype weakly less restrictive than _from_sequence?

jorisvandenbossche commented 3 years ago

In the potential future world of having an "astype mechanism", that could include some way to specify a casting/errors behaviour, and then both astype() and the constructor (_from_sequence) could still rely on the same mechanism, but with potentially a different default for this casting/error behaviour?

(all a bit hypothetical, but to say that I think we can move towards different default behaviour (if desired) without giving up on unifying the underlying implementation mechanism)

I think I am personally fine to only allow "safe" casts by default in the constructor, to not have surprising change of values, but will take a look at the discussion in #40110 and try to comment there.

jorisvandenbossche commented 2 years ago

Warning, long wall of text coming .. This is a first attempt to write down some thoughts and considerations related to casting in pandas (not yet a concrete proposal).

Different casting "kinds"

Numpy's astype has a casting keyword to specify what kind of casting can be allowed. Numpy distinguishes: "equivalent" (only byte-order changes, not relevant for pandas I think), "safe" (preserving values), "same_kind" (safe cast within a kind) and "unsafe" (any data conversion is allowed).
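
For reference, numpy decides the casting safety purely from the dtypes involved, not from the values:

import numpy as np

arr = np.array([1.5, 2.0, 3.7])

arr.astype(np.int64)                  # casting="unsafe" is the default: silently truncates
arr.astype(np.int64, casting="safe")  # raises TypeError: float64 -> int64 is not a safe cast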

Thinking through the different kinds of casting we encounter in pandas (especially if we consider different kinds of "safe"):

Use cases of casting

Casting happens in several places, and the different cases might require that we add the different casting "kinds" from above as explicit options:

There is one more specific casting-like use case, which is what is being discussed in https://github.com/pandas-dev/pandas/issues/33254 (the possible _from_scalars). This is kind of an "object" -> "target dtype" cast, but with a specific requirement of being strict about the scalars in the sequence (this is even more strict than the "strict safe" casting from above, because different types can still evaluate as equal, eg Timestamp and datetime.datetime, or ints and round floats). So not fully sure we should include this in the use cases / casting kinds, or consider it separately.
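
A small illustration of why that is stricter than value-based safety:

from datetime import datetime
import pandas as pd

# equal values but different scalar types: an equality-based "safe" check would
# accept these, while a strict scalar-type check (as for _from_scalars) would not
pd.Timestamp("2012-01-01") == datetime(2012, 1, 1)  # True
1 == 1.0                                            # also True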

Some other random thoughts / observations

With the recent dtype refactor in numpy, there is a lot of interesting prior art to look at (especially the content in https://numpy.org/neps/nep-0042-new-dtypes.html). For an actual extensible casting mechanism, we might want to look at what numpy is doing and mimic that in a python-level interface on our dtypes in a simplified manner (although it might still be a bit overly complicated for our use cases). I will try to summarize next what such a simplified numpy-like python interface could look like.

TomAugspurger commented 2 years ago

Thanks for writing this up. A couple of thoughts:

In general, I agree that we should aim for (strict) safe casting by default. I think I'm more OK with erroring during casting, when e.g. an overflow is detected, than Joris is. That makes casting dtype-stable, but it might raise an exception.

And agreed that astype could grow a keyword to control this. If we move to strict by default everywhere (including concat) then we can direct users to astype prior to the concat.

I do wonder how much of this we have to figure out in order to implement "cast between two arbitrary dtypes". I agree that we should at least have a policy of how libraries should behave.

jorisvandenbossche commented 2 years ago

In general, I agree that we should aim for (strict) safe casting by default. I think I'm more OK with erroring during casting, when e.g. an overflow is detected, than Joris is. That makes casting dtype-stable, but it might raise an exception.

Yes, I think I am also fine with runtime-errors (eg for overflow). I find it more important to keep the result (if there is no error) dtype-stable.

About a possible default casting level, I assume that the "loose" safe casting might be fine as well for astype? For example converting ints to string with astype("string") should maybe not require an explicit keyword to allow this? This is different for concat though, where I think we only want strict safe casting? So concatting ints with strings would continue giving object dtype. Or would you propose to stop using this "fallback to object if no common safe cast" behaviour and raise an error in those cases? (eventually, after a deprecation)
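
To illustrate the current fallback behaviour referred to here:

import pandas as pd

ints = pd.Series([1, 2])
strs = pd.Series(["a", "b"])

# no common safe cast between int64 and object (strings), so concat falls back to object
pd.concat([ints, strs]).dtype  # dtype('O')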

And agreed that astype could grow a keyword to control this. If we move to strict by default everywhere (including concat) then we can direct users to astype prior to the concat.

After the meeting last week, I was thinking about other places (apart from astype) where such a casting keyword could also be added. And so we could also have a similar casting keyword in concat (instead of having the user cast each of the columns of the objects manually in advance)? So that you could control whether to only allow strict safe casts, or fall back to object if not, or to error if not, ..

Although the question might be what should happen if the strict safe cast inside a concat operation does give a runtime error. In that case, we might not want to fall back to object (as we currently do for casts that we know in advance are not safe), as that would defeat the dtype stability of the operation (so in such a case, requiring an explicit astype from the user in advance might be the only option).

jbrockmendel commented 2 years ago

where such a casting keyword could also be added

The thing that comes to mind is the errors keyword in e.g. to_datetime, which serves a similar purpose (though I doubt it'd be worth the trouble to deprecate that API)

Some thoughts:

1) How much of a performance hit does it take to check for overflows etc? IIRC there was an issue about our dt64 casting being slower than numpy's that was driven by the fact that ours does that check.

2) Do we want to preserve (or improve, since it doesn't always work) the equivalence pd.Series(values).astype(dtype) \equiv pd.Series(values, dtype=dtype)? If so, does that mean any keywords added to astype need to be added to the constructors?

3) Is it viable to list out the types of safety or user-configurability we want? If not, then does that mean we need to just pass through something like kwargs? Some non-standard but non-crazy things we could do via astype (not advocating, just brainstorming):

- dt64series.astype(dt64tz, tz_localize=True, ambiguous=...)
- series.astype(str, na_rep=..., date_format=...)

4) There are some astypes that numpy allows that we don't (in some cases deprecated but not yet enforced)

Would we want to have keywords to re-allow these?

5) I think at one point there was a discussion of having floats.astype(int) fail if it was lossy, requiring users to first do floats.round().astype(int). I don't think we should do this, but it should go on the list of potential types of "safety"

jorisvandenbossche commented 2 years ago

Thanks, that's useful!

The thing that comes to mind is the errors keyword in e.g. to_datetime, which serves a similar purpose (though I doubt it'd be worth the trouble to deprecate that API)

The other way around, that also brings up the question whether the behaviour of errors="coerce" (converting the non-castable/convertible values to NA/NaN) is something we want in astype as well. I would maybe say no, and leave that for the more specialized conversion functions.
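
For reference, this is the to_datetime behaviour in question:

import pandas as pd

s = pd.Series(["2021-01-01", "not a date"])
pd.to_datetime(s, errors="coerce")
# 0   2021-01-01
# 1          NaT
# dtype: datetime64[ns]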

1. How much of a performance hit does it take to check for overflows etc? IIRC there was an issue about our dt64 casting being slower than numpy's that was driven by the fact that ours does that check.

It will be slower, that's for sure. As a quick dummy check to get an idea of magnitude, for checking for overflow of integers:

>>> arr = np.random.randint(0, 1000, size=1_000_000)
>>> result = arr.astype("int32")
>>> (result == arr).all()
True

In [4]: %timeit arr.astype("int32")
727 µs ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit (result == arr).all()
1.44 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it certainly could give a significant overhead. To check with an actual implementation, Arrow has both versions of the cast, and that gives quite similar results:

In [6]: pa_arr = pa.array(arr)

In [7]: %timeit pa_arr.cast("int32")
1.93 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit pa_arr.cast("int32", safe=False)
789 µs ± 7.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now, even if we would make safe casts the default (and thus check for this), there will always be the option with some keyword to disable this if you care about performance + know that you won't have values that can overflow (but you need to know that, of course).

Another note here is that we already check for NaN/Inf in float->integer conversion (also related to the last point below), so our astype is also slower than the numpy one in this case (almost 2x slower). So if we change the current isfinite check to an equality check like the above, the performance difference with the current situation might actually be limited for this specific case.

2. Do we want to preserve (or improve, since it doesn't always work) the equivalence pd.Series(values).astype(dtype) \equiv pd.Series(values, dtype=dtype)? If so, does that mean any keywords added to astype need to be added to the constructors?

Ideally yes, I think that should still be a goal. And the behaviour of Series() should be consistent with at least one of the casting levels of astype? If we change astype to safe casting, and we would like to keep the same casting level in Series() by default, that might of course also change some behaviour there. Although it seems that for some cases we actually already do safer casting in Series. For example, pd.Series([1.0, 2.5], dtype=int) raises an error, and pd.Series([1, 1000]).astype(dtype="int8") gives a warning that it will raise in the future. So changing astype to be safer by default, might actually make it easier to have those consistent (didn't check other dtypes, though).

3. Is it viable to list out the types of safety or user-configurability we want? If not, then does that mean we need to just pass through something like kwargs? Some non-standard but non-crazy things we could do via astype (not advocating, just brainstorming):

Personally, I think I would prefer to keep astype simple/generic, and have specialized conversion methods for things (like the string formatting example) that require more custom user control (that also makes those easier to document).

So that means that the levels of safety will need to be somewhat generic as well. We certainly need to try to list the levels we might want (will try to do that in a separate comment).

4. There are some astypes that numpy allows that we don't (in some cases deprecated but not yet enforced)

pd.Series([1, 2, np.nan]).astype(np.int64)

We currently raise for this case, and I think we want to keep doing that? In principle we could relax that and follow numpy with an opt-in unsafe casting level.

pd.Series([1, 2, 3], dtype="M8[ns]").astype(np.int64)

We currently raise a warning for dt64 -> int64 saying that this will raise in the future and that view should be used instead. Do you know what the rationale was for this deprecation?

5. I think at one point there was a discussion of having floats.astype(int) fail if it was lossy, requiring users to first do floats.round().astype(int). I don't think we should do this, but it should go on the list of potential types of "safety"

Yes, I think that should certainly be one of the casting levels, but I have to think a bit about where it could fit (in general, float64->int64 is an unsafe cast, but depending on the values it can be safe ..). About having this stricter check be the default or not, I think it could be a useful default as well (and I think @TomAugspurger you indicated to be in favor of safe casting above; does that include float->int casting?). For example, assume you have integer values that for some reason are stored as float (certainly in pandas/numpy, there are good reasons for this, such as missing values in the original data). With a safe default, Series[float].astype(int) would actually ensure that you only had integer values in the series, and that you are not losing any data. If you are fine with rounding values, you can still pass a keyword to enable the unsafe casting, or you can more explicitly call round (or ceil or floor, if we supported those).
Another reason why it might actually be good that the user has to do this explicitly is that astype(int) and round().astype(int) don't give the same result for all values .. (it seems astype effectively rounds towards zero, while the round behaviour is to round half to even)
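
A small example of that difference:

import numpy as np

arr = np.array([1.5, 2.5, -1.7])

arr.astype(np.int64)            # array([ 1,  2, -1]) -- truncates towards zero
np.round(arr).astype(np.int64)  # array([ 2,  2, -2]) -- rounds (ties to even), then casts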

jbrockmendel commented 2 years ago

We currently raise a warning for dt64 -> int64 saying that this will raise in the future and that view should be used instead. Do you know what the rationale was for this deprecation?

The idea is that 'astype' in general means something like 'the same information but in a different format' and that doesn't make sense with i8 <-> M8

jorisvandenbossche commented 2 years ago

I agree that in general (at least the default safe casting) should be about the "same information in a different format", but I think that's not a hard line, and you could certainly argue that the integers hold the same information (given you know the resolution, tz, etc), although that's certainly debatable :)

But, on the specific datetime -> integer deprecation:

jbrockmendel commented 2 years ago

There is no ambiguity around what the expected result would be IMO (for naive datetimes / timedelta)

ehh, that's true if you know how dt64 is stored under the hood. It's gibberish for users who have never heard of a unix timestamp. I have no real guess how common this knowledge is among users.

The other way around (integer -> datetime / timedelta) is not deprecated

Yah, IIRC the reason we didn't/haven't done that deprecation is bc we don't want to deprecate i8frame.astype(dt64) without also deprecating DataFrame(i8data, dtype='M8[ns]'), and there was reticence about the latter. I agree the inconsistency here isn't great.

The other thing that comes to mind is what to do with NaT in a dt64.astype(np.int64). With astype_nansafe we refuse to cast np.nan to a nonsense integer; would you want to do the same for NaT?

jorisvandenbossche commented 2 years ago

I don't really know how common the knowledge is of the datetime64 details, but there are certainly advanced use cases of wanting to convert it to integers. Also, if you do astype(int) on a datetime column, you explicitly want to convert it and you probably only do this if you have some knowledge about it / know what this conversion could do.

The other thing that comes to mind is what to do with NaT in a dt64.astype(np.int64). With astype_nansafe we refuse to cast np.nan to a nonsense integer; would you want to do the same for NaT?

That's a good point. Yes, I think we should also raise by default here (and have a way to turn off the check / error to get the unsafe cast)

jbrockmendel commented 2 years ago

but there are certainly advanced use cases of wanting to convert it to integers.

I see this as exactly a use case of .view('i8').

Is there a viable option to punt on this before we go around in circles?

jorisvandenbossche commented 2 years ago

Is there a viable option to punt on this before we go around in circles?

We can at least move it to a separate discussion. Will open a new issue or PR for this.

jorisvandenbossche commented 2 years ago

An overview of some dtype casting combinations, and the kind / safety of casting that they could allow:

jorisvandenbossche commented 2 years ago

General observation: it might actually be sufficient to consider only two kinds of casting: supported and unsupported casts (i.e. casts that always raise a TypeError). And for the supported casts, we could always require a safe cast by default (i.e. value preserving), with the option to turn off checks to get unsafe cast.

If you look in the overview above for the casts that can work in some cases, numpy considers some safe and some same_kind or unsafe. But in practice, for all those cases, it can be equality-preserving (depending on the actual value). So from pandas' point of view, I would not make this distinction up-front about safe/unsafe casts (based on the dtypes), but only handle runtime errors (out of bound values, values that would overflow or get truncated, etc).
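
A minimal sketch of what "supported cast, safe by default, with an opt-out" could look like (illustrative only, not an existing pandas API):

import numpy as np

def astype_checked(arr, dtype, safe=True):
    # perform the (possibly lossy) cast, then verify the values survived;
    # overflowed or truncated values fail the round-trip comparison
    result = arr.astype(dtype)
    if safe and not (result == arr).all():
        raise ValueError(f"values changed when casting from {arr.dtype} to {dtype}")
    return result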

TomAugspurger commented 2 years ago

I think we should agree on a couple principles, and work from those:

  1. pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  2. pandas should by default error at runtime when type casting causes loss of equality (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

Using these rules, especially 2, I think I agree with Joris' point:

I would not make this distinction up-front about safe/unsafe casts (based on the dtypes), but only handle runtime errors (out of bound values, values that would overflow or get truncated, etc).

jbrockmendel commented 2 years ago

@seberg im curious if any of the types of safety discussed here are going to be handled differently in new numpy casting stuff you're working on

Dr-Irv commented 2 years ago
  • pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).

I agree with this, and I would like to bring up what I wrote at https://github.com/pandas-dev/pandas/issues/22384#issuecomment-578774638 , since I think it is related to the current discussion:

We have to decide where the astype is implemented. Let's say that you have

s1 = pd.Series(somelist, dtype=dtype1)
s2 = s1.astype(dtype2)

where dtype1 and dtype2 are arbitrary types. They could be integer, datetimes, or any EA type.

I believe that in pandas today, the following happens (please correct me if I'm wrong):

I think the second behavior should change. When astype is called on a series of dtype1, the underlying EA implementing dtype2 is responsible for converting objects of type dtype1 to dtype2. Then the same implementation for converting lists/arrays of any type in constructing a Series of dtype2 can be used to convert a Series of any type calling astype(dtype2).

I believe that if we think of it this way, we accomplish the goal of making Series(values, dtype=dtype) and values.astype(dtype=dtype) consistent, and then it is up to the "receiving" dtype of astype() to decide (and possibly document) how it will convert objects of various types.

This then lets us separate discussions into questions about how each receiving type should convert objects of different types based on the receiving type.

Hope this makes sense.

seberg commented 2 years ago

The casting safety is something I have not really attempted to fix/attack. The one general comment that I have (which is probably either useless or enlightening) is that I consider "common dtype" and casting safety as completely unrelated. The other one is that "safe" could also mean safe from a type perspective (i.e. bool → int is not really safe), but actually means more "round-tripping" right now.

From a user-perspective, I also think that an "attempt" casting safety would make sense (I guess your safe=True), which tries the cast, but fails if the result would have lost any information (i.e. a typical "safe" cast should never do this, but many unsafe casts could be attempted).

I designed parts of the new API in a way that we could add new casting safety levels at any time, but attempting a cast that may be safe is a bit different, since it requires propagating the information to the actual casting function: It is possible to grow in NumPy, but we have to design new API!

pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).

This one is a tricky one :(. NumPy force-coerces (effectively force casts) for np.array(values, dtype=dtype). I would love to get away from this also. This requires some API decisions though, and it is trickier than for normal casting. We can add that new API though! (My main annoyance is that the normal casting code can be designed in an extensible way, but I am not sure that is feasible for the scalar coercion code.)

(Not sure these notes actually help, basically, I have some thoughts about it, but little API decisions. And I think NumPy should try to move a bit eventually, but I am not exactly sure how/where.)

EDIT: sorry, the comment about common dtype is esoteric, probably just ignore it... It has some reason in: arr.astype(dtype=Integral, casting="safe") which could fall back to the common-dtype operation to decide which integral to use before checking casting.

seberg commented 2 years ago

Maybe I should just say what I do for NumPy's normal casting: I return the casting level from its implementation. I.e. a cast is described by an object:

class Int64ToFloat64Cast(ArrayMethod):
    DTypes = [Int64, Float64]  # classes, not instances

    def resolve_descriptors(self, input_output_dtypes):
        # Handles all preparation logic, but does not know the values;
        # returns the dtypes supported by the C cast and the casting safety.
        return (Int64, Float64()), "unsafe"

    def get_loop(...):
        # The C cast function (basically a ufunc inner-loop).
        return c_function_that_does_the_cast

So the resolve_descriptors tells NumPy whether it is a safe cast (or actually if it is just a view). That could indicate if an "attempt" cast is feasible, but I would require some new API for get_loop or the actual inner-loop to handle the rest.

(The distinct resolve_descriptors step is very important, because it allows, for example: Chaining multiple casts and allocating the result arrays, details of which may be important for get_loop)

jbrockmendel commented 2 years ago

A suggestion for the list of principles: obj.astype(dtype).dtype == dtype (doesn't currently hold for Sparse, xref #34457)

adamgranthendry commented 2 years ago

Per #44117, please consider adding a keyword argument errors (default value raise) for astype, to optionally convert all values to a specified type (using pd.NA or np.nan where a value cannot be cast) at creation. (Users often know the data types they want to convert their columns to at creation.)

e.g. Currently, the following raises a ValueError:

>>> a = pd.Series(['a', 1, 3, ''], dtype=np.int)
ValueError: invalid literal for int() with base 10: 'a'

However, it would be nice to have (something like) the following capability:

>>> a = pd.Series(['a', 1, 3, '']).astype(np.int, errors="coerce")
>>> a
0    NaN
1    1
2    3
3    NaN
dtype: int32

Such an enhancement would centralize functionality existing in the multiple to_...() methods and also mimic the converters argument of read_excel.

(ASIDE: DataFrame.convert_dtypes is insufficient since it doesn't have coerce functionality, so the Series a in the above example would simply be converted to type object rather than the desired dtype)

jbrockmendel commented 2 years ago

xref #45151

jbrockmendel commented 2 years ago

Re-reading this thread, I think it can be split into two more-or-less distinct issues:

1) How do we determine whether .astype dispatches to _from_sequence or the other way around?

2) What types of safety/coercion do we want to enforce/allow?

For 1), ATM the base class EA.astype dispatches to _from_sequence, so de facto I think we've landed on a Reasonable Default (tm). Is there still a need for something like a negotiation protocol? I move that we declare this problem solved until a compelling use case is spotted.

For 2), I'm increasingly skeptical of a one-size-fits-all keyword argument with a fixed set of options. I'm wavering between a) "one-size-fits-most to handle the most common 90%", b) "something customizable", c) "the status quo isn't so bad"

jorisvandenbossche commented 2 years ago

So for the second point:

2. What types of safety/coercion do we want to enforce/allow?

I opened a new, dedicated issue (-> https://github.com/pandas-dev/pandas/issues/45588), so we can further focus here on the first point (potential dispatch mechanism)

Per #44117, please consider adding a keyword argument errors (default value raise) for astype, to optionally convert all values to a specified type (using pd.NA or np.nan where a value cannot be cast) at creation.

@adamgranthendry yes, we have always had a bit of a divide between "specialized" conversion functions (eg to_datetime, to_numeric) that have this functionality (and more) versus the generic casting (astype) method, and to what extent astype should be able to do everything that the more specialized ones can do. I think we also have older issues about this topic, but for now I reopened your issue, because I think we can keep that as a separate discussion (whatever mechanism we decide on here, there will always be some place to pass extra keywords, just as we will have to pass a keyword for casting safety).

jorisvandenbossche commented 2 years ago

On the actual topic of a "mechanism" to dispatch to the source or target dtype to perform the actual cast, I made some exploratory code mock-ups last week, to think through what a potential mechanism could look like and what the advantages / disadvantages would be.

Numpy-like CastImplementation registry

First, I tried to make a mockup of a mechanism that mimics numpy (https://numpy.org/neps/nep-0042-new-dtypes.html#the-cast-operation), but a little bit simplified. It keeps the idea of a CastImplementation class, instances of which can be registered for a specific (source_dtype -> target_dtype) combination.

The abstract API for a base class and the registry:

class CastImplementation:

    def resolve_dtypes(
        self, dtypes: Tuple[ExtensionDtype, ExtensionDtype], **kwargs
    ) -> Tuple[ExtensionDtype, ExtensionDtype]:
        """given the from/to dtypes, for which dtypes is the cast implemented"""
        pass

    def do_cast(
        self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs
    ) -> ExtensionArray:
        """does the actual cast"""
        pass

cast_registry = {}

def register_cast_impl(dtypes, cast_impl: CastImplementation | None):
    """Register a cast implementation for a (source, target) dtype combination.

    Register `None` to explicitly mark the combination as an unsupported cast.
    """
    if dtypes in cast_registry:
        raise ValueError("cast already registered")
    cast_registry[dtypes] = cast_impl

def get_cast_impl(dtypes) -> CastImplementation | None:
    return cast_registry.get(dtypes, DefaultFallBackCast)

(the DefaultFallBackCast could be one that falls back to _from_sequence of the target dtype)
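
For instance, a minimal sketch of that fallback:

class DefaultFallBackCast(CastImplementation):

    def resolve_dtypes(self, dtypes, **kwargs):
        # no specific knowledge about parametrizations, so pass them through
        return dtypes

    def do_cast(self, array, dtype, **kwargs):
        # delegate to the target dtype's _from_sequence
        return dtype.construct_array_type()._from_sequence(array, dtype=dtype)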

Dtypes can then implement and register those; here, as a simple example, a subset of what would be registered for (Float, Int):

class FloatToIntCast(CastImplementation):

    def resolve_dtypes(self, dtypes, **kwargs):
        # non-parametrized dtypes, so always return the input
        return dtypes

    def do_cast(self, array, dtype, **kwargs):
        return astype_float_to_int(array, dtype, **kwargs)

for dtype_from in [pd.Float32Dtype, pd.Float64Dtype]:
    for dtype_to in [pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype, pd.Int64Dtype]:
        register_cast_impl((dtype_from, dtype_to), FloatToIntCast)

and then the actual astype method can be implemented as a single function that handles all potential dtypes:

def astype(arr, dtype, **kwargs):
    # get the cast implementation class (based on dtype classes, not instances)
    cast = get_cast_impl((type(arr.dtype), type(dtype)))

    # if cast is known to be not supported:
    if cast is None:
        raise TypeError(f"Cannot cast from {arr.dtype} to {dtype}.")

    # in case of parametrized dtypes, check from which to which dtype we can cast
    resolved_dtype_from, resolved_dtype_to = cast.resolve_dtypes((arr.dtype, dtype))

    # if needed, first cast the input array to a different parametrization
    if arr.dtype != resolved_dtype_from:
        cast_from = get_cast_impl((type(arr.dtype), type(arr.dtype)))
        arr = cast_from.do_cast(arr, resolved_dtype_from)

    # do the main cast
    arr = cast.do_cast(arr, resolved_dtype_to)

    # if needed, cast the result array to a different parametrization
    if dtype != resolved_dtype_to:
        cast_to = get_cast_impl((type(resolved_dtype_to), type(resolved_dtype_to)))
        arr = cast_to.do_cast(arr, dtype)

    return arr

This is a quite complicated API (and, with the class-based approach, also quite different from other interfaces we currently have in pandas), and maybe too complicated for what we need. But some nice things about it:

While in the code snippet above the cast implementation is a class, this could maybe also be simplified to registering cast functions for given dtypes (a kind of function dispatch based on the (source, target) dtypes).

Simplified cast from/to ExtensionDtype methods

Something simpler, but still with the possibility to have both the target dtype and the source dtype implement the actual cast, would be to add two methods for this to the ExtensionDtype class:

class ExtensionDtype:

    def _cast_from(self, array: ExtensionArray, dtype: ExtensionDtype, **kwargs) -> ExtensionArray | None:
        """cast from this dtype to another dtype"""
        if isinstance(dtype, IntegerDtype):
            return astype_to_int(array, dtype)
        elif isinstance(dtype, StringDtype):
            result = to_string(array)
            # calling out to the string dtype's cast to ensure we have the exact dtype parametrization
            return result.astype(dtype, copy=False)
        elif ...

        return None

    def _cast_to(self, array: ExtensionArray, **kwargs) -> ExtensionArray | None:
        """cast from another dtype to this dtype"""
        ...
        return None

Those two methods either return a resulting array if they know how to cast (i.e. if they are the dtype that implements the cast) or otherwise they return None. And then, the actual astype method could also be implemented generically as:

def astype(arr, dtype, **kwargs):

    # does the target dtype know how to cast to it?
    result = dtype._cast_to(arr, **kwargs)
    if result is not None:
        return result

    # else try if the array itself knows how to cast to other
    result = arr.dtype._cast_from(arr, dtype, **kwargs)
    if result is not None:
        return result

    # fall back to from_sequence
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, **kwargs)

This is a bit simpler than the numpy-like CastImplementation mechanism, and might be sufficient for pandas. Note however that while the actual astype here looks simpler than in the version above, this is also because some of the complexity is moved into _cast_from/_cast_to of each dtype class. For example, the case for parametrized dtypes that I mentioned above now needs to be handled manually in each _cast_from.

Formalizing the current situation

As @jbrockmendel brought up just above, formalizing what we currently already do basically in practice for most dtypes is also an option:

For 1), ATM the base class EA.astype dispatches to _from_sequence, so de facto I think we've landed on a Reasonable Default (tm). Is there still a need for something like a negotiation protocol? I move that we declare this problem solved until a compelling use case is spotted.

In practice, this means that every ExtensionArray subclass needs to define its own astype starting with a couple of checks for dtypes it knows how to handle, and otherwise let the target dtype do it:

class ExtensionArray:

    def astype(self, dtype: ExtensionDtype, **kwargs) -> ExtensionArray:
        # first handle the cases where the array itself knows how to cast
        # to the target dtype
        if isinstance(dtype, ...):
            return ...
        elif ...

        # if we don't know ourselves, leave it to the target dtype
        return dtype.construct_array_type()._from_sequence(self, dtype=dtype)

(the above leaves out the fact that we actually convert to a numpy.array for a numpy target dtype, but it is just illustrative of the flow when handling extension dtypes)

If we would decide this is Good Enough, then I would argue for using a different method than _from_sequence, but specific for this use case (and can still use _from_sequence in the base class implementation for this for back compat), in case an ExtensionDtype would like to distinguish behaviour for casting. Although this is maybe then just the ExtensionDtype._cast_to from the proposal above. And if you put the initial if/else block in ExtensionDtype._cast_from for code organization, then you basically have the same as above.

But it is indeed good to ask ourselves: is a negotiation protocol actually needed? (apart from a single _from_sequence-like fallback method)

I think the main use case I can think about now for the more complicated interface (the first proposal with CastImplementation) is the nicety about not having to know each parametrization of dtypes (the .astype("string[pyarrow]") example). But I don't know if that is sufficient to warrant an actual more complex protocol.

Dr-Irv commented 2 years ago

In practice, this means that every ExtensionArray subclass needs to define its own astype starting with a couple of checks for dtypes it knows how to handle, and otherwise let the target dtype do it:

If we would decide this is Good Enough, then I would argue for using a different method than _from_sequence, but specific for this use case (and can still use _from_sequence in the base class implementation for this for back compat), in case an ExtensionDtype would like to distinguish behaviour for casting. Although this is maybe then just the ExtensionDtype._cast_to from the proposal above. And if you put the initial if/else block in ExtensionDtype._cast_from for code organization, then you basically have the same as above.

I think it is "good enough".

You also have to consider what happens for user-implemented extension dtypes. If we just require the extension dtypes to implement _cast_to(), meaning that if you call Series.astype(MyExtensionDtype), then pandas just calls _cast_to() under the hood, then the responsibility of converting from any dtype to the user-implemented dtype is delegated to the extension dtype implementation. _cast_to() can convert dtypes it knows how to handle, and raise an exception for ones it doesn't. For example, imagine two user-implemented extension dtypes MyExtensionDtype_1 and MyExtensionDtype_2. Maybe it is easy to go from MyExtensionDtype_1 to MyExtensionDtype_2 but not easy to go from MyExtensionDtype_2 to MyExtensionDtype_1. The implementations of _cast_to() for each of those classes decide which types they can convert and which types they can't.

I don't think you then need to have _cast_from() at all (or at least I don't see the benefit). For me, if I'm implementing an extension dtype, I don't want to think about how to convert from my type to all the other types that might be out there. But I'm willing to think about how to convert specific types to my extension dtypes.

seberg commented 2 years ago

I will note one or two things about my API for NumPy.

But, while I think my API choice is awesome ;), I am not sure which parts fit to pandas ExtensionArrays. The important point is that NumPy provides the container and DTypes are designed completely container agnostic. So, for example, NumPy has to allocate the result array, but the ExtensionArray approach ties the extension DType to a specialized container, so it naturally delegates this.

The point of having both directions available (in some form), is that if you write a bfloat16 and I write an arbitrary_precision_float, I can define the cast to and from your bfloat16. You don't have to know about my extension dtype even existing. Effectively, NumPy does this using a simple dictionary:

MyExtensionDType_1._casts_to[MyExtensionDType_2](values)

Right now the API is a bit restrictive about adding new casts to the dictionary, but effectively that is what defining a cast means: Adding an ArrayMethod/CastingImpl to such an internal dictionary. I think a simple callable there will do for you, so long as you are in Python, you can always decide to extend it with methods (replace it with a CastingImpl class).
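
An illustrative sketch of that dictionary idea (the names are made up, not actual NumPy API):

class MyExtensionDType_2:
    pass

class MyExtensionDType_1:
    _casts_to = {}  # maps target dtype class -> callable performing the cast

def cast_1_to_2(values):
    ...  # the actual conversion from dtype 1 values to dtype 2

# the author of dtype 1 registers a cast to a dtype they do not own
MyExtensionDType_1._casts_to[MyExtensionDType_2] = cast_1_to_2

# lookup + execution, as in the one-liner above
result = MyExtensionDType_1._casts_to[MyExtensionDType_2]([1, 2, 3])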

jorisvandenbossche commented 2 years ago

I don't think you then need to have _cast_from() at all (or at least I don't see the benefit). For me, if I'm implementing an extension dtype, I don't want to think about how to convert from my type to all the other types that might be out there. But I'm willing to think about how to convert specific types to my extension dtypes.

@Dr-Irv I think it's important to be able to control the cast from your dtype to any other dtype as well, and not only to your dtype. Sebastian gave an example of doing a custom float dtype. And another example that I often use to think this through is the GeometryDtype from geopandas. I mentioned above the cast between geometry and string, and in this case as extension dtype author I want to be able to control both the string -> geometry and geometry -> string casts. While the StringDtype indeed can cast any other dtype to string ("In the case of StringDtype, because the elements of an EA should have __str__ implemented, the conversion is straightforward. Asking each EA to "know" about StringDtype seems like overkill.", as you mentioned a long time ago above, https://github.com/pandas-dev/pandas/issues/22384#issuecomment-578774638), in this specific case geopandas is much better suited to do the conversion to strings (we have a ufunc that does such a conversion, which is much faster than calling str(..) on each element).

(now, as long as the EA can override astype and add some logic to cast from itself to any dtype before calling the other's dtype _cast_to, then you of course don't necessarily "need" _cast_from, but the comment above is mostly about the idea that an ExtensionDtype wants to control this, regardless of whether we let them do it in _cast_from or astype itself)

But what complicates the "asking each EA to know about StringDtype" is that there is not a single StringDtype, but that it is parametrized (python or pyarrow storage). And the nice thing about the first (numpy-like) proposal is that it doesn't require the EA to know about every StringDtype parametrization, but allows to only know about one.

But, while I think my API choice is awesome ;), I am not sure which parts fit to pandas

@seberg I like your API as well :), but it's indeed not necessarily needed for pandas. In numpy it operates on a lower level, interacts with ufuncs, etc. And as you mention, we don't only have the dtype, but also the array container (EA subclass), which is already supposed to hold some of this logic (eg if casting is needed in an operation, such as arr1 + arr2, this is currently expected to be handled in your EA.__add__)

jbrockmendel commented 2 years ago

I think it's important to be able to control the cast from your dtype to any other dtype as well, and not only to your dtype

I think the existing pattern discussed above can do this with the exception of a corner case I'll describe below. I'm curious if anyone thinks this corner case is A Big Deal.

Suppose Sally is implementing FooDtype/FooArray and wants to specify the behavior of BarDtype/BarArray.astype (of which she is not the author). Suppose that BarArray.astype uses the pattern discussed above of the form:

def astype(self, dtype, copy):
    if self.has_special_logic_for_dtype(dtype):
        [...]
    else:
        return dtype.construct_array_type()._from_sequence(self, dtype=dtype, copy=copy)

Then in FooArray._from_sequence she can do:

def _from_sequence(cls, scalars, dtype, copy):
    if isinstance(dtype, BarDtype):
        return my_special_logic(...)
    [...]

So Sally can accomplish the two-way control using the existing pattern, except for the corner case where BarArray.has_special_logic_for_dtype already has special logic for converting to FooDtype. I guess this is relevant if Sally thinks BarArray.astype is specifically wrong? My impression is that the relevant cases are more about when BarArray.astype doesn't know anything about FooDtype. Am I missing something here?

Dr-Irv commented 2 years ago

(now, as long as the EA can override astype and add some logic to cast from itself to any dtype before calling the other's dtype _cast_to, then you of course don't necessarily "need" _cast_from, but the comment above is mostly about the idea that an ExtensionDtype wants to control this, regardless of whether we let them do it in _cast_from or astype itself)

yes, that makes sense. I now see that the symmetry of _cast_from() and _cast_to() makes sense.

Dr-Irv commented 2 years ago

My impression is that the relevant cases are more about when BarArray.astype doesn't know anything about FooDtype.

So if Sally implements FooDtype._cast_to(), then BarArray.astype(any_dtype) can just call any_dtype._cast_to() without having to even know that FooDtype exists. Right?

jbrockmendel commented 2 years ago

So if Sally implements FooDtype._cast_to(), then BarArray.astype(any_dtype) can just call any_dtype._cast_to() without having to even know that FooDtype exists. Right?

That's my understanding of _cast_to, yes. My point was that this behavior already exists with the existing pattern with _from_sequence taking the place of _cast_to.