wence- opened 10 months ago
Another question that comes to mind is whether these same rules should be applied to binops (there's a different set of rules in pandas for merges compared to binops AFAICT).
Here's what I concretely propose for integer promotions:
```
+----+     +-----+     +-----+     +-----+
| i8 |---->| i16 |---->| i32 |---->| i64 |
+----+     +--^--+     +--^--+     +--^--+
              /           /           /
         /---'       /---'       /---'
    /---'       /---'       /---'
+--+-+     +---+-+     +---+-+     +-----+
| u8 |---->| u16 |---->| u32 |---->| u64 |
+----+     +-----+     +-----+     +-----+
```
For floating point (noting that we don't have float 16 right now):
```
+-----+     +-----+     +-----+
| f16 |---->| f32 |---->| f64 |
+-----+     +-----+     +-----+
```
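The two lattices above can be sketched as code (a toy illustration, not cudf's implementation; `lattice_join` is a hypothetical name): within a kind the join is simply the wider type, and a signed/unsigned pair joins at the narrowest signed type that can represent both, which does not exist once `uint64` is involved.

```python
import numpy as np

def lattice_join(a, b) -> np.dtype:
    """Least upper bound in the promotion lattices drawn above."""
    a, b = np.dtype(a), np.dtype(b)
    if a.kind == b.kind:
        # Same kind: just take the wider type.
        return a if a.itemsize >= b.itemsize else b
    if {a.kind, b.kind} == {"i", "u"}:
        # Mixed sign: follow the diagonal arrows, i.e. find a signed type
        # strictly wider than the unsigned one (and at least as wide as
        # the signed one).
        signed, unsigned = (a, b) if a.kind == "i" else (b, a)
        width = max(signed.itemsize, 2 * unsigned.itemsize)  # bytes
        if width > 8:
            raise TypeError(f"no safe common type for {a} and {b}")
        return np.dtype(f"i{width}")
    # Integer/float pairs are deliberately not handled here; see below.
    raise TypeError(f"no safe common type for {a} and {b}")
```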
The thorny one is unifying these two. If you send (for example) the pair `(i16, f32)` to `f32`, and the pair `(u16, f32)` to `f32`, which is safe since you lose no data, then that breaks the lattice property of the promotion scheme, with the consequence that type promotion is no longer associative: `((i16, u16), f32) -> (i32, f32) -> f64`, but `(i16, (u16, f32)) -> (i16, f32) -> f32`.
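This non-associativity is easy to reproduce with NumPy's promotion rules, which do admit the lossy integer-to-float promotions:

```python
import numpy as np

# Grouping ((i16, u16), f32): i16 and u16 unify to i32, and i32 with f32
# needs f64 to avoid losing integer precision.
left = np.result_type(np.result_type(np.int16, np.uint16), np.float32)

# Grouping (i16, (u16, f32)): u16 with f32 gives f32, and i16 with f32
# stays f32.
right = np.result_type(np.int16, np.result_type(np.uint16, np.float32))

print(left, right)  # float64 float32
```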
JAX solves this problem by letting a float of any width win over an integer of any width, so that `(i64, f16) -> f16`. But I don't think that's a reasonable solution for cudf.
We can remove the lattice property and have integer to float conversions where they are lossless. The issue is that a sequence of merges might or might not raise depending on their order (and join reordering is a common optimisation).
Consequently, having thought harder, I am inclined to raise for joins between integral and float keys, and promote safely using the lattices above for same-kind keys.
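A sketch of that policy (`unify_merge_keys` is a hypothetical name, not cudf's implementation): raise on mixed integer/float keys, promote same-kind and signed/unsigned pairs through `np.promote_types`, and refuse the one integer pairing that has no safe integer answer.

```python
import numpy as np

def unify_merge_keys(a, b) -> np.dtype:
    """Return a common dtype for a pair of merge keys, or raise."""
    a, b = np.dtype(a), np.dtype(b)
    integral = {"i", "u"}
    # Mixed integer/float keys: raise rather than promote lossily.
    if (a.kind in integral) != (b.kind in integral):
        raise TypeError(f"refusing to merge integral and float keys: {a}, {b}")
    result = np.promote_types(a, b)
    # (int64, uint64) promotes to float64 in NumPy because there is no
    # int128; under this policy that pair raises instead.
    if result.kind == "f" and a.kind in integral:
        raise TypeError(f"no safe common integer type for {a} and {b}")
    return result
```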
Just for comparison, pandas essentially dispatches to `np.result_type` to determine promotion. (This logic generally does not apply when an element is being set into an existing typed column.)
```python
In [1]: import numpy as np

In [2]: types = [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64, np.float32, np.float64]

In [3]: np.__version__
Out[3]: '1.26.0'

In [4]: import itertools

In [5]: for typ1, typ2 in itertools.combinations(types, 2):
   ...:     print(f"{np.dtype(typ1)} + {np.dtype(typ2)} -> {np.result_type(typ1, typ2)}")
   ...:
int8 + int16 -> int16
int8 + int32 -> int32
int8 + int64 -> int64
int8 + uint8 -> int16
int8 + uint16 -> int32
int8 + uint32 -> int64
int8 + uint64 -> float64
int8 + float32 -> float32
int8 + float64 -> float64
int16 + int32 -> int32
int16 + int64 -> int64
int16 + uint8 -> int16
int16 + uint16 -> int32
int16 + uint32 -> int64
int16 + uint64 -> float64
int16 + float32 -> float32
int16 + float64 -> float64
int32 + int64 -> int64
int32 + uint8 -> int32
int32 + uint16 -> int32
int32 + uint32 -> int64
int32 + uint64 -> float64
int32 + float32 -> float64
int32 + float64 -> float64
int64 + uint8 -> int64
int64 + uint16 -> int64
int64 + uint32 -> int64
int64 + uint64 -> float64
int64 + float32 -> float64
int64 + float64 -> float64
uint8 + uint16 -> uint16
uint8 + uint32 -> uint32
uint8 + uint64 -> uint64
uint8 + float32 -> float32
uint8 + float64 -> float64
uint16 + uint32 -> uint32
uint16 + uint64 -> uint64
uint16 + float32 -> float32
uint16 + float64 -> float64
uint32 + uint64 -> uint64
uint32 + float32 -> float64
uint32 + float64 -> float64
uint64 + float32 -> float64
uint64 + float64 -> float64
float32 + float64 -> float64
```
**Is your feature request related to a problem? Please describe.**
To date, cudf has attempted to match pandas semantics when matching join keys in a merge. libcudf does not perform merges between mismatching table dtypes. Consequently, the first step of a merge in cudf is to determine a "common" dtype for each pair of columns used as keys in the merge.
The pandas rules are mostly (though not completely, since some work also happens under the table inside the join algorithm itself) encoded in https://github.com/pandas-dev/pandas/blob/f7c73a5f1aaf0724598e60c0cc5732604ec842a8/pandas/core/reshape/merge.py#L1340
There are a few problems when trying to match these in cudf (for one, pandas has dtypes that cudf does not, an `object` column for example). Moreover, there are other correctness problems: the current type promotion rules admit lossy conversions that can result in false-positive matches in merges.
Example:
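An illustrative sketch of the failure mode (assuming the merge unifies `int64` keys by casting them to `float64`): two distinct integer keys above 2**53 collapse to the same float, so they would spuriously match.

```python
import numpy as np

keys = np.array([2**53, 2**53 + 1], dtype=np.int64)
as_float = keys.astype(np.float64)  # the lossy unification step

print(keys[0] == keys[1])          # False: the integer keys differ
print(as_float[0] == as_float[1])  # True: after casting they collide
```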
Pandas is also susceptible to this, but produces a different wrong result.
I would like to tighten up the rules in cudf, so that it is impossible for the user to get a "surprising" result without some explicit intervention on their part. We would also try to match pandas more closely where that is possible, but my preference is to be correct in a subset of cases over dubiously correct in a larger set.
**Describe the solution you'd like**
There are, I think, three levels of things we could do:
I would like to go for (2). (1) is easiest; (3) is difficult, probably a moving target and can result in false positives without the user explicitly "requesting" them.
With cudf-pandas, (2), I think, skates the line between ease of use and correctness reasonably well. We can run as much on the GPU as possible and raise (possibly providing a warning in pandas-compat mode) with fallback to CPU. When using cudf directly, users will hopefully be willing to accept a few more edge cases in the name of consistency.
Concretely this would mean:
For numeric types, that means that we would only promote pairs of types where there exists a wider type whose values are uniquely and unambiguously mapped onto from the narrower types.
For example, `(int32, uint32) -> int64` would be allowed, but merging a pair `(int32, uint64)` would raise (since there is no signed 128-bit int that we could use). Similarly, we would safely be able to promote `(intX, floatY)` pairs (and similarly with `uintX`) as long as the integer type is 32 or fewer bits wide[^5].

[^1]: I could also be convinced to unwrap and go round again, but that would lose information about the categorical nature of the inputs
[^2]: Pandas behaviour in this case depends on whether the left or right key is categorical (and which merge type it is): it casts the non-categorical to object, and the categorical to its underlying dtype, then imperfectly goes through its matching process again
[^3]: I haven't looked at what pandas does here, but I guess the other thing one could do is promote when one can losslessly convert
[^4]: See, for example, https://jax.readthedocs.io/en/latest/jep/9407-type-promotion.html though I disagree with their approach of selecting a "weak" float64 as the least upper bound for `(int64, uint64)`
[^5]: Merging between float and int columns is kind of weird, so I could also be convinced to raise when merging between mismatching numeric kinds.