rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Tighten up promotion when merging with non-equal key column dtypes #14594

Open wence- opened 10 months ago

wence- commented 10 months ago

Is your feature request related to a problem? Please describe.

To date, cudf has attempted to match pandas semantics when matching join keys in a merge. libcudf does not perform merges between mismatching table dtypes. Consequently, the first step of a merge in cudf is to determine a "common" dtype for each pair of columns used as keys in the merge.

The pandas rules are mostly (though not completely since there is some under the table work that happens in the join algorithm) encoded in https://github.com/pandas-dev/pandas/blob/f7c73a5f1aaf0724598e60c0cc5732604ec842a8/pandas/core/reshape/merge.py#L1340

There are a few problems when trying to match these rules in cudf. Moreover, there are correctness problems: the current type promotion rules admit lossy conversions that can produce false-positive matches in merges.

Example:

left = cudf.DataFrame({"key": [1, 2, 2**53]})
right = cudf.DataFrame({"key": [2**53 + 1, 10]})
right["key"] = right.key.astype("uint64")
left.merge(right, on="key", how="inner")
#            key
# 0  9.007199e+15
left
#                key
# 0                 1
# 1                 2
# 2  9007199254740992
right
#                key
# 0  9007199254740993
# 1                10

Pandas is also susceptible to this, but produces a different wrong result.
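The root cause of the false positive above is that float64 has only a 53-bit significand, so distinct 64-bit integer keys can round to the same float. A quick check (plain NumPy, no cudf required):

```python
import numpy as np

# float64 cannot distinguish 2**53 from 2**53 + 1: both round to the same
# value, which is why the promoted keys in the merge above spuriously match.
print(np.float64(2**53) == np.float64(2**53 + 1))  # True
```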

I would like to tighten up the rules in cudf so that it is impossible for the user to get a "surprising" result without some explicit intervention on their part. We would also try to match pandas more closely where that is possible, but my preference is to be correct in a subset of cases rather than dubiously correct in a larger set.

Describe the solution you'd like

There are, I think, three levels of things we could do:

  1. Push the burden of dtype matching completely on to the user: complain (raise) if merge keys do not match dtypes exactly
  2. Promote keys whose dtypes allow so safely (without needing to inspect values), and raise for cases where that is not possible. The user can still perform the merge by intentionally casting to matching types. But then they must know that it is safe.
  3. Try and match pandas promotions as closely as possible and accept that there might be false positives.

I would like to go for (2). (1) is easiest; (3) is difficult, probably a moving target and can result in false positives without the user explicitly "requesting" them.

With cudf.pandas, I think (2) skates the line between ease of use and correctness reasonably well: we run as much on the GPU as possible and raise where promotion is unsafe (possibly providing a warning in pandas-compat mode), with fallback to CPU. When using cudf directly, users will hopefully be willing to accept a few more edge cases in the name of consistency.

Concretely, for numeric types this would mean that we only promote a pair of types when there exists a wider type onto which the values of both narrower types map uniquely and unambiguously.

For example, (int32, uint32) -> int64 would be allowed, but merging a pair (int32, uint64) would raise (since there is no signed 128-bit int that we could use). Similarly, we would safely be able to promote (intX, floatY) pairs (and likewise with uintX) as long as the integer type is 32 or fewer bits wide[^5].
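The 32-bit cutoff follows from float64's 53-bit significand: every 32-bit integer is exactly representable, while 64-bit integers are not (see the 2**53 example above). A quick sanity check:

```python
import numpy as np

# Every int32/uint32 value is exactly representable in float64, so
# promoting (int32, float64) or (uint32, float64) loses no key values.
for info in (np.iinfo(np.int32), np.iinfo(np.uint32)):
    assert int(np.float64(info.min)) == info.min
    assert int(np.float64(info.max)) == info.max

# The same does not hold for 64-bit integers: int64's max rounds up.
assert int(np.float64(np.iinfo(np.int64).max)) != np.iinfo(np.int64).max
```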

[^1]: I could also be convinced to unwrap and go round again, but that would lose information about the categorical nature of the inputs.
[^2]: Pandas behaviour in this case depends on whether the left or right key is categorical (and which merge type it is): it casts the non-categorical to object, and the categorical to its underlying dtype, then imperfectly goes through its matching process again.
[^3]: I haven't looked at what pandas does here, but I guess the other thing one could do is promote when one can losslessly convert.
[^4]: See, for example, https://jax.readthedocs.io/en/latest/jep/9407-type-promotion.html, though I disagree with their approach of selecting a "weak" float64 as the least upper bound for (int64, uint64).
[^5]: Merging between float and int columns is kind of weird, so I could also be convinced to raise when merging between mismatching numeric kinds.

wence- commented 10 months ago

Another question that comes to mind is whether these same rules should be applied to binops (there's a different set of rules in pandas for merges compared to binops AFAICT).

wence- commented 10 months ago

Here's what I concretely propose for integer promotions:

      +----+    +-----+    +-----+    +-----+
      | i8 |--->| i16 |--->| i32 |--->| i64 |
      +----+    +--^--+    +--^--+    +--^--+
                  /          /          /
              /---'      /---'      /---'
          /---'      /---'      /---'        
      +--+-+    +---+-+    +---+-+    +-----+
      | u8 |--->| u16 |--->| u32 |--->| u64 |
      +----+    +-----+    +-----+    +-----+

For floating point (noting that we don't have float 16 right now):

      +-----+    +-----+    +-----+
      | f16 |--->| f32 |--->| f64 |
      +-----+    +-----+    +-----+

The thorny one is unifying these two. If you send, for example, the pair (i16, f32) to f32 and the pair (u16, f32) to f32 (which is safe, since no data is lost), then that breaks the lattice property of the promotion scheme, with the consequence that type promotion is no longer associative: ((i16, u16), f32) -> (i32, f32) -> f64, but (i16, (u16, f32)) -> (i16, f32) -> f32.
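NumPy's own promotion rules exhibit exactly this non-associativity, so it can be demonstrated with np.result_type:

```python
import numpy as np

# Associating left: (i16, u16) -> i32, then (i32, f32) -> f64.
lhs = np.result_type(np.result_type(np.int16, np.uint16), np.float32)

# Associating right: (u16, f32) -> f32, then (i16, f32) -> f32.
rhs = np.result_type(np.int16, np.result_type(np.uint16, np.float32))

print(lhs, rhs)  # float64 float32
```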

JAX solves this problem by letting a float of any width win over an integer of any width, so that (i64, f16) -> f16. But I don't think that's a reasonable solution for cudf.

We can give up the lattice property and allow integer-to-float conversions where they are lossless. The issue is that a sequence of merges might or might not raise depending on their order (and join reordering is a common optimisation).

Consequently, having thought harder, I am inclined to raise for joins between integral and float keys, and promote safely using the lattices above for same-kind keys.
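To make the proposal concrete, here is a hypothetical sketch of the promotion rule just described (the function name and structure are mine, not cudf API): widen within a kind, send mixed int/uint pairs to the wider signed type from the lattice above when one exists, and raise for everything else, including int/float pairs.

```python
import numpy as np

def safe_common_dtype(a, b):
    """Hypothetical sketch of the proposed rule: return a lossless common
    dtype for a pair of merge-key dtypes, or raise TypeError."""
    a, b = np.dtype(a), np.dtype(b)
    if a == b:
        return a
    kinds = {a.kind, b.kind}
    # Same kind (both signed, both unsigned, or both float): widen.
    if len(kinds) == 1 and a.kind in "iuf":
        return max(a, b, key=lambda d: d.itemsize)
    # Mixed signed/unsigned: an N-byte unsigned needs a 2N-byte signed
    # type to be lossless, so the result must be at least that wide.
    if kinds == {"i", "u"}:
        i = a if a.kind == "i" else b
        u = b if i is a else a
        size = max(i.itemsize, 2 * u.itemsize)
        if size > 8:  # would need a signed 128-bit int, which doesn't exist
            raise TypeError(f"no safe common dtype for {a} and {b}")
        return np.dtype(f"i{size}")
    # Everything else (including int/float pairs): require an explicit cast.
    raise TypeError(f"refusing to merge {a} and {b} keys without a cast")

print(safe_common_dtype("int32", "uint32"))  # int64
print(safe_common_dtype("int32", "uint16"))  # int32
```

Under this sketch, (int64, uint64) and (int32, float32) both raise, which matches the "raise for mixed-kind keys" decision above at the cost of making users cast explicitly.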

mroeschke commented 10 months ago

Just for comparison, pandas essentially dispatches to np.result_type to determine promotion. This logic generally does not apply when an element is trying to be set into an existing typed column.

In [1]: import numpy as np

In [2]: types = [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64, np.float32, np.float64]

In [3]: np.__version__
Out[3]: '1.26.0'

In [4]: import itertools

In [5]: for typ1, typ2 in itertools.combinations(types, 2):
    ...:     print(f"{np.dtype(typ1)} + {np.dtype(typ2)} -> {np.result_type(typ1, typ2)}")
    ...: 
int8 + int16 -> int16
int8 + int32 -> int32
int8 + int64 -> int64
int8 + uint8 -> int16
int8 + uint16 -> int32
int8 + uint32 -> int64
int8 + uint64 -> float64
int8 + float32 -> float32
int8 + float64 -> float64
int16 + int32 -> int32
int16 + int64 -> int64
int16 + uint8 -> int16
int16 + uint16 -> int32
int16 + uint32 -> int64
int16 + uint64 -> float64
int16 + float32 -> float32
int16 + float64 -> float64
int32 + int64 -> int64
int32 + uint8 -> int32
int32 + uint16 -> int32
int32 + uint32 -> int64
int32 + uint64 -> float64
int32 + float32 -> float64
int32 + float64 -> float64
int64 + uint8 -> int64
int64 + uint16 -> int64
int64 + uint32 -> int64
int64 + uint64 -> float64
int64 + float32 -> float64
int64 + float64 -> float64
uint8 + uint16 -> uint16
uint8 + uint32 -> uint32
uint8 + uint64 -> uint64
uint8 + float32 -> float32
uint8 + float64 -> float64
uint16 + uint32 -> uint32
uint16 + uint64 -> uint64
uint16 + float32 -> float32
uint16 + float64 -> float64
uint32 + uint64 -> uint64
uint32 + float32 -> float64
uint32 + float64 -> float64
uint64 + float32 -> float64
uint64 + float64 -> float64
float32 + float64 -> float64