rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.38k stars 894 forks source link

[ENH] Use `strict=True` argument to `zip` once py39 support is dropped #15835

Open wence- opened 5 months ago

wence- commented 5 months ago

In many places in the cudf code we zip two (or more) iterables together with the assumption/precondition that they are all of equal length. The pattern is (approximately):

names: list[str]
columns: list[Column]
data: dict[str, Column] = dict(zip(names, columns))

This has, by design of zip, a potential problem lurking in that if the two inputs are not of equal length, we only keep the first min(len(names), len(columns)) objects.

To avoid check this at user-input boundaries we need to be careful to check (and then raise if not) that the inputs we are zipping do have equal length. This is easy to forget.

Python 3.10 introduces a new keyword-argument to zip, strict, which can be used to assert that the inputs have the same length. We should consider using this across the code-base when no longer supporting Python 3.9.

mroeschke commented 5 months ago

+1. Also there's a ruff rule that would allow us to enforce this https://docs.astral.sh/ruff/rules/zip-without-explicit-strict/

vyasr commented 1 month ago

https://github.com/rapidsai/cudf/pull/16637 partially addressed this. There remains more work to do since that PR only changed this in the cudf-polars package, whereas we still need to make this change throughout the rest of the repo.