rapidsai / build-planning

Tracking for RAPIDS-wide build tasks
https://github.com/rapidsai
0 stars 4 forks source link

Add validation on wheel file characteristics in CI #110

Open jameslamb opened 6 hours ago

jameslamb commented 6 hours ago

Description

As we start to publish more RAPIDS wheels directly to https://pypi.org/, it'll be important to validate the following characteristics:

These and any other non-functional characteristics of wheels that can be validated easily, quickly, and programmatically should be validated in CI.

Benefits of this work

Acceptance Criteria

Approach

Such checks should run in wheel-building CI jobs, on PRs... they don't require GPUs, and should run so quickly that it isn't worth the spin-up time and cost of a dedicated wheel-validation CI job.

This could be accomplished with something like the following in wheel-build scripts, after the wheel file is finalized (e.g. has gone through auditwheel repair).

# check README formatting
twine check --strict dist/*

# assert size, and print a summary of contents
pydistcheck \
  --inspect \
  --select 'distro-too-large-compressed' \
  --max-allowed-size-compressed '100M' \
  ./delete-me/*
example of this using 24.12 libcudf-cu12 nightlies (click me) On an x86_64 linux system: ```shell mkdir -p ./delete-me pip download \ --extra-index-url https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ \ --no-deps \ -d ./delete-me \ 'libcudf-cu12==24.12.*,>=0.0.0a0' ``` Checking the formatting. ```shell twine check --strict ./delete-me/* ``` ```text Checking ./delete-me/libcudf_cu12-24.12.0a223-py3-none-manylinux_2_28_x86_64.whl: PASSED ``` And checking the size (using `5M` just to show how this would fail in CI if the wheels got too big): ```shell pydistcheck \ --inspect \ --select 'distro-too-large-compressed' \ --max-allowed-size-compressed '5M' \ ./delete-me/* ``` ```text ==================== running pydistcheck ==================== checking './delete-me/libcudf_cu12-24.12.0a223-py3-none-manylinux_2_28_x86_64.whl' ----- package inspection summary ----- file size * compressed size: 0.4G * uncompressed size: 0.7G * compression space saving: 35.2% contents * directories: 163 * files: 1974 (2 compiled) size by extension * .so - 0.6G (97.0%) * .h - 6.7M (1.0%) * no-extension - 5.3M (0.8%) * .cuh - 3.8M (0.6%) * .hpp - 2.2M (0.3%) * .a - 1.2M (0.2%) * .inl - 0.8M (0.1%) * .cmake - 0.1M (0.0%) * .md - 8.3K (0.0%) * .py - 3.6K (0.0%) * .pc - 0.2K (0.0%) * .txt - 34.0B (0.0%) largest files * (0.6G) libcudf/lib64/libcudf.so * (3.8M) libcudf/bin/flatc * (1.1M) libcudf/lib64/libflatbuffers.a * (0.5M) libcudf/include/libcudf/rapids/libcudacxx/cuda/std/__atomic/functions/cuda_ptx_generated.h * (0.2M) libcudf_cu12-24.12.0a223.dist-info/RECORD ------------ check results ----------- 1. [distro-too-large-compressed] Compressed size 0.4G is larger than the allowed size (5.0M). errors found while checking: 1 ==================== done running pydistcheck =============== ``` ```shell echo $? # 1 ```

This should be rolled out to all RAPIDS projects producing wheels, starting with those targeting publishing to https://pypi.org/.

Notes

Disclaimer

pydistcheck is something I (@jameslamb) created and am currently the sole contributor on, so I'm biased about it being the best tool for this purpose.

The narrow goal here, "assert files are not too big", could be accomplished other ways... like unzipping the wheel and using some mix of stat / du / find -size, or with a Python script.

I think the summaries pydistcheck generates (see "Approach" section) and all the others things we could ask it to check in the future (docs link) make it preferable to those approaches, but I shouldn't be the one to decide that.

bdice commented 5 hours ago

I took a brief look at pydistcheck and would support adopting it. It seems like a great tool for this job. It's also good to lean into your expertise in this area. 👍

vyasr commented 4 hours ago

Yup I'm happy with trying out pydistcheck. Maybe we should do a group review of it in a future build-infra team meeting to make sure we're all aligned. Adding a size check to CI is definitely critical so that we know when someone tries to add a feature that suddenly bloats binary size.