Closed: jbrockmendel closed this issue 1 year ago.
Hey, I am new. Can you guide me on doing this, please? I want to work on this.
Welcome @harsimran44! We suggest that newcomers look for issues with the "good first issue" label.
Many tests have unreasonably large parametrizations
This is an issue in groupby as well, especially for some tests I've added or directed others to add. The issue is that there are so many combinations (e.g. calling .sum() directly vs going through agg, apply, or transform). A good example demonstrating this is tests/groupby/test_raises.py, which just makes sure we are raising correctly on bad dtypes. It has over 10k tests. Perhaps some of it can be whittled down, but I think the bulk of it cannot without losing coverage.
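As a rough illustration of how these counts blow up, here is a made-up skeleton (not the actual tests/groupby/test_raises.py; the axes and values are assumptions): three modest parametrize decorators already multiply out to 80 cases.

import pytest


# Hypothetical skeleton: 4 * 4 * 5 = 80 cases from three small axes; cross in
# grouping layouts, more ops, and more dtypes and the count reaches the
# thousands seen in the real file.
@pytest.mark.parametrize("how", ["agg", "apply", "transform", "method"])
@pytest.mark.parametrize("op", ["sum", "mean", "cumprod", "idxmax"])
@pytest.mark.parametrize(
    "dtype", ["int64", "float64", "object", "datetime64[ns]", "category"]
)
def test_raises_sketch(how, op, dtype):
    ...  # the real test builds a DataFrame of `dtype` and asserts the exact error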
Another example is tests/groupby/test_groupby_dropna.py::test_no_sort_keep_na. This has 5832 parametrizations. The original test in #46584 only did one sequence of values, and induced a pretty bad regression since the op failed if the NA value occurred in a different place. I added a parametrization that tested all combinations of NA/non-NA values of length 4 (81 combinations).
One thing I've been wondering is whether, in this case (and others like it), we should replace "all possible sequences" with a randomly chosen sequence or a small (5?) set of sequences. This would be better for performance, but could also introduce a flaky test.
With -n0, on my machine the test takes 10s.
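For concreteness, a small sketch of what those 81 combinations look like and of the "small seeded subset" alternative floated above. The candidate value pool here is an assumption for illustration, not taken from the actual test.

from itertools import product
import random

import numpy as np

# Assumed pool of three candidate values per position (3**4 == 81); the real
# parametrization crosses this with more axes, hence 5832 cases.
candidates = [1.0, 2.0, np.nan]
all_sequences = list(product(candidates, repeat=4))
assert len(all_sequences) == 81

# The proposed alternative: a small, seeded subset instead of all 81.
rng = random.Random(0)  # fixed seed keeps the selection stable across runs
subset = rng.sample(all_sequences, 5)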
I don't think having large parametrized tests is in itself a bad thing, but I suspect those tests may render other (older) tests redundant.
One thing I've been wondering is whether, in this case (and others like it), we should replace "all possible sequences" with a randomly chosen sequence or a small (5?) set of sequences.
IMO I dislike randomly generated data in tests because of the flakiness you mention; it makes PR review a pain. I would rather be thorough, or use hypothesis in these cases.
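To make the hypothesis option concrete, here is a minimal sketch, not a drop-in replacement for the existing parametrized test: the invariant checked (groupby sum agreeing with np.nansum) and the value ranges are chosen purely for illustration. hypothesis shrinks any failure to a minimal example, so review sees a concrete counterexample rather than a random seed.

from hypothesis import given, strategies as st
import numpy as np
import pandas as pd
import pandas._testing as tm


@given(
    values=st.lists(
        st.one_of(st.just(np.nan), st.floats(min_value=-100, max_value=100)),
        min_size=1,
        max_size=8,
    )
)
def test_groupby_sum_with_nas(values):
    # hypothesis draws float sequences with NaNs in arbitrary positions,
    # covering the placements the hand-written parametrization enumerates.
    df = pd.DataFrame({"key": [0] * len(values), "val": values})
    result = df.groupby("key", dropna=False)["val"].sum()
    expected = pd.Series(
        [np.nansum(values)], index=pd.Index([0], name="key"), name="val"
    )
    tm.assert_series_equal(result, expected)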
I'm of mixed mind on randomly generated test data, but trending in an "anti" direction.
For tz_(naive|aware)_fixture in particular, we could in principle use all of pytz.all_timezones (596 entries locally), presumably multiplied by 3 to include dateutil/zoneinfo counterparts, plus arbitrarily many fixed offsets. Clearly we're not going to add all of those to the fixtures. A few years back I toyed with the idea of using a randomized subset. At this point I could be convinced, but I don't see enough value here to pursue it myself.
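A sketch of what that randomized-subset idea could look like; the fixture name, seed, and sample size are hypothetical. Sampling with a fixed seed at collection time keeps the parametrization identical across runs and xdist workers.

import random

import pytest
import pytz

# Fixed seed => the same small sample every run; always keep a couple of
# "canonical" zones in the mix.
_rng = random.Random(1234)
_TZ_SAMPLE = sorted(set(_rng.sample(pytz.all_timezones, 10)) | {"UTC", "US/Eastern"})


@pytest.fixture(params=_TZ_SAMPLE)
def tz_sample_fixture(request):
    """A hypothetical, smaller stand-in for tz_naive_fixture."""
    return request.param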
One place where I've poked at generating randomized data is generating "any" Index (possibly for use in tests.indexes.test_any_index). I think there are a lot of corner cases we likely miss in the status quo (increasing, decreasing, non-monotonic, mixed-Xs-and-Ys, values near implementation bounds, int64 values near float64 rounding bounds, infs, ...). It's been a while since I looked at this. IIRC I tried implementing something with hypothesis and found that a) enumerating all the cases I wanted was tough, b) the runtime would be really rough, and c) there are relatively few tests where we want genuinely any index, as opposed to "any DatetimeIndex", "any DatetimeIndex with a freq", etc., and enumerating all of those any_foos would be annoying.
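A toy sketch of the kind of strategy this would require; even this only touches a few of the corner cases listed above (small ints, int64 values near the bounds, floats including infs, datetimes), which is exactly point (a): enumerating everything you actually want is the hard part.

from datetime import datetime

import numpy as np
import pandas as pd
from hypothesis import strategies as st

# Values deliberately near the int64 implementation bounds.
int64_near_bounds = st.sampled_from(
    [np.iinfo(np.int64).min + 1, -1, 0, 1, np.iinfo(np.int64).max - 1]
)

# A (very incomplete) "any index" strategy.
any_index = st.one_of(
    st.lists(st.integers(-5, 5), min_size=1, max_size=5).map(pd.Index),
    st.lists(int64_near_bounds, min_size=1, max_size=5).map(pd.Index),
    st.lists(st.floats(allow_nan=False), min_size=1, max_size=5).map(pd.Index),
    st.lists(
        st.datetimes(min_value=datetime(1700, 1, 1), max_value=datetime(2200, 1, 1)),
        min_size=1,
        max_size=5,
    ).map(pd.DatetimeIndex),
)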
Closing as "no action"
The runtime of the test suite has roughly doubled in the past year. A lot of this reflects new features (pyarrow dtypes, non-nanosecond dt64/td64) or just improved test coverage, so this isn't a bad thing in itself. But it is a pain point for my workflow, and I suspect it is not a coincidence that we see so many CI timeouts (xref #51409).
Some observations:
1) Just collecting the test suite has a footprint of 1-2GB. And IIUC test collection doesn't parallelize, so doing pytest -n=many doesn't go very far unless you've got loads of memory.
2) xfails are really expensive, since pytest renders the tracebacks even when it doesn't print them.
3) Using --durations=20000 shows many tests taking over .01 seconds, but when re-running subsets of those tests in isolation they are usually faster.
4) assert_frame_equal is slow. Profiling the pyarrow test_unstack tests takes 27s, of which 6.6s is in unstack and 7.7s is in assert_frame_equal (for very small frames AFAICT).
5) Test call overhead is hard to measure. When doing the test_unstack profiling the total runtime is 27s, but only 20.4s show up in the profile output; of that, 19.4s is inside test_unstack. So the overhead is either 5% or 18% depending on which denominator we use.
6) Many tests have unreasonably large parametrizations. test_dt64arr_add_sub_DateOffsets doesn't need 1056 variants. (Many of the worst offenders use tz_naive_fixture, so my bad!)

* Some reported timings are with --profile enabled, which can throw off the big-picture numbers. Take grains of salt where appropriate.
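A rough, self-contained sketch of the kind of measurement behind observations 4 and 5 (not the actual pyarrow test, and the frame here is made up): profile a tiny unstack plus assert_frame_equal round-trip and see where the cumulative time goes. Exact numbers will differ by machine and pandas version.

import cProfile
import pstats

import pandas as pd
import pandas._testing as tm

# Small MultiIndex frame so unstack and the comparison are both cheap per call.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": list("xyxy"), "c": [1.0, 2.0, 3.0, 4.0]}
).set_index(["a", "b"])


def roundtrip():
    result = df.unstack()
    expected = df.unstack()
    tm.assert_frame_equal(result, expected)


with cProfile.Profile() as prof:
    for _ in range(200):
        roundtrip()
# Cumulative times show how much of the total sits in unstack vs the assert.
pstats.Stats(prof).sort_stats("cumulative").print_stats(15)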
cc @mroeschke @phofl