Open mroeschke opened 1 year ago
Is the idea that making these configurable will help in bug hunting? Or more of an "anything that can be configured should be configurable"? Because the latter I'm wary of.
Personally, more to help with bug hunting, but I also think it's a better user experience if behavior doesn't change based on a silent heuristic. Additionally, I've been diving into slow tests recently, and a lot of the slow tests have to generate large data to trip and test the heuristic path.
There are several places where pandas has hidden heuristics/thresholds dictating certain behavior that is not immediately obvious or configurable to the user. IIRC, there have been bugs in rolling and to_datetime where buggy behavior was encountered when the data had a particular value or was a certain size, for example, which can be hard to diagnose.
Ideally we should:
CSV reading tokenizer chunksize https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/parsers.pyx#L119
CSV line buffer size https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/parsers.pyx#L587
Number of elements at which numexpr is used automatically https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/computation/expressions.py#L42
TDA iter chunk size processing https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/arrays/timedeltas.py#L387
Something pytables related https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/computation/pytables.py#L101 https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/pytables.py#L1887
Number of elements at which caching is used automatically in to_datetime https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/tools/datetimes.py#L124
Chunk size to use when writing csv https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/formats/csvs.py#L166
Number of regexes to store when time parsing https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/tslibs/strptime.pyx#L576
Rank tolerance https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/algos.pyx#L61
isin algo determination https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/algorithms.py#L521
Value formatting https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/formats/format.py#L1562
Number of elements to populate hash table https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/index.pyx#L99
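To make the discussion concrete, here is a minimal sketch of what exposing one such threshold could look like. It is not pandas code: the registry, the option name `example.cache_min_elements`, the default of 50, and the helpers `get_threshold`, `threshold_context`, and `should_cache` are all hypothetical and chosen only for illustration.

```python
from contextlib import contextmanager

# Hypothetical registry of thresholds. The option name and the default
# value are illustrative only; they are not real pandas options.
_THRESHOLDS = {"example.cache_min_elements": 50}


def get_threshold(name):
    """Look up the current value of a named threshold."""
    return _THRESHOLDS[name]


@contextmanager
def threshold_context(name, value):
    """Temporarily override a threshold, restoring the old value on exit."""
    old = _THRESHOLDS[name]
    _THRESHOLDS[name] = value
    try:
        yield
    finally:
        _THRESHOLDS[name] = old


def should_cache(values):
    """Return True when the input is large enough to take the caching path."""
    return len(values) >= get_threshold("example.cache_min_elements")


small = list(range(10))

# With the default threshold, small input takes the non-caching path.
print(should_cache(small))

# Lowering the threshold lets a test exercise the heuristic path
# without generating large data.
with threshold_context("example.cache_min_elements", 5):
    print(should_cache(small))
```

Under this kind of design, a test could lower the threshold to hit the heuristic path with a handful of elements, and a user chasing a size-dependent bug could pin the behavior to one path to see whether the heuristic is involved.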