pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: Avoid Hidden numeric heuristics #53781

Open mroeschke opened 1 year ago

mroeschke commented 1 year ago

There are several places where pandas uses hidden heuristics or thresholds to dictate behavior that is neither obvious to the user nor configurable. IIRC, there have been bugs in rolling and to_datetime that only appeared when the data contained a particular value or was a certain size, which made them hard to diagnose.

Ideally we should:

  1. Not change behavior due to some data characteristic introspection
  2. At least expose an option that lets the user control the heuristic
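
As a rough illustration of point 2, here is a minimal sketch of a size-based heuristic whose threshold is an explicit argument instead of a buried constant. The names `convert_all`, `cache_threshold`, and `DEFAULT_CACHE_THRESHOLD` are hypothetical and are not pandas APIs; the default value is made up for the example.

```python
# Hypothetical default; illustrative only, not a value taken from pandas.
DEFAULT_CACHE_THRESHOLD = 50


def convert_all(values, convert, cache_threshold=DEFAULT_CACHE_THRESHOLD):
    """Apply `convert` to every value.

    Results are cached per unique value once the input has at least
    `cache_threshold` elements. Pass `cache_threshold=None` to disable
    caching, or `0` to always cache.
    """
    values = list(values)
    use_cache = cache_threshold is not None and len(values) >= cache_threshold
    if not use_cache:
        # Plain path: call `convert` once per element.
        return [convert(v) for v in values]

    # Cached path: call `convert` once per unique value.
    cache = {}
    out = []
    for v in values:
        if v not in cache:
            cache[v] = convert(v)
        out.append(cache[v])
    return out
```

With the threshold in the signature, a user who hits surprising behavior can force either path, for example `convert_all(data, f, cache_threshold=None)`, and compare the results directly.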

CSV reading tokenizer chunksize https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/parsers.pyx#L119

CSV line buffer size https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/parsers.pyx#L587

Number of elements at which numexpr is used automatically https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/computation/expressions.py#L42

Chunk size used when iterating a TimedeltaArray (TDA) https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/arrays/timedeltas.py#L387

Something pytables related https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/computation/pytables.py#L101 https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/pytables.py#L1887

Number of elements at which to_datetime automatically uses caching https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/tools/datetimes.py#L124

Chunk size to use when writing csv https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/formats/csvs.py#L166

Number of regexes to store when time parsing https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/tslibs/strptime.pyx#L576

Rank tolerance https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/algos.pyx#L61

isin algo determination https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/core/algorithms.py#L521

Value formatting https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/io/formats/format.py#L1562

Number of elements to populate hash table https://github.com/pandas-dev/pandas/blob/bb0403b25b1935a608b324a93a483bd22e6c43d3/pandas/_libs/index.pyx#L99

jbrockmendel commented 1 year ago

Is the idea that making these configurable will help in bug hunting? Or is it more of an "anything that can be configured should be configurable"? Because I'm wary of the latter.

mroeschke commented 1 year ago

Personally, more to help with bug hunting, but I also think it's a better user experience if behavior doesn't change based on a silent heuristic. Additionally, I've been diving into slow tests recently, and a lot of the slow tests have to generate large data to trip and test the heuristic path.
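
To make the slow-test point concrete, here is a minimal sketch of how a reachable threshold lets a test exercise the heuristic path with three elements instead of a million. `Settings`, `min_elements_for_fast_path`, and `total` are hypothetical names invented for this example and do not correspond to anything in pandas.

```python
from unittest import mock


class Settings:
    # Hypothetical threshold; the name and value are illustrative only.
    min_elements_for_fast_path = 1_000_000


def total(values):
    """Sum `values` and report which code path handled them."""
    if len(values) >= Settings.min_elements_for_fast_path:
        # Stand-in for an optimized implementation.
        return "fast", sum(values)
    # Stand-in for the default implementation.
    result = 0
    for v in values:
        result += v
    return "plain", result


def test_fast_path_matches_plain_path():
    data = [1, 2, 3]
    plain = total(data)
    # Lower the threshold temporarily so tiny data takes the fast path.
    with mock.patch.object(Settings, "min_elements_for_fast_path", 2):
        fast = total(data)
    assert plain == ("plain", 6)
    assert fast == ("fast", 6)
```

The test stays small and fast because it changes the threshold, not the data size, and the original threshold is restored when the `with` block exits.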