shashankvemuri / Finance

150+ quantitative finance Python programs to help you gather, manipulate, and analyze stock market data
MIT License

Potential performance issue: .to_datetime in pandas below 2.1 #32

Open TendouArisu opened 9 months ago

Issue Description:

Hello. I have found a performance degradation in pd.to_datetime in pandas 2.0.3: the function does not recognize Arrow datetime dtypes and converts the data again. Some parts of this repository depend on pandas 2.0.3, and several files, such as stock_analysis/sp500_cot_sentiment_analysis.py and technical_indicators/candle_abs_returns.py, call the affected API. There may be more files that use it. I am not sure whether this pandas performance problem affects this repository in practice. Related discussions on the pandas GitHub include #52545 and #53301.

Reproducible Example in pandas

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: dr = pd.date_range("2019-12-31", periods=1_000_000, freq="s").astype(pd.ArrowDtype(pa.timestamp(unit="ns")))

In [4]: %timeit pd.to_datetime(dr)
1.84 s ± 8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Suggestion

I recommend upgrading to pandas >= 2.1, or exploring other ways to avoid the slowdown. Any other workarounds or solutions would be greatly appreciated. Thank you!
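
To see whether an environment falls in the range named in the issue title, a small check like the following might help. It is only a sketch: the function names are hypothetical, and the "below 2.1" cutoff is taken from the title of this issue rather than verified against every pandas release.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def is_affected(version_string):
    """Return True if the version is below 2.1 (cutoff assumed from the issue title)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (2, 1)


def installed_pandas_is_affected():
    """Check the installed pandas; return None when pandas is not installed."""
    try:
        return is_affected(version("pandas"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("pandas below 2.1:", installed_pandas_is_affected())
```

If the check reports True, raising the pandas pin in the project's requirements to 2.1 or later would be the upgrade path suggested above.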